www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.announce - Pegged, a Parsing Expression Grammar (PEG) generator in D

reply "Philippe Sigaud" <philippe.sigaud gmail.com> writes:
Hello,

I created a new Github project, Pegged, a Parsing Expression 
Grammar (PEG) generator in D.

https://github.com/PhilippeSigaud/Pegged

docs: https://github.com/PhilippeSigaud/Pegged/wiki

PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

The idea is to give the generator a PEG with the standard syntax. 
 From this grammar definition, a set of related parsers will be 
created, to be used at runtime or compile time.

Usage
-----

To use Pegged, just call the `grammar` function with a PEG and 
mix it in. For example:


import pegged.grammar;

mixin(grammar("
     Expr     <- Factor AddExpr*
     AddExpr  <- ('+'/'-') Factor
     Factor   <- Primary MulExpr*
     MulExpr  <- ('*'/'/') Primary
     Primary  <- Parens / Number / Variable / '-' Primary

     Parens   <- '(' Expr ')'
     Number   <~ [0-9]+
     Variable <- Identifier
"));



This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers 
for basic arithmetic expressions with operator precedence ('*' 
and '/' bind stronger than '+' or '-'). `Identifier` is a 
pre-defined parser recognizing your basic C-style identifier. 
Recursive or mutually recursive rules are OK (no left recursion 
for now).

To use a parser, use the `.parse` method. It will return a parse 
tree containing the calls to the different rules:

// Parsing at compile-time:
enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

pragma(msg, parseTree1.capture);
writeln(parseTree1);

// And at runtime too:
auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



Features
--------

* The complete set of PEG operators are implemented
* Pegged can parse its input at compile time and generate a 
complete parse tree at compile time. In a word: compile-time 
string (read: D code) transformation and generation.
* You can parse at runtime also, you lucky you.
* Use a standard and readable PEG syntax as a DSL, not a bunch of 
templates that hide the parser in noise.
* But you can use expression templates if you want, as parsers 
are all available as such. Pegged is implemented as an expression 
template, and what's good for the library writer is sure OK for 
the user too.
* Some useful additional operators are there too: a way to 
discard matches (thus dumping them from the parse tree), to push 
captures on a stack, to accept matches that are equal to another 
match
* Adding new parsers is easy.
* Grammars are composable: you can put different 
`mixin(grammar(rules));` in a module and then grammars and rules 
can refer to one another. That way, you can have utility grammars 
providing their functionalities to other grammars.
* That's why Pegged comes with some pre-defined grammars (JSON, 
etc).
* Grammars can be dumped in a file to create a D module.

More advanced features, outside the standard PEG perimeter are 
there to bring more power in the mix:

* Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. 
The previous rule defines a parametrized parser taking two other 
parsers (namely, `E` and `Sep`) to match a `Sep`-separated list 
of `E`'s.
* Named captures: any parser can be named with the `=` operator. 
The parse tree generated by the parser (so, also its matches) is 
delivered to the user in the output. Other parsers in the grammar 
see the named captures too.
* Semantic actions can be added to any rule in a grammar. Once a 
rule has matched, its associated action is called on the rule 
output and passed as final result to other parsers further up the 
grammar. Do what you want to the parse tree. If the passed 
actions are delegates, they can access external variables.


Philippe
Mar 10 2012
next sibling parent reply =?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:
On 11-03-2012 00:28, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax. From
 this grammar definition, a set of related parsers will be created, to be
 used at runtime or compile time.

 Usage
 -----

 To use Pegged, just call the `grammar` function with a PEG and mix it
 in. For example:


 import pegged.grammar;

 mixin(grammar("
 Expr <- Factor AddExpr*
 AddExpr <- ('+'/'-') Factor
 Factor <- Primary MulExpr*
 MulExpr <- ('*'/'/') Primary
 Primary <- Parens / Number / Variable / '-' Primary

 Parens <- '(' Expr ')'
 Number <~ [0-9]+
 Variable <- Identifier
 "));



 This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
 basic arithmetic expressions with operator precedence ('*' and '/' bind
 stronger than '+' or '-'). `Identifier` is a pre-defined parser
 recognizing your basic C-style identifier. Recursive or mutually
 recursive rules are OK (no left recursion for now).

 To use a parser, use the `.parse` method. It will return a parse tree
 containing the calls to the different rules:

 // Parsing at compile-time:
 enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

 pragma(msg, parseTree1.capture);
 writeln(parseTree1);

 // And at runtime too:
 auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
 assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



 Features
 --------

 * The complete set of PEG operators are implemented
 * Pegged can parse its input at compile time and generate a complete
 parse tree at compile time. In a word: compile-time string (read: D
 code) transformation and generation.
 * You can parse at runtime also, you lucky you.
 * Use a standard and readable PEG syntax as a DSL, not a bunch of
 templates that hide the parser in noise.
 * But you can use expression templates if you want, as parsers are all
 available as such. Pegged is implemented as an expression template, and
 what's good for the library writer is sure OK for the user too.
 * Some useful additional operators are there too: a way to discard
 matches (thus dumping them from the parse tree), to push captures on a
 stack, to accept matches that are equal to another match
 * Adding new parsers is easy.
 * Grammars are composable: you can put different
 `mixin(grammar(rules));` in a module and then grammars and rules can
 refer to one another. That way, you can have utility grammars providing
 their functionalities to other grammars.
 * That's why Pegged comes with some pre-defined grammars (JSON, etc).
 * Grammars can be dumped in a file to create a D module.

 More advanced features, outside the standard PEG perimeter are there to
 bring more power in the mix:

 * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
 previous rule defines a parametrized parser taking two other parsers
 (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
 * Named captures: any parser can be named with the `=` operator. The
 parse tree generated by the parser (so, also its matches) is delivered
 to the user in the output. Other parsers in the grammar see the named
 captures too.
 * Semantic actions can be added to any rule in a grammar. Once a rule
 has matched, its associated action is called on the rule output and
 passed as final result to other parsers further up the grammar. Do what
 you want to the parse tree. If the passed actions are delegates, they
 can access external variables.


 Philippe

Admittedly I have not heard of PEGs before, so I'm curious: Is this powerful enough to parse a language such as C? -- - Alex
Mar 10 2012
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrej Mitrovic:

 IOW comma on the left side. I know it's not a style preference but
 actually a (unfortunate but needed) technique for avoiding bugs. :)

To avoid that bug: http://d.puremagic.com/issues/show_bug.cgi?id=3827 Bye, bearophile
Mar 10 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/10/12 5:56 PM, Andrej Mitrovic wrote:
 I see you are not the only one who started writing string array
 literals like this:

 enum PEGCode = grammarCode!(
       "Grammar<- S Definition+ EOI"
      ,"Definition<- RuleName Arrow Expression"
      ,"RuleName<- Identifier>(ParamList?)"
      ,"Expression<- Sequence (OR Sequence)*"
 );

 IOW comma on the left side. I know it's not a style preference but
 actually a (unfortunate but needed) technique for avoiding bugs. :)

 So can you use this to actually parse D code, extract syntax trees and
 stuff like that? I'm clueless about parsing but it looks very neat.
 Nice work!

I, too, think this is very significant work! Suggestion for Philippe: instead of this: enum PEGCode = grammarCode!( "Grammar <- S Definition+ EOI" ,"Definition <- RuleName Arrow Expression" ,"RuleName <- Identifier>(ParamList?)" ,"Expression <- Sequence (OR Sequence)*" ); how about this: enum PEGCode = grammarCode!(" Grammar <- S Definition+ EOI; Definition <- RuleName Arrow Expression; RuleName <- Identifier>(ParamList?); Expression <- Sequence (OR Sequence)*; "); Splitting on ";" is trivial and makes client code considerably easier to play with. I'll get back to this. Andrei
Mar 10 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/11/12 1:35 AM, Philippe Sigaud wrote:
 On Sun, Mar 11, 2012 at 08:26, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org>  wrote:
 Splitting on ";" is trivial and makes client code considerably easier to
 play with.

It's already implemented! No need for ';'

Great! I think you'd be wise to add the ";", otherwise it's impossible to split complex rules on more than one line. Alternatively you could have the complement, a line continuation thingie. Andrei
Mar 10 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/11/12 1:22 AM, Philippe Sigaud wrote:
 On Sun, Mar 11, 2012 at 00:34, Alex Rønne Petersen<xtzgzorex gmail.com> 
wrote:

 Admittedly I have not heard of PEGs before, so I'm curious: Is this powerful
 enough to parse a language such as C?

I think so. But you'd have to do add some semantic action to deal with typedefs and macros. People parsed Java and Javascript with them. I personnally never used Pegged for more than JSON and custom formats, but I intend to try the D grammar.

Any chance you consider adding AST generator actions as discussed in the main forum a while ago? Andrei
Mar 10 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/11/12 1:38 AM, Philippe Sigaud wrote:
 On Sun, Mar 11, 2012 at 08:26, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org>  wrote:

 Any chance you consider adding AST generator actions as discussed in the
 main forum a while ago?

The AST is automatically produced, and there are already AST actions to simplify / guide its creation. There already is an 'ignore that node' syntax and a 'fuse the captures here'. Also, semantic actions are there, so I think the basis is there, I just need to add a few tree actions : cut children, take only some children. I'll add it today. But maybe you had something else in mind? I'll go and search for the thread you refer to.

I was thinking of ANTLR-style operators in which you say where the root should be and you get to drop unnecessary nodes. (Post on 2012/02/29 10:19AM GMT-0600.) Andrei
Mar 10 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/11/12 3:02 AM, Philippe Sigaud wrote:
 There is an operator to drop unnecessary nodes (':').

Good.
 Apart from that,
 a Pegged grammar is a self-contained entity: it automatically cuts
 nodes coming from other grammars, to simplify the tree (it keeps the
 matcheds substrings, of course). I plan to code different type of
 grammars, to play with the way the deal with nodes coming from outside
 rules and grammar composition.

Not getting this, but picture me with an intense look and nodding.
 As for the root, the PEG rule is that the first rule in the grammar is
 the root. But currently, I deliberately made my grammars unsealed, in
 that the user can call any rule inside the grammar to begin parsing:

 mixin(grammar("
      A<- B C
      B<- ...
      C<-
 "));

 A.parse("some input"); // standard way to do it, the parse tree will
 be rooted on an 'A' node.
 B.parse("some input"); // also possible, the parse tree will be rooted
 on a 'B' node.

That sounds cool, but how do you say that in the construct Expression <- Term "+" Term the "+" should become the root? Andrei
Mar 11 2012
prev sibling parent reply =?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:
On 11-03-2012 08:22, Philippe Sigaud wrote:
 On Sun, Mar 11, 2012 at 00:34, Alex Rønne Petersen<xtzgzorex gmail.com> 
wrote:

 Admittedly I have not heard of PEGs before, so I'm curious: Is this powerful
 enough to parse a language such as C?

I think so. But you'd have to do add some semantic action to deal with typedefs and macros.

Oh, I should have mentioned I only meant the actual language (ignoring the preprocessor). Why do you need semantic actions for typedefs though? Can't you defer resolution of types until after parsing?
 People parsed Java and Javascript with them. I personnally never used
 Pegged for more than JSON and custom formats, but I intend to try the
 D grammar.

-- - Alex
Mar 11 2012
parent reply =?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:
On 11-03-2012 18:06, Philippe Sigaud wrote:
  >> On Sun, Mar 11, 2012 at 00:34, Alex Rønne
 Petersen<xtzgzorex gmail.com <mailto:xtzgzorex gmail.com>>  wrote:

 [Parsing C?]
  >> I think so. But you'd have to do add some semantic action to deal with
  >> typedefs and macros.
  >
  >
  > Oh, I should have mentioned I only meant the actual language
 (ignoring the preprocessor).

 OK. I admit I downloaded the C spec online, but was a bit taken aback by
 the size of it. mot of it was the definition of the standard library,
 but still...

  > Why do you need semantic actions for typedefs though? Can't you defer
 resolution of types until after parsing?

 Yes, that the way I'd do it. But some people seem to want to do it while
 parsing. Maybe it blocks some parsing, if the parser encounter an
 identifier where there should be a type?

Hm, I don't *think* C has such ambiguities but I could well be wrong. In any case, if it can handle the non-ambiguous case, that's enough for me. :) -- - Alex
Mar 11 2012
parent =?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:
On 11-03-2012 18:19, Philippe Sigaud wrote:
  > Hm, I don't *think* C has such ambiguities but I could well be wrong.
 In any case, if it can handle the non-ambiguous case, that's enough for me.

 I wanted to tackle D this week, but I might as well begin with C :)

 Do you happen to have any handy and readable EBNF grammar for C? At
 least for D, I have dlang.org <http://dlang.org>.

Yep, see: http://ssw.jku.at/Coco/ (C.atg) Coco/R is more or less EBNF. -- - Alex
Mar 11 2012
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
I see you are not the only one who started writing string array
literals like this:

enum PEGCode = grammarCode!(
     "Grammar <- S Definition+ EOI"
    ,"Definition <- RuleName Arrow Expression"
    ,"RuleName   <- Identifier>(ParamList?)"
    ,"Expression <- Sequence (OR Sequence)*"
);

IOW comma on the left side. I know it's not a style preference but
actually a (unfortunate but needed) technique for avoiding bugs. :)

So can you use this to actually parse D code, extract syntax trees and
stuff like that? I'm clueless about parsing but it looks very neat.
Nice work!
Mar 10 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Mar 11, 2012 at 00:34, Alex R=C3=B8nne Petersen <xtzgzorex gmail.co=
m> wrote:

 Admittedly I have not heard of PEGs before, so I'm curious: Is this power=

 enough to parse a language such as C?

I think so. But you'd have to do add some semantic action to deal with typedefs and macros. People parsed Java and Javascript with them. I personnally never used Pegged for more than JSON and custom formats, but I intend to try the D grammar.
Mar 10 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Mar 11, 2012 at 00:56, Andrej Mitrovic
<andrej.mitrovich gmail.com> wrote:
 I see you are not the only one who started writing string array
 literals like this:

 enum PEGCode =3D grammarCode!(
 =C2=A0 =C2=A0 "Grammar <- S Definition+ EOI"
 =C2=A0 =C2=A0,"Definition <- RuleName Arrow Expression"
 =C2=A0 =C2=A0,"RuleName =C2=A0 <- Identifier>(ParamList?)"
 =C2=A0 =C2=A0,"Expression <- Sequence (OR Sequence)*"
 );

 IOW comma on the left side. I know it's not a style preference but
 actually a (unfortunate but needed) technique for avoiding bugs. :)

Yes, I use what I call "Haskell-comma" (I saw it first on Haskell code). But in the previous sample, that's old code. Now, that should only be; enum code =3D grammar(" Grammar <- S Definition+ EOI Definition <- RuleName Arrow Expression RuleName =C2=A0 <- Identifier>(ParamList?) Expression <- Sequence (OR Sequence)* "); I see I didn't update bootstrap.d, I'll do that.
 So can you use this to actually parse D code, extract syntax trees and
 stuff like that? I'm clueless about parsing but it looks very neat.

I can partially parse D code, because I only wrote part of the D grammar (see the pegged.examples.dgrammar module), just enough to parse most of template constraints. I intend to complete it and see if that floats. Yes, it extracts syntax trees, at compile-time or runtime.
 Nice work!

Thanks!
Mar 10 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, March 11, 2012 00:28:42 Philippe Sigaud wrote:
 Hello,
 
 I created a new Github project, Pegged, a Parsing Expression
 Grammar (PEG) generator in D.

Cool! PEGs aren't used all that much for language grammars - primarily because BNF and EBNF are more traditional I think - but they're definitely cool. And it's great to see a D implementation of a PEG parser. - Jonathan M Davis
Mar 10 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Mar 11, 2012 at 08:26, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 I, too, think this is very significant work! Suggestion for Philippe:
 instead of this:


 enum PEGCode =3D grammarCode!(
 =C2=A0 =C2=A0 "Grammar <- S Definition+ EOI"
 =C2=A0 =C2=A0,"Definition <- RuleName Arrow Expression"
 =C2=A0 =C2=A0,"RuleName =C2=A0 <- Identifier>(ParamList?)"
 =C2=A0 =C2=A0,"Expression <- Sequence (OR Sequence)*"
 );

 how about this:

 enum PEGCode =3D grammarCode!("
 =C2=A0 =C2=A0Grammar <- S Definition+ EOI;

 =C2=A0 =C2=A0Definition <- RuleName Arrow Expression;
 =C2=A0 =C2=A0RuleName =C2=A0 <- Identifier>(ParamList?);
 =C2=A0 =C2=A0Expression <- Sequence (OR Sequence)*;
 ");

 Splitting on ";" is trivial and makes client code considerably easier to
 play with.

It's already implemented! No need for ';' The docs are up to date, the internal code also. It's only an old bootstrapping file that Andrej cited that still contains multi-string definitions. I'll correct that, as that means I didn"t push dogfooding far enough. Now, just use one string: "Grammar <- ... Definition <- ... "
Mar 10 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Mar 11, 2012 at 08:26, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 Any chance you consider adding AST generator actions as discussed in the
 main forum a while ago?

The AST is automatically produced, and there are already AST actions to simplify / guide its creation. There already is an 'ignore that node' syntax and a 'fuse the captures here'. Also, semantic actions are there, so I think the basis is there, I just need to add a few tree actions : cut children, take only some children. I'll add it today. But maybe you had something else in mind? I'll go and search for the thread you refer to.
Mar 10 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Mar 11, 2012 at 08:30, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Sunday, March 11, 2012 00:28:42 Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression
 Grammar (PEG) generator in D.

Cool! PEGs aren't used all that much for language grammars - primarily because BNF and EBNF are more traditional I think - but they're definitely cool. And it's great to see a D implementation of a PEG parser.

Is there an EBNF grammar for D somewhere, I mean outside the dlang docs? I vaguely remember someone trying to write one.
Mar 10 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, March 11, 2012 08:39:47 Philippe Sigaud wrote:
 On Sun, Mar 11, 2012 at 08:30, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Sunday, March 11, 2012 00:28:42 Philippe Sigaud wrote:
 Hello,
 
 I created a new Github project, Pegged, a Parsing Expression
 Grammar (PEG) generator in D.

Cool! PEGs aren't used all that much for language grammars - primarily because BNF and EBNF are more traditional I think - but they're definitely cool. And it's great to see a D implementation of a PEG parser.

Is there an EBNF grammar for D somewhere, I mean outside the dlang docs? I vaguely remember someone trying to write one.

Someone may have been trying to write one, but I don't believe that there's an official one, and I recall someone complaining the the BNF grammar was inaccurate in a number of places, but that may have been fixed now, since Walter did a number of documentation fixes a few weeks back. - Jonathan M Davis
Mar 10 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Mar 11, 2012 at 08:46, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 Is there an EBNF grammar for D somewhere, I mean outside the dlang docs?


 Someone may have been trying to write one, but I don't believe that there's an
 official one, and I recall someone complaining the the BNF grammar was
 inaccurate in a number of places, but that may have been fixed now, since
 Walter did a number of documentation fixes a few weeks back.

Hmm. I'll post on D.main and see.
Mar 10 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Mar 11, 2012 at 08:47, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 Great! I think you'd be wise to add the ";", otherwise it's impossible to
 split complex rules on more than one line. Alternatively you could have the
 complement, a line continuation thingie.

Complex rules can be multi-line, I'll put more examples of this in the docs. Now that Pegged is bootstrapped (the grammar generator was generated by Pegged itself), it's easy to change the grammar. Anyway, the separating if done on the rule definitions (Identifier <- ...) A <- B C / C /D # // OK, cut between D and E. E <- New Def You can insert line comments with '#'
Mar 10 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Mar 11, 2012 at 08:53, Philippe Sigaud
<philippe.sigaud gmail.com> wrote:

 Anyway, the separating if done on the rule definitions =C2=A0(Identifier =

Aw, I meant, the separation is done on rule definitions. Damn children climbing on me to get more breakfast :-)
Mar 10 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Mar 11, 2012 at 08:51, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 I was thinking of ANTLR-style operators in which you say where the root
 should be and you get to drop unnecessary nodes.

 (Post on 2012/02/29 10:19AM GMT-0600.)

Get it, thanks for the ref! There is an operator to drop unnecessary nodes (':'). Apart from that, a Pegged grammar is a self-contained entity: it automatically cuts nodes coming from other grammars, to simplify the tree (it keeps the matcheds substrings, of course). I plan to code different type of grammars, to play with the way the deal with nodes coming from outside rules and grammar composition. As for the root, the PEG rule is that the first rule in the grammar is the root. But currently, I deliberately made my grammars unsealed, in that the user can call any rule inside the grammar to begin parsing: mixin(grammar(" A <- B C B <- ... C <- ")); A.parse("some input"); // standard way to do it, the parse tree will be rooted on an 'A' node. B.parse("some input"); // also possible, the parse tree will be rooted on a 'B' node. I'll add what I called closed and sealed grammars. I'll also add named grammars and such in the days to come. I can add a root note indication ("unless told otherwise, when asked to parse with grammar G, begin with rule R).
Mar 11 2012
prev sibling next sibling parent =?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:
On 11-03-2012 00:28, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax. From
 this grammar definition, a set of related parsers will be created, to be
 used at runtime or compile time.

 Usage
 -----

 To use Pegged, just call the `grammar` function with a PEG and mix it
 in. For example:


 import pegged.grammar;

 mixin(grammar("
 Expr <- Factor AddExpr*
 AddExpr <- ('+'/'-') Factor
 Factor <- Primary MulExpr*
 MulExpr <- ('*'/'/') Primary
 Primary <- Parens / Number / Variable / '-' Primary

 Parens <- '(' Expr ')'
 Number <~ [0-9]+
 Variable <- Identifier
 "));



 This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
 basic arithmetic expressions with operator precedence ('*' and '/' bind
 stronger than '+' or '-'). `Identifier` is a pre-defined parser
 recognizing your basic C-style identifier. Recursive or mutually
 recursive rules are OK (no left recursion for now).

 To use a parser, use the `.parse` method. It will return a parse tree
 containing the calls to the different rules:

 // Parsing at compile-time:
 enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

 pragma(msg, parseTree1.capture);
 writeln(parseTree1);

 // And at runtime too:
 auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
 assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



 Features
 --------

 * The complete set of PEG operators are implemented
 * Pegged can parse its input at compile time and generate a complete
 parse tree at compile time. In a word: compile-time string (read: D
 code) transformation and generation.
 * You can parse at runtime also, you lucky you.
 * Use a standard and readable PEG syntax as a DSL, not a bunch of
 templates that hide the parser in noise.
 * But you can use expression templates if you want, as parsers are all
 available as such. Pegged is implemented as an expression template, and
 what's good for the library writer is sure OK for the user too.
 * Some useful additional operators are there too: a way to discard
 matches (thus dumping them from the parse tree), to push captures on a
 stack, to accept matches that are equal to another match
 * Adding new parsers is easy.
 * Grammars are composable: you can put different
 `mixin(grammar(rules));` in a module and then grammars and rules can
 refer to one another. That way, you can have utility grammars providing
 their functionalities to other grammars.
 * That's why Pegged comes with some pre-defined grammars (JSON, etc).
 * Grammars can be dumped in a file to create a D module.

 More advanced features, outside the standard PEG perimeter are there to
 bring more power in the mix:

 * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
 previous rule defines a parametrized parser taking two other parsers
 (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
 * Named captures: any parser can be named with the `=` operator. The
 parse tree generated by the parser (so, also its matches) is delivered
 to the user in the output. Other parsers in the grammar see the named
 captures too.
 * Semantic actions can be added to any rule in a grammar. Once a rule
 has matched, its associated action is called on the rule output and
 passed as final result to other parsers further up the grammar. Do what
 you want to the parse tree. If the passed actions are delegates, they
 can access external variables.


 Philippe

Question: Are the generated parsers, AST nodes, etc classes or structs? -- - Alex
Mar 11 2012
prev sibling next sibling parent kraybourne <stdin kraybourne.com> writes:
On 3/11/12 24:28 , Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.


 Philippe

Very cool! Quick question, you mention the ability to opt-out of the space-insensitivity, where might one find this? Thanks!
Mar 11 2012
prev sibling next sibling parent reply =?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:
On 11-03-2012 00:28, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax. From
 this grammar definition, a set of related parsers will be created, to be
 used at runtime or compile time.

 Usage
 -----

 To use Pegged, just call the `grammar` function with a PEG and mix it
 in. For example:


 import pegged.grammar;

 mixin(grammar("
 Expr <- Factor AddExpr*
 AddExpr <- ('+'/'-') Factor
 Factor <- Primary MulExpr*
 MulExpr <- ('*'/'/') Primary
 Primary <- Parens / Number / Variable / '-' Primary

 Parens <- '(' Expr ')'
 Number <~ [0-9]+
 Variable <- Identifier
 "));



 This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
 basic arithmetic expressions with operator precedence ('*' and '/' bind
 stronger than '+' or '-'). `Identifier` is a pre-defined parser
 recognizing your basic C-style identifier. Recursive or mutually
 recursive rules are OK (no left recursion for now).

 To use a parser, use the `.parse` method. It will return a parse tree
 containing the calls to the different rules:

 // Parsing at compile-time:
 enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

 pragma(msg, parseTree1.capture);
 writeln(parseTree1);

 // And at runtime too:
 auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
 assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



 Features
 --------

 * The complete set of PEG operators are implemented
 * Pegged can parse its input at compile time and generate a complete
 parse tree at compile time. In a word: compile-time string (read: D
 code) transformation and generation.
 * You can parse at runtime also, you lucky you.
 * Use a standard and readable PEG syntax as a DSL, not a bunch of
 templates that hide the parser in noise.
 * But you can use expression templates if you want, as parsers are all
 available as such. Pegged is implemented as an expression template, and
 what's good for the library writer is sure OK for the user too.
 * Some useful additional operators are there too: a way to discard
 matches (thus dumping them from the parse tree), to push captures on a
 stack, to accept matches that are equal to another match
 * Adding new parsers is easy.
 * Grammars are composable: you can put different
 `mixin(grammar(rules));` in a module and then grammars and rules can
 refer to one another. That way, you can have utility grammars providing
 their functionalities to other grammars.
 * That's why Pegged comes with some pre-defined grammars (JSON, etc).
 * Grammars can be dumped in a file to create a D module.

 More advanced features, outside the standard PEG perimeter are there to
 bring more power in the mix:

 * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
 previous rule defines a parametrized parser taking two other parsers
 (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
 * Named captures: any parser can be named with the `=` operator. The
 parse tree generated by the parser (so, also its matches) is delivered
 to the user in the output. Other parsers in the grammar see the named
 captures too.
 * Semantic actions can be added to any rule in a grammar. Once a rule
 has matched, its associated action is called on the rule output and
 passed as final result to other parsers further up the grammar. Do what
 you want to the parse tree. If the passed actions are delegates, they
 can access external variables.


 Philippe

By the way, bootstrap.d seems to fail to build at the moment: ../pegged/utils/bootstrap.d(1433): found ':' when expecting ')' following template argument list ../pegged/utils/bootstrap.d(1433): members expected ../pegged/utils/bootstrap.d(1433): { } expected following aggregate declaration ../pegged/utils/bootstrap.d(1433): semicolon expected, not '!' ../pegged/utils/bootstrap.d(1433): Declaration expected, not '!' ../pegged/utils/bootstrap.d(1466): unrecognized declaration -- - Alex
Mar 11 2012
next sibling parent =?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:
On 11-03-2012 16:02, Alex Rønne Petersen wrote:
 On 11-03-2012 00:28, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax. From
 this grammar definition, a set of related parsers will be created, to be
 used at runtime or compile time.

 Usage
 -----

 To use Pegged, just call the `grammar` function with a PEG and mix it
 in. For example:


 import pegged.grammar;

 mixin(grammar("
 Expr <- Factor AddExpr*
 AddExpr <- ('+'/'-') Factor
 Factor <- Primary MulExpr*
 MulExpr <- ('*'/'/') Primary
 Primary <- Parens / Number / Variable / '-' Primary

 Parens <- '(' Expr ')'
 Number <~ [0-9]+
 Variable <- Identifier
 "));



 This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
 basic arithmetic expressions with operator precedence ('*' and '/' bind
 stronger than '+' or '-'). `Identifier` is a pre-defined parser
 recognizing your basic C-style identifier. Recursive or mutually
 recursive rules are OK (no left recursion for now).

 To use a parser, use the `.parse` method. It will return a parse tree
 containing the calls to the different rules:

 // Parsing at compile-time:
 enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

 pragma(msg, parseTree1.capture);
 writeln(parseTree1);

 // And at runtime too:
 auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
 assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



 Features
 --------

 * The complete set of PEG operators are implemented
 * Pegged can parse its input at compile time and generate a complete
 parse tree at compile time. In a word: compile-time string (read: D
 code) transformation and generation.
 * You can parse at runtime also, you lucky you.
 * Use a standard and readable PEG syntax as a DSL, not a bunch of
 templates that hide the parser in noise.
 * But you can use expression templates if you want, as parsers are all
 available as such. Pegged is implemented as an expression template, and
 what's good for the library writer is sure OK for the user too.
 * Some useful additional operators are there too: a way to discard
 matches (thus dumping them from the parse tree), to push captures on a
 stack, to accept matches that are equal to another match
 * Adding new parsers is easy.
 * Grammars are composable: you can put different
 `mixin(grammar(rules));` in a module and then grammars and rules can
 refer to one another. That way, you can have utility grammars providing
 their functionalities to other grammars.
 * That's why Pegged comes with some pre-defined grammars (JSON, etc).
 * Grammars can be dumped in a file to create a D module.

 More advanced features, outside the standard PEG perimeter are there to
 bring more power in the mix:

 * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
 previous rule defines a parametrized parser taking two other parsers
 (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
 * Named captures: any parser can be named with the `=` operator. The
 parse tree generated by the parser (so, also its matches) is delivered
 to the user in the output. Other parsers in the grammar see the named
 captures too.
 * Semantic actions can be added to any rule in a grammar. Once a rule
 has matched, its associated action is called on the rule output and
 passed as final result to other parsers further up the grammar. Do what
 you want to the parse tree. If the passed actions are delegates, they
 can access external variables.


 Philippe

By the way, bootstrap.d seems to fail to build at the moment: .../pegged/utils/bootstrap.d(1433): found ':' when expecting ')' following template argument list .../pegged/utils/bootstrap.d(1433): members expected .../pegged/utils/bootstrap.d(1433): { } expected following aggregate declaration .../pegged/utils/bootstrap.d(1433): semicolon expected, not '!' .../pegged/utils/bootstrap.d(1433): Declaration expected, not '!' .../pegged/utils/bootstrap.d(1466): unrecognized declaration

Also, I have sent a pull request to fix the build on 64-bit: https://github.com/PhilippeSigaud/Pegged/pull/1 -- - Alex
Mar 11 2012
prev sibling parent =?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:
On 11-03-2012 18:17, Philippe Sigaud wrote:
  > By the way, bootstrap.d seems to fail to build at the moment:
  >
  > ../pegged/utils/bootstrap.d(1433): found ':' when expecting ')'
 following template argument list
  > ../pegged/utils/bootstrap.d(1433): members expected
  > ../pegged/utils/bootstrap.d(1433): { } expected following aggregate
 declaration
  > ../pegged/utils/bootstrap.d(1433): semicolon expected, not '!'
  > ../pegged/utils/bootstrap.d(1433): Declaration expected, not '!'
  > ../pegged/utils/bootstrap.d(1466): unrecognized declaration

 Hmm, it compiled for me a few hours ago. I'll see if I broke something
 while pushing.

 I'll also try to make the whole grammar-modification process easier.
 Since users can modify Pegged own grammar, I might as well make that
 fluid and easy to do.

 I'll put the Pegged grammar as a string in a separate module and create
 a function that does the rest: modify the string, it will recompile the
 entire grammar for you.

Is bootstrap.d currently essential to actually use Pegged? -- - Alex
Mar 11 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
--00248c768f42e7649304bafaa5ee
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

 On Sun, Mar 11, 2012 at 00:34, Alex R=C3=B8nne Petersen<xtzgzorex gmail.=


wrote: [Parsing C?]
 I think so. But you'd have to do add some semantic action to deal with
 typedefs and macros.

Oh, I should have mentioned I only meant the actual language (ignoring

OK. I admit I downloaded the C spec online, but was a bit taken aback by the size of it. mot of it was the definition of the standard library, but still...
 Why do you need semantic actions for typedefs though? Can't you defer

Yes, that the way I'd do it. But some people seem to want to do it while parsing. Maybe it blocks some parsing, if the parser encounter an identifier where there should be a type? --00248c768f42e7649304bafaa5ee Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <p>&gt;&gt; On Sun, Mar 11, 2012 at 00:34, Alex R=C3=B8nne Petersen&lt;<a h= ref=3D"mailto:xtzgzorex gmail.com">xtzgzorex gmail.com</a>&gt; =C2=A0wrote:= </p> <p>[Parsing C?]<br> &gt;&gt; I think so. But you&#39;d have to do add some semantic action to d= eal with<br> &gt;&gt; typedefs and macros.<br> &gt;<br> &gt;<br> &gt; Oh, I should have mentioned I only meant the actual language (ignoring= the preprocessor).</p> <p>OK. I admit I downloaded the C spec online, but was a bit taken aback by= the size of it. mot of it was the definition of the standard library, but = still...<br></p> <p>&gt; Why do you need semantic actions for typedefs though? Can&#39;t you= defer resolution of types until after parsing?</p> <p>Yes, that the way I&#39;d do it. But some people seem to want to do it w= hile parsing. Maybe it blocks some parsing, if the parser encounter an iden= tifier where there should be a type?<br></p> --00248c768f42e7649304bafaa5ee--
Mar 11 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
--20cf300fa95388104004bafab109
Content-Type: text/plain; charset=UTF-8

alex:
 Question: Are the generated parsers, AST nodes, etc classes or structs?

They are structs. See: https://github.com/PhilippeSigaud/Pegged/wiki/Parse-Trees --20cf300fa95388104004bafab109 Content-Type: text/html; charset=UTF-8 <p>alex:<br> &gt; Question: Are the generated parsers, AST nodes, etc classes or structs?</p> <p>They are structs. See:</p> <p><a href="https://github.com/PhilippeSigaud/Pegged/wiki/Parse-Trees">https://github.com/PhilippeSigaud/Pegged/wiki/Parse-Trees</a></p> --20cf300fa95388104004bafab109--
Mar 11 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
--20cf3074bae46cbb3004bafabe04
Content-Type: text/plain; charset=UTF-8

 Quick question, you mention the ability to opt-out of the

Yes, undocumented. Use the '>' operator. You know, I introduced space-insensitivity recently, to simplify some rules and it keeps biting me back. For example Line <- (!EOL .)* EOL The catch is, the (!EOL .) sequence accepts spaces (so, line terminators) between the !EOL and the . Crap. So, I keep writing Line <- (!EOL > .)* EOL And I'm more and more convinced that ws a bbad move on my part. Or, at least, give the user a way to opt-out for an entire rule. --20cf3074bae46cbb3004bafabe04 Content-Type: text/html; charset=UTF-8 <p><br> &gt; Quick question, you mention the ability to opt-out of the space-insensitivity, where might one find this?<br></p> <p>Yes, undocumented. Use the &#39;&gt;&#39; operator.</p> <p>You know, I introduced space-insensitivity recently, to simplify some rules and it keeps biting me back.</p> <p>For example</p> <p>Line &lt;- (!EOL .)* EOL<br></p> <p>The catch is, the (!EOL .) sequence accepts spaces (so, line terminators) between the !EOL and the .</p> <p>Crap.</p> <p>So, I keep writing</p> <p>Line &lt;- (!EOL &gt; .)* EOL</p> <p>And I&#39;m more and more convinced that ws a bbad move on my part. Or, at least, give the user a way to opt-out for an entire rule.</p> <p> </p> --20cf3074bae46cbb3004bafabe04--
Mar 11 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
--20cf300fb055831d1e04bafac14b
Content-Type: text/plain; charset=UTF-8

 Also, I have sent a pull request to fix the build on 64-bit:

Merged, thanks! --20cf300fb055831d1e04bafac14b Content-Type: text/html; charset=UTF-8 <p>&gt; Also, I have sent a pull request to fix the build on 64-bit: <a href="https://github.com/PhilippeSigaud/Pegged/pull/1">https://github.com/PhilippeSigaud/Pegged/pull/1</a><br></p> <p>Merged, thanks!</p> --20cf300fb055831d1e04bafac14b--
Mar 11 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
--bcaec51a8cc0b1b10404bafacc46
Content-Type: text/plain; charset=UTF-8

 By the way, bootstrap.d seems to fail to build at the moment:

 ../pegged/utils/bootstrap.d(1433): found ':' when expecting ')' following

 ../pegged/utils/bootstrap.d(1433): members expected
 ../pegged/utils/bootstrap.d(1433): { } expected following aggregate

 ../pegged/utils/bootstrap.d(1433): semicolon expected, not '!'
 ../pegged/utils/bootstrap.d(1433): Declaration expected, not '!'
 ../pegged/utils/bootstrap.d(1466): unrecognized declaration

Hmm, it compiled for me a few hours ago. I'll see if I broke something while pushing. I'll also try to make the whole grammar-modification process easier. Since users can modify Pegged own grammar, I might as well make that fluid and easy to do. I'll put the Pegged grammar as a string in a separate module and create a function that does the rest: modify the string, it will recompile the entire grammar for you. --bcaec51a8cc0b1b10404bafacc46 Content-Type: text/html; charset=UTF-8 <p>&gt; By the way, bootstrap.d seems to fail to build at the moment:<br> &gt;<br> &gt; ../pegged/utils/bootstrap.d(1433): found &#39;:&#39; when expecting &#39;)&#39; following template argument list<br> &gt; ../pegged/utils/bootstrap.d(1433): members expected<br> &gt; ../pegged/utils/bootstrap.d(1433): { } expected following aggregate declaration<br> &gt; ../pegged/utils/bootstrap.d(1433): semicolon expected, not &#39;!&#39;<br> &gt; ../pegged/utils/bootstrap.d(1433): Declaration expected, not &#39;!&#39;<br> &gt; ../pegged/utils/bootstrap.d(1466): unrecognized declaration</p> <p>Hmm, it compiled for me a few hours ago. I&#39;ll see if I broke something while pushing.</p> <p>I&#39;ll also try to make the whole grammar-modification process easier. Since users can modify Pegged own grammar, I might as well make that fluid and easy to do.</p> <p>I&#39;ll put the Pegged grammar as a string in a separate module and create a function that does the rest: modify the string, it will recompile the entire grammar for you.<br><br></p> --bcaec51a8cc0b1b10404bafacc46--
Mar 11 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
--20cf300fa9536cda2204bafad5ba
Content-Type: text/plain; charset=UTF-8

 Hm, I don't *think* C has such ambiguities but I could well be wrong. In

I wanted to tackle D this week, but I might as well begin with C :) Do you happen to have any handy and readable EBNF grammar for C? At least for D, I have dlang.org. --20cf300fa9536cda2204bafad5ba Content-Type: text/html; charset=UTF-8 <p>&gt; Hm, I don&#39;t *think* C has such ambiguities but I could well be wrong. In any case, if it can handle the non-ambiguous case, that&#39;s enough for me. </p> <p>I wanted to tackle D this week, but I might as well begin with C :)</p> <p>Do you happen to have any handy and readable EBNF grammar for C? At least for D, I have <a href="http://dlang.org">dlang.org</a>.</p> --20cf300fa9536cda2204bafad5ba--
Mar 11 2012
prev sibling next sibling parent reply =?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:
On 11-03-2012 00:28, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax. From
 this grammar definition, a set of related parsers will be created, to be
 used at runtime or compile time.

 Usage
 -----

 To use Pegged, just call the `grammar` function with a PEG and mix it
 in. For example:


 import pegged.grammar;

 mixin(grammar("
 Expr <- Factor AddExpr*
 AddExpr <- ('+'/'-') Factor
 Factor <- Primary MulExpr*
 MulExpr <- ('*'/'/') Primary
 Primary <- Parens / Number / Variable / '-' Primary

 Parens <- '(' Expr ')'
 Number <~ [0-9]+
 Variable <- Identifier
 "));



 This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
 basic arithmetic expressions with operator precedence ('*' and '/' bind
 stronger than '+' or '-'). `Identifier` is a pre-defined parser
 recognizing your basic C-style identifier. Recursive or mutually
 recursive rules are OK (no left recursion for now).

 To use a parser, use the `.parse` method. It will return a parse tree
 containing the calls to the different rules:

 // Parsing at compile-time:
 enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

 pragma(msg, parseTree1.capture);
 writeln(parseTree1);

 // And at runtime too:
 auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
 assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



 Features
 --------

 * The complete set of PEG operators are implemented
 * Pegged can parse its input at compile time and generate a complete
 parse tree at compile time. In a word: compile-time string (read: D
 code) transformation and generation.
 * You can parse at runtime also, you lucky you.
 * Use a standard and readable PEG syntax as a DSL, not a bunch of
 templates that hide the parser in noise.
 * But you can use expression templates if you want, as parsers are all
 available as such. Pegged is implemented as an expression template, and
 what's good for the library writer is sure OK for the user too.
 * Some useful additional operators are there too: a way to discard
 matches (thus dumping them from the parse tree), to push captures on a
 stack, to accept matches that are equal to another match
 * Adding new parsers is easy.
 * Grammars are composable: you can put different
 `mixin(grammar(rules));` in a module and then grammars and rules can
 refer to one another. That way, you can have utility grammars providing
 their functionalities to other grammars.
 * That's why Pegged comes with some pre-defined grammars (JSON, etc).
 * Grammars can be dumped in a file to create a D module.

 More advanced features, outside the standard PEG perimeter are there to
 bring more power in the mix:

 * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
 previous rule defines a parametrized parser taking two other parsers
 (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
 * Named captures: any parser can be named with the `=` operator. The
 parse tree generated by the parser (so, also its matches) is delivered
 to the user in the output. Other parsers in the grammar see the named
 captures too.
 * Semantic actions can be added to any rule in a grammar. Once a rule
 has matched, its associated action is called on the rule output and
 passed as final result to other parsers further up the grammar. Do what
 you want to the parse tree. If the passed actions are delegates, they
 can access external variables.


 Philippe

Hm, since ' is used in the grammar of Pegged, how do I express it in my grammar spec? Is there a predefined rule for it, or is \' supposed to work? -- - Alex
Mar 11 2012
parent =?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:
On 12-03-2012 02:27, Alex Rønne Petersen wrote:
 On 11-03-2012 00:28, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax. From
 this grammar definition, a set of related parsers will be created, to be
 used at runtime or compile time.

 Usage
 -----

 To use Pegged, just call the `grammar` function with a PEG and mix it
 in. For example:


 import pegged.grammar;

 mixin(grammar("
 Expr <- Factor AddExpr*
 AddExpr <- ('+'/'-') Factor
 Factor <- Primary MulExpr*
 MulExpr <- ('*'/'/') Primary
 Primary <- Parens / Number / Variable / '-' Primary

 Parens <- '(' Expr ')'
 Number <~ [0-9]+
 Variable <- Identifier
 "));



 This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
 basic arithmetic expressions with operator precedence ('*' and '/' bind
 stronger than '+' or '-'). `Identifier` is a pre-defined parser
 recognizing your basic C-style identifier. Recursive or mutually
 recursive rules are OK (no left recursion for now).

 To use a parser, use the `.parse` method. It will return a parse tree
 containing the calls to the different rules:

 // Parsing at compile-time:
 enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

 pragma(msg, parseTree1.capture);
 writeln(parseTree1);

 // And at runtime too:
 auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
 assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



 Features
 --------

 * The complete set of PEG operators are implemented
 * Pegged can parse its input at compile time and generate a complete
 parse tree at compile time. In a word: compile-time string (read: D
 code) transformation and generation.
 * You can parse at runtime also, you lucky you.
 * Use a standard and readable PEG syntax as a DSL, not a bunch of
 templates that hide the parser in noise.
 * But you can use expression templates if you want, as parsers are all
 available as such. Pegged is implemented as an expression template, and
 what's good for the library writer is sure OK for the user too.
 * Some useful additional operators are there too: a way to discard
 matches (thus dumping them from the parse tree), to push captures on a
 stack, to accept matches that are equal to another match
 * Adding new parsers is easy.
 * Grammars are composable: you can put different
 `mixin(grammar(rules));` in a module and then grammars and rules can
 refer to one another. That way, you can have utility grammars providing
 their functionalities to other grammars.
 * That's why Pegged comes with some pre-defined grammars (JSON, etc).
 * Grammars can be dumped in a file to create a D module.

 More advanced features, outside the standard PEG perimeter are there to
 bring more power in the mix:

 * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
 previous rule defines a parametrized parser taking two other parsers
 (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
 * Named captures: any parser can be named with the `=` operator. The
 parse tree generated by the parser (so, also its matches) is delivered
 to the user in the output. Other parsers in the grammar see the named
 captures too.
 * Semantic actions can be added to any rule in a grammar. Once a rule
 has matched, its associated action is called on the rule output and
 passed as final result to other parsers further up the grammar. Do what
 you want to the parse tree. If the passed actions are delegates, they
 can access external variables.


 Philippe

Hm, since ' is used in the grammar of Pegged, how do I express it in my grammar spec? Is there a predefined rule for it, or is \' supposed to work?

Ah, Quote it is. I also noticed there is DoubleQuote. Is " used for anything in the Pegged grammar? -- - Alex
Mar 11 2012
prev sibling next sibling parent "Jay Norwood" <jayn prismnet.com> writes:

Admittedly I have not heard of PEGs before, so I'm curious: Is this powerful enough to parse a language such as C?

I've just read a few articles referenced from this page, and the second link was by someone who had done java 1.5, the second link http://bford.info/packrat/ http://www.romanredz.se/papers/FI2007.pdf It is interesting but that article left me with some questions about the implementation in order to make it useful for my needs. I had done an experiment with mvel expression evaluation in java and gotten good improvements relative to homebrew expression evaluators. However, the mvel expressions are missing the ability to express array operations clearly, which is something that is very clear in D, and my particular need is to enable the user to express array operations. With this pegged embedded parser, it appears to me you could provide a group of context symbols as part of a language definition, similar to providing a list of reserved words, so that the parsing of the user's expression would also validate the symbols used. Also, I've been reading David Simcha's parallel_algorithm.d, here: https://github.com/dsimcha/parallel_algorithm/blob/master/parallel_algorithm.d and in the parallelArrayOp portion, he has presented a way to turn the D array expressions into code that is executed in parallel on multicore systems. That is something I'd want to use, but the manual lexing requirement is a bit clunky, and the rules are unclear to me, so it seems to me a combination with the pegged parser could make that more accessible. Thanks, Jay
Mar 12 2012
prev sibling next sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Tuesday, 13 March 2012 at 05:25:38 UTC, Jay Norwood wrote:

Admittedly I have not heard of PEGs before, so I'm curious: Is this powerful enough to parse a language such as C?

I've just read a few articles referenced from this page, and the second link was by someone who had done java 1.5, the second link http://bford.info/packrat/ http://www.romanredz.se/papers/FI2007.pdf

Also in the later paper he did a C parser, so I suppose that is the answer ... http://www.romanredz.se/papers/FI2008.pdf
Mar 12 2012
prev sibling next sibling parent reply bls <bizprac orange.fr> writes:
On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

Just WOW! Nice to have on your WIKI would be a EBNF to PEG sheet. Wirth EBNF Pegged A = BC. A <- B C A = B|C. A <- C / C A = [B]. A <- B? A = {B}. A <- B* Having EBNF expressed in Pegged ... EBNF <- Procuction+ Production <- Identifier '=' Expression '.' Expression <- Term ( '|' Term)* Term <- Factor Factor* Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression }'/ '(' Expression ')' lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* Literal <- ("'" .+ "'" / '"' .+ '"') Due to the fact that EBNF can be expressed in EBNF it should be possible to parse an arbitrary EBNF file and generate PEG output. What do you think ? BTW.. Is my EBNF PEG description correct ? TIA Bjoern
Mar 12 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 12.03.2012 16:43, bls wrote:
 On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

Just WOW! Nice to have on your WIKI would be a EBNF to PEG sheet. Wirth EBNF Pegged A = BC. A <- B C A = B|C. A <- C / C

Maybe A <- B / C. And even then it's not exactly equivalent if the grammar was ambiguous. Imagine: B <- a, C <- aa -- Dmitry Olshansky
Mar 13 2012
parent reply bls <bizprac orange.fr> writes:
On 03/13/2012 04:28 AM, Dmitry Olshansky wrote:
 On 12.03.2012 16:43, bls wrote:
 On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

Just WOW! Nice to have on your WIKI would be a EBNF to PEG sheet. Wirth EBNF Pegged A = BC. A <- B C A = B|C. A <- C / C

Maybe A <- B / C. And even then it's not exactly equivalent if the grammar was ambiguous. Imagine: B <- a, C <- aa


My mistake.. cleaned up stuff.. Pegged Wirth EBNF Sequence A <- B C A = BC. B or C A <- B / C A = B|C. Zero or one B A <- B? A = [B]. Zero or more Bs A <- B* A = {B}. One or more Bs A <- B+ Not available PEG description of EBNF EBNF <- Procuction+ Production <- Identifier '=' Expression '.' Expression <- Term ( '|' Term)* Term <- Factor Factor* Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}' / '(' Expression ')' lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* Literal <- ("'" .+ "'") / ('"' .+ '"') Still not sure if this is correct. Especially : Term <- Factor Factor* Another thing I never really understand is the "production" order, In other words : Why not top down .. Start : lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* .... End : EBNF <- Procuction+ where End is Root.. TIA, Bjoern
Mar 12 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 12.03.2012 17:45, bls wrote:
 On 03/13/2012 04:28 AM, Dmitry Olshansky wrote:
 On 12.03.2012 16:43, bls wrote:
 On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

Just WOW! Nice to have on your WIKI would be a EBNF to PEG sheet. Wirth EBNF Pegged A = BC. A <- B C A = B|C. A <- C / C

Maybe A <- B / C. And even then it's not exactly equivalent if the grammar was ambiguous. Imagine: B <- a, C <- aa


PEG defines order of alternatives, that is pretty much like a top-down recursive descent parser would parse it. Alternatives are tried from left to right, if first one fails, it tries next and so on. In an example I give B is always picked first and so C is never ever looked at. Somewhat less artificial example: Literal <- IntL| FloatL FloatL <- [0-9]+(.[0-9]+)? IntL <- [0-9]+ If you change it to: Literal <- FloatL| IntL then integer literals would get parsed as floating point.

My mistake.. cleaned up stuff.. Pegged Wirth EBNF Sequence A <- B C A = BC. B or C A <- B / C A = B|C. Zero or one B A <- B? A = [B]. Zero or more Bs A <- B* A = {B}. One or more Bs A <- B+ Not available PEG description of EBNF EBNF <- Procuction+ Production <- Identifier '=' Expression '.' Expression <- Term ( '|' Term)* Term <- Factor Factor* Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}' / '(' Expression ')' lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

Why not: Identifier <- [a-zA-Z]+
 Literal <- ("'" .+ "'") / ('"' .+ '"')

 Still not sure if this is correct. Especially :
 Term <- Factor Factor*


 Another thing I never really understand is the "production" order, In
 other words : Why not top down ..
 Start :
 lowerCase <- [a-z]
 upperCase <- [A-Z]
 Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

 ....

 End :
 EBNF <- Procuction+

 where End is Root..

In fact grammars are usually devised the other way around, e.g. Start: Program -> ... Ehm... what the whole program is exactly ? Ok, let it be Declaration* for now. What kind of declarations do we have? ... and so on. Latter grammars get tweaked and extended numerous times. At any rate production order has no effect on the grammar, it's still the same. The only thing of importance is what non-terminal considered final (or start if you are LL-centric).
 TIA, Bjoern

-- Dmitry Olshansky
Mar 13 2012
next sibling parent =?ISO-8859-1?Q?Alex_R=F8nne_Petersen?= <xtzgzorex gmail.com> writes:
On 13-03-2012 17:17, Dmitry Olshansky wrote:
 On 12.03.2012 17:45, bls wrote:
 On 03/13/2012 04:28 AM, Dmitry Olshansky wrote:
 On 12.03.2012 16:43, bls wrote:
 On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

Just WOW! Nice to have on your WIKI would be a EBNF to PEG sheet. Wirth EBNF Pegged A = BC. A <- B C A = B|C. A <- C / C

Maybe A <- B / C. And even then it's not exactly equivalent if the grammar was ambiguous. Imagine: B <- a, C <- aa


PEG defines order of alternatives, that is pretty much like a top-down recursive descent parser would parse it. Alternatives are tried from left to right, if first one fails, it tries next and so on. In an example I give B is always picked first and so C is never ever looked at. Somewhat less artificial example: Literal <- IntL| FloatL FloatL <- [0-9]+(.[0-9]+)? IntL <- [0-9]+ If you change it to: Literal <- FloatL| IntL then integer literals would get parsed as floating point.

My mistake.. cleaned up stuff.. Pegged Wirth EBNF Sequence A <- B C A = BC. B or C A <- B / C A = B|C. Zero or one B A <- B? A = [B]. Zero or more Bs A <- B* A = {B}. One or more Bs A <- B+ Not available PEG description of EBNF EBNF <- Procuction+ Production <- Identifier '=' Expression '.' Expression <- Term ( '|' Term)* Term <- Factor Factor* Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}' / '(' Expression ')' lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

Why not: Identifier <- [a-zA-Z]+

That was an illustrative example from the Pegged docs. But yeah, you should just use a range; reads nicer.
 Literal <- ("'" .+ "'") / ('"' .+ '"')

 Still not sure if this is correct. Especially :
 Term <- Factor Factor*


 Another thing I never really understand is the "production" order, In
 other words : Why not top down ..
 Start :
 lowerCase <- [a-z]
 upperCase <- [A-Z]
 Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

 ....

 End :
 EBNF <- Procuction+

 where End is Root..

In fact grammars are usually devised the other way around, e.g. Start: Program -> ... Ehm... what the whole program is exactly ? Ok, let it be Declaration* for now. What kind of declarations do we have? ... and so on. Latter grammars get tweaked and extended numerous times. At any rate production order has no effect on the grammar, it's still the same. The only thing of importance is what non-terminal considered final (or start if you are LL-centric).
 TIA, Bjoern


-- - Alex
Mar 13 2012
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 14.03.2012 2:49, Philippe Sigaud wrote:
 On Tue, Mar 13, 2012 at 17:17, Dmitry Olshansky<dmitry.olsh gmail.com>  wrote:

 PEG defines order of alternatives, that is pretty much like a top-down
 recursive descent parser would parse it. Alternatives are tried from left to
 right, if first one fails, it tries next and so on.

Yes.
 In an example I give B is always picked first and so C is never ever looked
 at.

That's one of the caveats on PEG. That and greedy operators. 'a*a' never succeeds because 'a*' consumes all the available a's.

Hey, wait, I thought there has to be backtrack here, i.e. when second 'a' finally fails?
 Literal<- ("'" .+ "'") / ('"' .+ '"')


This is an example were PEG would munch anything till the end, and fail (since """ is not found in an empty string). The standard PEG way to do that is (!EndMarker .)+ EndMarker It's a common enough pattern I tend to define a parameterized rule for that: Until(Expr)<- (!Expr .)* Expr Gotta love parametrized rules! 15' to code, a neverending stream of joy.

Nice!
 In fact grammars are usually devised the other way around, e.g.
 Start:
   Program ->  ...
 Ehm... what the whole program is exactly ? Ok, let it be Declaration* for
 now. What kind of declarations do we have? ... and so on. Latter grammars
 get tweaked and extended numerous times.

 At any rate production order has no effect on the grammar, it's still the
 same. The only thing of importance is what non-terminal considered final (or
 start if you are LL-centric).

Yes. The PEG standard seem to be that the first rule is the start (root) rule. Pegged let you decide which rule you use for a start. The rest is automatic. I might change that.

-- Dmitry Olshansky
Mar 14 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 17.03.2012 10:59, Philippe Sigaud wrote:
 On Wed, Mar 14, 2012 at 10:48, Dmitry Olshansky<dmitry.olsh gmail.com>  wrote:

 That's one of the caveats on PEG. That and greedy operators.

 'a*a' never succeeds because 'a*' consumes all the available a's.

Hey, wait, I thought there has to be backtrack here, i.e. when second 'a' finally fails?

PEG only do local backtracking in 'or' choices, not in sequences. I think that's because the original author was dealing with packrat parsing and its O(input-size) guarantee.

Ok, let's agree on fact that semantically a* is : As <- a As / a and a*? is this: As <- a / a As Now that's local ;)
 I remember trying compile-time backtracking in another similar library
 in D 1-2 years ago and getting lots of pb. I might add that in Pegged,
 but I don't know the consequences. How do you do that in std.regex?

There are two fundamental ways to get around the problem, the easiest to implement is to use a stack to record a position in input + number of alternative/production (I call it a PC - program counter) every time there is a "fork" like that. Then when the current path ultimately fails pop stack, reset input and go over again until you either match or empty the stack. That's how to avoid deeeep recursion that happens here. The problematic thing now is combinatorial explosion of a* a* a* .... //simplified, but you get the idea The trick to avoid this sort of crap is to 1) agree that whatever matches first has higher priority (left-most match) that's a simple disambiguation rule. 2) record what positions + pc you place on stack e.g. use a set, or in fact, a bitmap to push every unique combination (pc,index) only once. Voila! Now you have a linear worst-case algorithm with (M*N) complexity where M is number of all possible PCs, and N is number of all possible indexes in input. Now if we put aside all crap talk about mathematical purity and monads, and you name it, a packrat is essentially this, a dynamic programming trick applied to recursive top-down parsing. The trick could be extended even more to avoid extra checks on input characters, but the essence is this "memoization".
 (nice article btw, I learnt some about regexes)

Thanks, I hope it makes them more popular. Might as well keep me busy fixing bugs :) -- Dmitry Olshansky
Mar 17 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 17.03.2012 18:11, Philippe Sigaud wrote:
 On Sat, Mar 17, 2012 at 10:09, Dmitry Olshansky<dmitry.olsh gmail.com>  wrote:

 Ok, let's agree on fact that semantically
 a* is :
 As<- a As / a

 and a*? is this:
 As<- a / a As

 Now that's local ;)

It's local, yes. But the pb is with Expr<- A* B C D, when D fails. PEG sequences don't backtrack.

I'd argue they do. As I see it as: Expr <- As B C D / B C D As <- A / A As (or use an epsilon production for As, is it allowed in pegged ?)
 I remember trying compile-time backtracking in another similar library
 in D 1-2 years ago and getting lots of pb. I might add that in Pegged,
 but I don't know the consequences. How do you do that in std.regex?

There are two fundamental ways to get around the problem

Thanks for the explanations. Does std.regex implement this?

It does the second way that I haven't told about, and had hard time to implement in C-T :) And the simple version of what I presented (i.e. stack stuff) is used in CT-regex and as fallback in R-T. The problem with doing a bitmap memoization is that regex often used to _search_ things in large inputs. Some compromise and/or thresholds have to be estimated and found. Parsing on the contrary guaranties that the whole input is used or in error, hence bitmap is not wasted.
 Now if we put aside all crap talk about mathematical purity and monads, and
 you name it, a packrat is essentially this, a dynamic programming trick
 applied to recursive top-down parsing. The trick could be extended even more
 to avoid extra checks on input characters, but the essence is this
 "memoization".

I plan to store indices anyway. I might add this in the future. A read something on using PEGs to parse Java and C and there was no need to do a complete memoization: storing the last parse result as a temporary seemed to be enough.

Well even straight no-memorization is somehow enough, it's just a way to ensure no extra work is done. C & Java are not much of backtracky grammars.
 (nice article btw, I learnt some about regexes)

Thanks, I hope it makes them more popular. Might as well keep me busy fixing bugs :)

As people said on Reddit, you should make more whooping sounds about CT-regex. That was what wowed me and pushed me to tackle CT-parsing again.

It's just I'm also well aware of how much luck it (still) takes to fly ;) The asserts that compare C-T vs R-T go off too often for my taste, other then this the memory usage during compile is too high. I recall once dmd had GC re-enabled for brief period, dmd used no more then ~64Mb doing real crazy ct-regex stuff. OK, back to C-T parsing, I have one crazy idea that I can't get away from - add operator precedence grammar into the mix. From what I observe it should integrate into PEG smoothly. Then it would make "military-grade" hybrid that uses operator precedence parsing of expressions and the like. Few more tricks and it may beat some existing parser generators. See this post, where I tried to describe that idea early on: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=159562 I might catch spare time to go about adding it myself, the only tricky thing is to embed plain semantic actions, as AST generation would be more or less the same. -- Dmitry Olshansky
Mar 17 2012
prev sibling parent =?ISO-8859-1?Q?Alex_R=F8nne_Petersen?= <xtzgzorex gmail.com> writes:
On 12-03-2012 13:43, bls wrote:
 On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

Just WOW! Nice to have on your WIKI would be a EBNF to PEG sheet. Wirth EBNF Pegged A = BC. A <- B C A = B|C. A <- C / C A = [B]. A <- B? A = {B}. A <- B* Having EBNF expressed in Pegged ... EBNF <- Procuction+ Production <- Identifier '=' Expression '.' Expression <- Term ( '|' Term)* Term <- Factor Factor* Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression }'/ '(' Expression ')' lowerCase <- [a-z] upperCase <- [A-Z] Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)* Literal <- ("'" .+ "'" / '"' .+ '"') Due to the fact that EBNF can be expressed in EBNF it should be possible to parse an arbitrary EBNF file and generate PEG output. What do you think ? BTW.. Is my EBNF PEG description correct ? TIA Bjoern

The thing is, PEG cannot represent an ambiguous grammar. It is fully ordered, so it's just plain impossible. That's not to say you can't turn an EBNF grammar into PEG, but it won't necessarily be useful in all cases. -- - Alex
Mar 13 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
I am impressed. That's a really nice showcase for the D compile 
time features.

Can I use PEG to parse languages like python and haskell where 
indention matters without preprocessing?

Will you make it work with input ranges of dchar? So that I can 
easily plug in some preprocessing steps?
Mar 13 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Tue, Mar 13, 2012 at 18:05, Alex R=C3=B8nne Petersen <xtzgzorex gmail.co=
m> wrote:

 lowerCase <- [a-z]
 upperCase <- [A-Z]
 Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

Why not: Identifier <- [a-zA-Z]+

That was an illustrative example from the Pegged docs. But yeah, you shou=

 just use a range; reads nicer.

The docs are for teaching PEG :) (btw, it's the docs describe C-like identifiers, that's why I chose a longer approach) It's always this 'tension', between inlining and refactoring. [a-zA-Z]+ is shorter and more readable. But If you decide to extend your grammar to UTF-32, it'd be easier to just change the 'letter' rule.
Mar 13 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Tue, Mar 13, 2012 at 17:17, Dmitry Olshansky <dmitry.olsh gmail.com> wro=
te:

 PEG defines order of alternatives, that is pretty much like a top-down
 recursive descent parser would parse it. Alternatives are tried from left=

 right, if first one fails, it tries next and so on.

Yes.
 In an example I give B is always picked first and so C is never ever look=

 at.

That's one of the caveats on PEG. That and greedy operators. 'a*a' never succeeds because 'a*' consumes all the available a's.
 Literal <- ("'" .+ "'") / ('"' .+ '"')


This is an example were PEG would munch anything till the end, and fail (since """ is not found in an empty string). The standard PEG way to do that is (!EndMarker .)+ EndMarker It's a common enough pattern I tend to define a parameterized rule for that= : Until(Expr) <- (!Expr .)* Expr Gotta love parametrized rules! 15' to code, a neverending stream of joy.
 In fact grammars are usually devised the other way around, e.g.
 Start:
 =C2=A0Program -> ...
 Ehm... what the whole program is exactly ? Ok, let it be Declaration* for
 now. What kind of declarations do we have? ... and so on. Latter grammars
 get tweaked and extended numerous times.

 At any rate production order has no effect on the grammar, it's still the
 same. The only thing of importance is what non-terminal considered final =

 start if you are LL-centric).

Yes. The PEG standard seem to be that the first rule is the start (root) rule. Pegged let you decide which rule you use for a start. The rest is automatic. I might change that.
Mar 13 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Mon, Mar 12, 2012 at 13:43, bls <bizprac orange.fr> wrote:

 Just WOW!

Thanks! Don't be too excited, it's still quite slow as a parser. But that is a fun project :)
 Nice to have on your WIKI would be a EBNF to PEG sheet.

 Wirth EBNF =C2=A0 =C2=A0 =C2=A0Pegged
 A =3D BC. =C2=A0 =C2=A0 =C2=A0 =C2=A0 A <- B C
 A =3D B|C. =C2=A0 =C2=A0 =C2=A0 =C2=A0A <- C / C
 A =3D [B]. =C2=A0 =C2=A0 =C2=A0 =C2=A0A <- B?
 A =3D {B}. =C2=A0 =C2=A0 =C2=A0 =C2=A0A <- B*

fact is, I don't know EBNF that much. I basically learned everything I know about parsing or grammars while coding Pegged in February :) I probably made every mistakes in the book. Hey, it's a github public wiki, I guess you can create a new page?
Mar 13 2012
prev sibling next sibling parent reply "Jay Norwood" <jayn prismnet.com> writes:
On Saturday, 10 March 2012 at 23:28:42 UTC, Philippe Sigaud wrote:
 * Grammars can be dumped in a file to create a D module.

In reading the D spec I've seen a few instance where there are infered items such as auto for variables and various parameters in templates. I presume your D grammar will have to have some rules to be able to infer these as well. It seems like it would not be a big step to output what are the infered proper statements as the .capture output. Is that correct? I think that would be helpful in some cases to be able to view the fully expanded expression.
Mar 14 2012
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 03/15/2012 12:09 AM, Jay Norwood wrote:
 On Saturday, 10 March 2012 at 23:28:42 UTC, Philippe Sigaud wrote:
 * Grammars can be dumped in a file to create a D module.

In reading the D spec I've seen a few instance where there are infered items such as auto for variables and various parameters in templates. I presume your D grammar will have to have some rules to be able to infer these as well. It seems like it would not be a big step to output what are the infered proper statements as the .capture output. Is that correct? I think that would be helpful in some cases to be able to view the fully expanded expression.

This is not what a parser does. Resolving symbols and assigning types to expressions is part of semantic analysis.
Mar 15 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Wed, Mar 14, 2012 at 10:48, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 That's one of the caveats on PEG. That and greedy operators.

 'a*a' never succeeds because 'a*' consumes all the available a's.

Hey, wait, I thought there has to be backtrack here, i.e. when second 'a' finally fails?

PEG only do local backtracking in 'or' choices, not in sequences. I think that's because the original author was dealing with packrat parsing and its O(input-size) guarantee. I remember trying compile-time backtracking in another similar library in D 1-2 years ago and getting lots of pb. I might add that in Pegged, but I don't know the consequences. How do you do that in std.regex? (nice article btw, I learnt some about regexes)
Mar 16 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sat, Mar 17, 2012 at 10:09, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 Ok, let's agree on fact that semantically
 a* is :
 As <- a As / a

 and a*? is this:
 As <- a / a As

 Now that's local ;)

It's local, yes. But the pb is with Expr <- A* B C D, when D fails. PEG sequences don't backtrack.
 I remember trying compile-time backtracking in another similar library
 in D 1-2 years ago and getting lots of pb. I might add that in Pegged,
 but I don't know the consequences. How do you do that in std.regex?

There are two fundamental ways to get around the problem

Thanks for the explanations. Does std.regex implement this?
 Now if we put aside all crap talk about mathematical purity and monads, and
 you name it, a packrat is essentially this, a dynamic programming trick
 applied to recursive top-down parsing. The trick could be extended even more
 to avoid extra checks on input characters, but the essence is this
 "memoization".

I plan to store indices anyway. I might add this in the future. A read something on using PEGs to parse Java and C and there was no need to do a complete memoization: storing the last parse result as a temporary seemed to be enough.
 (nice article btw, I learnt some about regexes)

Thanks, I hope it makes them more popular. Might as well keep me busy fixing bugs :)

As people said on Reddit, you should make more whooping sounds about CT-regex. That was what wowed me and pushed me to tackle CT-parsing again.
Mar 17 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sat, Mar 17, 2012 at 15:48, Dmitry Olshansky <dmitry.olsh gmail.com> wro=
te:

 PEG sequences don't backtrack.

I'd argue they do. As I see it as: Expr <- As B C D / B C D As <- A / A As

That's what people doing Regex-to-PEG translations do, yes. But it's not the spontaneous behavior of A* B C D in PEG. But that means I could add a switch to transform the expressions that way.
 (or use an epsilon production for As, is it allowed in pegged ?)

I called it 'Eps', it's a predefined parser that always matches and consumes nothing. I used the greek epsilon at the beginning (=CE=B5) but thought that many people would shout at this :)
 OK, back to C-T parsing, I have one crazy idea that I can't get away from=

 add operator precedence grammar into the mix. From what I observe it shou=

 integrate into PEG smoothly. Then it would make "military-grade" hybrid t=

 uses operator precedence parsing of expressions and the like. Few more
 tricks and it may beat some
 existing parser generators.

 See this post, where I tried to describe that idea early on:
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=3Ddigitalmars=

I remember reading this. But I don't feel I'm up to it for now.
 I might catch spare time to go about adding it myself, the only tricky th=

 is to embed plain semantic actions, as AST generation would be more or le=

 the same.

Cool!
Mar 17 2012
prev sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Sun, 11 Mar 2012 00:28:42 +0100, Philippe Sigaud  
<philippe.sigaud gmail.com> wrote:

 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar  
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax.  From  
 this grammar definition, a set of related parsers will be created, to be  
 used at runtime or compile time.

Just wanted to draw your attention on a pull request of mine. https://github.com/D-Programming-Language/dmd/pull/426 Using an added option (-M|Mf|Md) dmd will gather all mixin strings into a file. This is very helpful for better error messages and debugging.
Mar 22 2012