digitalmars.D.announce - Pegged, a Parsing Expression Grammar (PEG) generator in D

Philippe Sigaud (80/80) Mar 10 2012 Hello,

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= (5/80) Mar 10 2012 Admittedly I have not heard of PEGs before, so I'm curious: Is this

Andrej Mitrovic (13/13) Mar 10 2012 I see you are not the only one who started writing string array

bearophile (5/7) Mar 10 2012 To avoid that bug:
Andrei Alexandrescu (20/33) Mar 10 2012 I, too, think this is very significant work! Suggestion for Philippe:

Philippe Sigaud (11/28) Mar 10 2012 It's already implemented! No need for ';'

Andrei Alexandrescu (5/10) Mar 10 2012 Great! I think you'd be wise to add the ";", otherwise it's impossible

Philippe Sigaud (11/14) Mar 10 2012 Complex rules can be multi-line, I'll put more examples of this in the d...
Philippe Sigaud (5/6) Mar 10 2012 <- ...)

Philippe Sigaud (8/10) Mar 10 2012 ful

Andrei Alexandrescu (4/12) Mar 10 2012 Any chance you consider adding AST generator actions as discussed in the...

Philippe Sigaud (11/13) Mar 10 2012 The AST is automatically produced, and there are already AST actions

Andrei Alexandrescu (5/18) Mar 10 2012 I was thinking of ANTLR-style operators in which you say where the root

Philippe Sigaud (25/28) Mar 11 2012 Get it, thanks for the ref!

Andrei Alexandrescu (7/26) Mar 11 2012 Not getting this, but picture me with an intense look and nodding.

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= (7/15) Mar 11 2012 Oh, I should have mentioned I only meant the actual language (ignoring

Philippe Sigaud (11/16) Mar 11 2012 com>

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= (5/22) Mar 11 2012 Hm, I don't *think* C has such ambiguities but I could well be wrong. In...

Philippe Sigaud (4/5) Mar 11 2012 any case, if it can handle the non-ambiguous case, that's enough for me.

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= (5/10) Mar 11 2012 Yep, see: http://ssw.jku.at/Coco/ (C.atg)

Philippe Sigaud (18/31) Mar 10 2012 Yes, I use what I call "Haskell-comma" (I saw it first on Haskell
Jay Norwood (28/31) Mar 12 2012 I've just read a few articles referenced from this page, and the

Jay Norwood (4/13) Mar 12 2012 Also in the later paper he did a C parser, so I suppose that is

Jonathan M Davis (5/9) Mar 10 2012 Cool! PEGs aren't used all that much for language grammars - primarily b...
Philippe Sigaud (3/11) Mar 10 2012 Is there an EBNF grammar for D somewhere, I mean outside the dlang docs?
Jonathan M Davis (6/21) Mar 10 2012 Someone may have been trying to write one, but I don't believe that ther...
Philippe Sigaud (2/7) Mar 10 2012 Hmm. I'll post on D.main and see.
=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= (4/79) Mar 11 2012 Question: Are the generated parsers, AST nodes, etc classes or structs?

Philippe Sigaud (3/4) Mar 11 2012 They are structs. See:
Tom Compton (3/5) Aug 08 2013 Why is this important?

kraybourne (5/9) Mar 11 2012 Very cool!

Philippe Sigaud (13/14) Mar 11 2012 space-insensitivity, where might one find this?

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= (12/87) Mar 11 2012 By the way, bootstrap.d seems to fail to build at the moment:

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= (5/117) Mar 11 2012 Also, I have sent a pull request to fix the build on 64-bit:

Philippe Sigaud (2/3) Mar 11 2012 https://github.com/PhilippeSigaud/Pegged/pull/1

Philippe Sigaud (10/17) Mar 11 2012 Hmm, it compiled for me a few hours ago. I'll see if I broke something

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= (4/22) Mar 11 2012 Is bootstrap.d currently essential to actually use Pegged?

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= (5/80) Mar 11 2012 Hm, since ' is used in the grammar of Pegged, how do I express it in my

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= (5/110) Mar 11 2012 Ah, Quote it is. I also noticed there is DoubleQuote. Is " used for

bls (26/31) Mar 12 2012 Just WOW!

Dmitry Olshansky (6/20) Mar 13 2012 Maybe A <- B / C. And even then it's not exactly equivalent if the

bls (38/59) Mar 12 2012 PEG is pretty new to me. Can you elaborate a bit ?

Dmitry Olshansky (26/87) Mar 13 2012 PEG defines order of alternatives, that is pretty much like a top-down

=?ISO-8859-1?Q?Alex_R=F8nne_Petersen?= (5/105) Mar 13 2012 That was an illustrative example from the Pegged docs. But yeah, you

Philippe Sigaud (9/18) Mar 13 2012 ld

Philippe Sigaud (20/37) Mar 13 2012 to

Dmitry Olshansky (6/39) Mar 14 2012 Hey, wait, I thought there has to be backtrack here, i.e. when second

Philippe Sigaud (8/14) Mar 16 2012 PEG only do local backtracking in 'or' choices, not in sequences. I

Dmitry Olshansky (32/47) Mar 17 2012 Ok, let's agree on fact that semantically

Philippe Sigaud (12/30) Mar 17 2012 It's local, yes. But the pb is with Expr <- A* B C D, when D fails.

Dmitry Olshansky (33/70) Mar 17 2012 I'd argue they do. As I see it as:

Philippe Sigaud (17/33) Mar 17 2012 That's what people doing Regex-to-PEG translations do, yes. But it's

=?ISO-8859-1?Q?Alex_R=F8nne_Petersen?= (7/41) Mar 13 2012 The thing is, PEG cannot represent an ambiguous grammar. It is fully
Philippe Sigaud (7/14) Mar 13 2012 Thanks! Don't be too excited, it's still quite slow as a parser. But

Tobias Pankrath (6/6) Mar 13 2012 I am impressed. That's a really nice showcase for the D compile
Jay Norwood (9/10) Mar 14 2012 In reading the D spec I've seen a few instance where there are

Timon Gehr (3/14) Mar 15 2012 This is not what a parser does. Resolving symbols and assigning types to...

Martin Nowak (7/16) Mar 22 2012 Just wanted to draw your attention on a pull request of mine.
Elie Morisse (9/9) Aug 07 2013 Hi Philippe,

"Philippe Sigaud" <philippe.sigaud gmail.com> writes:

Hello,

I created a new Github project, Pegged, a Parsing Expression 
Grammar (PEG) generator in D.

https://github.com/PhilippeSigaud/Pegged

docs: https://github.com/PhilippeSigaud/Pegged/wiki

PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

The idea is to give the generator a PEG with the standard syntax. 
 From this grammar definition, a set of related parsers will be 
created, to be used at runtime or compile time.

Usage
-----

To use Pegged, just call the `grammar` function with a PEG and 
mix it in. For example:


import pegged.grammar;

mixin(grammar("
     Expr     <- Factor AddExpr*
     AddExpr  <- ('+'/'-') Factor
     Factor   <- Primary MulExpr*
     MulExpr  <- ('*'/'/') Primary
     Primary  <- Parens / Number / Variable / '-' Primary

     Parens   <- '(' Expr ')'
     Number   <~ [0-9]+
     Variable <- Identifier
"));



This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers 
for basic arithmetic expressions with operator precedence ('*' 
and '/' bind stronger than '+' or '-'). `Identifier` is a 
pre-defined parser recognizing your basic C-style identifier. 
Recursive or mutually recursive rules are OK (no left recursion 
for now).

To use a parser, use the `.parse` method. It will return a parse 
tree containing the calls to the different rules:

// Parsing at compile-time:
enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

pragma(msg, parseTree1.capture);
writeln(parseTree1);

// And at runtime too:
auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



Features
--------

* The complete set of PEG operators are implemented
* Pegged can parse its input at compile time and generate a 
complete parse tree at compile time. In a word: compile-time 
string (read: D code) transformation and generation.
* You can parse at runtime also, you lucky you.
* Use a standard and readable PEG syntax as a DSL, not a bunch of 
templates that hide the parser in noise.
* But you can use expression templates if you want, as parsers 
are all available as such. Pegged is implemented as an expression 
template, and what's good for the library writer is sure OK for 
the user too.
* Some useful additional operators are there too: a way to 
discard matches (thus dumping them from the parse tree), to push 
captures on a stack, to accept matches that are equal to another 
match
* Adding new parsers is easy.
* Grammars are composable: you can put different 
`mixin(grammar(rules));` in a module and then grammars and rules 
can refer to one another. That way, you can have utility grammars 
providing their functionalities to other grammars.
* That's why Pegged comes with some pre-defined grammars (JSON, 
etc).
* Grammars can be dumped in a file to create a D module.

More advanced features, outside the standard PEG perimeter are 
there to bring more power in the mix:

* Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. 
The previous rule defines a parametrized parser taking two other 
parsers (namely, `E` and `Sep`) to match a `Sep`-separated list 
of `E`'s.
* Named captures: any parser can be named with the `=` operator. 
The parse tree generated by the parser (so, also its matches) is 
delivered to the user in the output. Other parsers in the grammar 
see the named captures too.
* Semantic actions can be added to any rule in a grammar. Once a 
rule has matched, its associated action is called on the rule 
output and passed as final result to other parsers further up the 
grammar. Do what you want to the parse tree. If the passed 
actions are delegates, they can access external variables.


Philippe

Mar 10 2012

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:

On 11-03-2012 00:28, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax. From
 this grammar definition, a set of related parsers will be created, to be
 used at runtime or compile time.

 Usage
 -----

 To use Pegged, just call the `grammar` function with a PEG and mix it
 in. For example:


 import pegged.grammar;

 mixin(grammar("
 Expr <- Factor AddExpr*
 AddExpr <- ('+'/'-') Factor
 Factor <- Primary MulExpr*
 MulExpr <- ('*'/'/') Primary
 Primary <- Parens / Number / Variable / '-' Primary

 Parens <- '(' Expr ')'
 Number <~ [0-9]+
 Variable <- Identifier
 "));



 This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
 basic arithmetic expressions with operator precedence ('*' and '/' bind
 stronger than '+' or '-'). `Identifier` is a pre-defined parser
 recognizing your basic C-style identifier. Recursive or mutually
 recursive rules are OK (no left recursion for now).

 To use a parser, use the `.parse` method. It will return a parse tree
 containing the calls to the different rules:

 // Parsing at compile-time:
 enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

 pragma(msg, parseTree1.capture);
 writeln(parseTree1);

 // And at runtime too:
 auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
 assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



 Features
 --------

 * The complete set of PEG operators are implemented
 * Pegged can parse its input at compile time and generate a complete
 parse tree at compile time. In a word: compile-time string (read: D
 code) transformation and generation.
 * You can parse at runtime also, you lucky you.
 * Use a standard and readable PEG syntax as a DSL, not a bunch of
 templates that hide the parser in noise.
 * But you can use expression templates if you want, as parsers are all
 available as such. Pegged is implemented as an expression template, and
 what's good for the library writer is sure OK for the user too.
 * Some useful additional operators are there too: a way to discard
 matches (thus dumping them from the parse tree), to push captures on a
 stack, to accept matches that are equal to another match
 * Adding new parsers is easy.
 * Grammars are composable: you can put different
 `mixin(grammar(rules));` in a module and then grammars and rules can
 refer to one another. That way, you can have utility grammars providing
 their functionalities to other grammars.
 * That's why Pegged comes with some pre-defined grammars (JSON, etc).
 * Grammars can be dumped in a file to create a D module.

 More advanced features, outside the standard PEG perimeter are there to
 bring more power in the mix:

 * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
 previous rule defines a parametrized parser taking two other parsers
 (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
 * Named captures: any parser can be named with the `=` operator. The
 parse tree generated by the parser (so, also its matches) is delivered
 to the user in the output. Other parsers in the grammar see the named
 captures too.
 * Semantic actions can be added to any rule in a grammar. Once a rule
 has matched, its associated action is called on the rule output and
 passed as final result to other parsers further up the grammar. Do what
 you want to the parse tree. If the passed actions are delegates, they
 can access external variables.


 Philippe

Admittedly I have not heard of PEGs before, so I'm curious: Is this 
powerful enough to parse a language such as C?

-- 
- Alex

Mar 10 2012

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

I see you are not the only one who started writing string array
literals like this:

enum PEGCode = grammarCode!(
     "Grammar <- S Definition+ EOI"
    ,"Definition <- RuleName Arrow Expression"
    ,"RuleName   <- Identifier>(ParamList?)"
    ,"Expression <- Sequence (OR Sequence)*"
);

IOW comma on the left side. I know it's not a style preference but
actually a (unfortunate but needed) technique for avoiding bugs. :)

So can you use this to actually parse D code, extract syntax trees and
stuff like that? I'm clueless about parsing but it looks very neat.
Nice work!

Mar 10 2012

bearophile <bearophileHUGS lycos.com> writes:

Andrej Mitrovic:

 IOW comma on the left side. I know it's not a style preference but
 actually a (unfortunate but needed) technique for avoiding bugs. :)

To avoid that bug:
http://d.puremagic.com/issues/show_bug.cgi?id=3827

Bye,
bearophile

Mar 10 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 3/10/12 5:56 PM, Andrej Mitrovic wrote:
 I see you are not the only one who started writing string array
 literals like this:

 enum PEGCode = grammarCode!(
       "Grammar<- S Definition+ EOI"
      ,"Definition<- RuleName Arrow Expression"
      ,"RuleName<- Identifier>(ParamList?)"
      ,"Expression<- Sequence (OR Sequence)*"
 );

 IOW comma on the left side. I know it's not a style preference but
 actually a (unfortunate but needed) technique for avoiding bugs. :)

 So can you use this to actually parse D code, extract syntax trees and
 stuff like that? I'm clueless about parsing but it looks very neat.
 Nice work!

I, too, think this is very significant work! Suggestion for Philippe: 
instead of this:

enum PEGCode = grammarCode!(
      "Grammar <- S Definition+ EOI"
     ,"Definition <- RuleName Arrow Expression"
     ,"RuleName   <- Identifier>(ParamList?)"
     ,"Expression <- Sequence (OR Sequence)*"
);

how about this:

enum PEGCode = grammarCode!("
     Grammar <- S Definition+ EOI;
     Definition <- RuleName Arrow Expression;
     RuleName   <- Identifier>(ParamList?);
     Expression <- Sequence (OR Sequence)*;
");

Splitting on ";" is trivial and makes client code considerably easier to 
play with.

I'll get back to this.


Andrei

Mar 10 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Mar 11, 2012 at 08:26, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 I, too, think this is very significant work! Suggestion for Philippe:
 instead of this:


 enum PEGCode =3D grammarCode!(
 =C2=A0 =C2=A0 "Grammar <- S Definition+ EOI"
 =C2=A0 =C2=A0,"Definition <- RuleName Arrow Expression"
 =C2=A0 =C2=A0,"RuleName =C2=A0 <- Identifier>(ParamList?)"
 =C2=A0 =C2=A0,"Expression <- Sequence (OR Sequence)*"
 );

 how about this:

 enum PEGCode =3D grammarCode!("
 =C2=A0 =C2=A0Grammar <- S Definition+ EOI;

 =C2=A0 =C2=A0Definition <- RuleName Arrow Expression;
 =C2=A0 =C2=A0RuleName =C2=A0 <- Identifier>(ParamList?);
 =C2=A0 =C2=A0Expression <- Sequence (OR Sequence)*;
 ");

 Splitting on ";" is trivial and makes client code considerably easier to
 play with.

It's already implemented! No need for ';'
The docs are up to date, the internal code also. It's only an old
bootstrapping file that Andrej cited that still contains multi-string
definitions. I'll correct that, as that means I didn"t push dogfooding
far enough.

Now, just use one string:

"Grammar <- ...
 Definition <- ...

"

Mar 10 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 3/11/12 1:35 AM, Philippe Sigaud wrote:
 On Sun, Mar 11, 2012 at 08:26, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org>  wrote:
 Splitting on ";" is trivial and makes client code considerably easier to
 play with.

 It's already implemented! No need for ';'

Great! I think you'd be wise to add the ";", otherwise it's impossible 
to split complex rules on more than one line. Alternatively you could 
have the complement, a line continuation thingie.


Andrei

Mar 10 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Mar 11, 2012 at 08:47, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 Great! I think you'd be wise to add the ";", otherwise it's impossible to
 split complex rules on more than one line. Alternatively you could have the
 complement, a line continuation thingie.

Complex rules can be multi-line, I'll put more examples of this in the docs.

Now that Pegged is bootstrapped (the grammar generator was generated
by Pegged itself), it's easy to change the grammar.

Anyway, the separating if done on the rule definitions  (Identifier <- ...)

A <- B   C
     / C

E <- New Def

Mar 10 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Mar 11, 2012 at 08:53, Philippe Sigaud
<philippe.sigaud gmail.com> wrote:

 Anyway, the separating if done on the rule definitions =C2=A0(Identifier =

<- ...)

Aw, I meant, the separation is done on rule definitions. Damn children
climbing on me to get more breakfast :-)

Mar 10 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Mar 11, 2012 at 00:34, Alex R=C3=B8nne Petersen <xtzgzorex gmail.co=
m> wrote:

 Admittedly I have not heard of PEGs before, so I'm curious: Is this power=

ful
 enough to parse a language such as C?

I think so. But you'd have to do add some semantic action to deal with
typedefs and macros.

People parsed Java and Javascript with them. I personnally never used
Pegged for more than JSON and custom formats, but I intend to try the
D grammar.

Mar 10 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 3/11/12 1:22 AM, Philippe Sigaud wrote:
 On Sun, Mar 11, 2012 at 00:34, Alex Rønne Petersen<xtzgzorex gmail.com> 
wrote:

 Admittedly I have not heard of PEGs before, so I'm curious: Is this powerful
 enough to parse a language such as C?

 I think so. But you'd have to do add some semantic action to deal with
 typedefs and macros.

 People parsed Java and Javascript with them. I personnally never used
 Pegged for more than JSON and custom formats, but I intend to try the
 D grammar.

Any chance you consider adding AST generator actions as discussed in the 
main forum a while ago?

Andrei

Mar 10 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Mar 11, 2012 at 08:26, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 Any chance you consider adding AST generator actions as discussed in the
 main forum a while ago?

The AST is automatically produced, and there are already AST actions
to simplify / guide its creation. There already is an 'ignore that
node' syntax and a 'fuse the captures here'.
Also, semantic actions are there, so I think the basis is there, I
just need to add a few tree actions : cut children, take only some
children.

I'll add it today.

But maybe you had something else in mind? I'll go and search for the
thread you refer to.

Mar 10 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 3/11/12 1:38 AM, Philippe Sigaud wrote:
 On Sun, Mar 11, 2012 at 08:26, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org>  wrote:

 Any chance you consider adding AST generator actions as discussed in the
 main forum a while ago?

 The AST is automatically produced, and there are already AST actions
 to simplify / guide its creation. There already is an 'ignore that
 node' syntax and a 'fuse the captures here'.
 Also, semantic actions are there, so I think the basis is there, I
 just need to add a few tree actions : cut children, take only some
 children.

 I'll add it today.

 But maybe you had something else in mind? I'll go and search for the
 thread you refer to.

I was thinking of ANTLR-style operators in which you say where the root 
should be and you get to drop unnecessary nodes.

(Post on 2012/02/29 10:19AM GMT-0600.)


Andrei

Mar 10 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Mar 11, 2012 at 08:51, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 I was thinking of ANTLR-style operators in which you say where the root
 should be and you get to drop unnecessary nodes.

 (Post on 2012/02/29 10:19AM GMT-0600.)

Get it, thanks for the ref!

There is an operator to drop unnecessary nodes (':'). Apart from that,
a Pegged grammar is a self-contained entity: it automatically cuts
nodes coming from other grammars, to simplify the tree (it keeps the
matcheds substrings, of course). I plan to code different type of
grammars, to play with the way the deal with nodes coming from outside
rules and grammar composition.

As for the root, the PEG rule is that the first rule in the grammar is
the root. But currently, I deliberately made my grammars unsealed, in
that the user can call any rule inside the grammar to begin parsing:

mixin(grammar("
    A <- B C
    B <- ...
    C <-
"));

A.parse("some input"); // standard way to do it, the parse tree will
be rooted on an 'A' node.
B.parse("some input"); // also possible, the parse tree will be rooted
on a 'B' node.

I'll add what I called closed and sealed grammars. I'll also add named
grammars and such in the days to come. I can add a root note
indication ("unless told otherwise, when asked to parse with grammar
G, begin with rule R).

Mar 11 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 3/11/12 3:02 AM, Philippe Sigaud wrote:
 There is an operator to drop unnecessary nodes (':').

Good.

 Apart from that,
 a Pegged grammar is a self-contained entity: it automatically cuts
 nodes coming from other grammars, to simplify the tree (it keeps the
 matcheds substrings, of course). I plan to code different type of
 grammars, to play with the way the deal with nodes coming from outside
 rules and grammar composition.

Not getting this, but picture me with an intense look and nodding.

 As for the root, the PEG rule is that the first rule in the grammar is
 the root. But currently, I deliberately made my grammars unsealed, in
 that the user can call any rule inside the grammar to begin parsing:

 mixin(grammar("
      A<- B C
      B<- ...
      C<-
 "));

 A.parse("some input"); // standard way to do it, the parse tree will
 be rooted on an 'A' node.
 B.parse("some input"); // also possible, the parse tree will be rooted
 on a 'B' node.

That sounds cool, but how do you say that in the construct

Expression <- Term "+" Term

the "+" should become the root?


Andrei

Mar 11 2012

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:

On 11-03-2012 08:22, Philippe Sigaud wrote:
 On Sun, Mar 11, 2012 at 00:34, Alex Rønne Petersen<xtzgzorex gmail.com> 
wrote:

 Admittedly I have not heard of PEGs before, so I'm curious: Is this powerful
 enough to parse a language such as C?

 I think so. But you'd have to do add some semantic action to deal with
 typedefs and macros.

Oh, I should have mentioned I only meant the actual language (ignoring 
the preprocessor).

Why do you need semantic actions for typedefs though? Can't you defer 
resolution of types until after parsing?

 People parsed Java and Javascript with them. I personnally never used
 Pegged for more than JSON and custom formats, but I intend to try the
 D grammar.

-- 
- Alex

Mar 11 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 On Sun, Mar 11, 2012 at 00:34, Alex R=C3=B8nne Petersen<xtzgzorex gmail.=


com>
 wrote:

[Parsing C?]
 I think so. But you'd have to do add some semantic action to deal with
 typedefs and macros.


 Oh, I should have mentioned I only meant the actual language (ignoring

the preprocessor).

OK. I admit I downloaded the C spec online, but was a bit taken aback by
the size of it. mot of it was the definition of the standard library, but
still...

 Why do you need semantic actions for typedefs though? Can't you defer

resolution of types until after parsing?

Yes, that the way I'd do it. But some people seem to want to do it while
parsing. Maybe it blocks some parsing, if the parser encounter an
identifier where there should be a type?

Mar 11 2012

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:

On 11-03-2012 18:06, Philippe Sigaud wrote:
  >> On Sun, Mar 11, 2012 at 00:34, Alex Rønne
 Petersen<xtzgzorex gmail.com <mailto:xtzgzorex gmail.com>>  wrote:

 [Parsing C?]
  >> I think so. But you'd have to do add some semantic action to deal with
  >> typedefs and macros.
  >
  >
  > Oh, I should have mentioned I only meant the actual language
 (ignoring the preprocessor).

 OK. I admit I downloaded the C spec online, but was a bit taken aback by
 the size of it. mot of it was the definition of the standard library,
 but still...

  > Why do you need semantic actions for typedefs though? Can't you defer
 resolution of types until after parsing?

 Yes, that the way I'd do it. But some people seem to want to do it while
 parsing. Maybe it blocks some parsing, if the parser encounter an
 identifier where there should be a type?

Hm, I don't *think* C has such ambiguities but I could well be wrong. In 
any case, if it can handle the non-ambiguous case, that's enough for me. :)

-- 
- Alex

Mar 11 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 Hm, I don't *think* C has such ambiguities but I could well be wrong. In

any case, if it can handle the non-ambiguous case, that's enough for me.

I wanted to tackle D this week, but I might as well begin with C :)

Do you happen to have any handy and readable EBNF grammar for C? At least
for D, I have dlang.org.

Mar 11 2012

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:

On 11-03-2012 18:19, Philippe Sigaud wrote:
  > Hm, I don't *think* C has such ambiguities but I could well be wrong.
 In any case, if it can handle the non-ambiguous case, that's enough for me.

 I wanted to tackle D this week, but I might as well begin with C :)

 Do you happen to have any handy and readable EBNF grammar for C? At
 least for D, I have dlang.org <http://dlang.org>.

Yep, see: http://ssw.jku.at/Coco/ (C.atg)

Coco/R is more or less EBNF.

-- 
- Alex

Mar 11 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Mar 11, 2012 at 00:56, Andrej Mitrovic
<andrej.mitrovich gmail.com> wrote:
 I see you are not the only one who started writing string array
 literals like this:

 enum PEGCode =3D grammarCode!(
 =C2=A0 =C2=A0 "Grammar <- S Definition+ EOI"
 =C2=A0 =C2=A0,"Definition <- RuleName Arrow Expression"
 =C2=A0 =C2=A0,"RuleName =C2=A0 <- Identifier>(ParamList?)"
 =C2=A0 =C2=A0,"Expression <- Sequence (OR Sequence)*"
 );

 IOW comma on the left side. I know it's not a style preference but
 actually a (unfortunate but needed) technique for avoiding bugs. :)

Yes, I use what I call "Haskell-comma" (I saw it first on Haskell
code). But in the previous sample, that's old code. Now, that should
only be;

enum code =3D grammar("
    Grammar <- S Definition+ EOI
    Definition <- RuleName Arrow Expression
    RuleName =C2=A0 <- Identifier>(ParamList?)
    Expression <- Sequence (OR Sequence)*
");

I see I didn't update bootstrap.d, I'll do that.


 So can you use this to actually parse D code, extract syntax trees and
 stuff like that? I'm clueless about parsing but it looks very neat.

I can partially parse D code, because I only wrote part of the D
grammar (see the pegged.examples.dgrammar module), just enough to
parse most of template constraints. I intend to complete it and see if
that floats.

Yes, it extracts syntax trees, at compile-time or runtime.

 Nice work!

Thanks!

Mar 10 2012

"Jay Norwood" <jayn prismnet.com> writes:


 Admittedly I have not heard of PEGs before, so I'm curious: Is 
 this powerful enough to parse a language such as C?

I've just read a few articles referenced from this page, and the 
second link was by someone who had done java 1.5, the second link
http://bford.info/packrat/
http://www.romanredz.se/papers/FI2007.pdf

It is interesting but that article left me with some questions 
about the implementation in order to make it useful for my needs.

I had done an experiment with mvel expression evaluation in java 
and gotten good improvements relative to homebrew expression 
evaluators.

However, the mvel expressions are missing the ability to express 
array operations clearly, which is something that is very clear 
in D, and my particular need is to enable the user to express 
array operations.

With this pegged embedded parser, it appears to me you could 
provide a group of  context symbols as part of a language 
definition, similar to providing a list of reserved words, so 
that the parsing of the user's expression would also validate the 
symbols used.

Also, I've been reading David Simcha's parallel_algorithm.d, here:
https://github.com/dsimcha/parallel_algorithm/blob/master/parallel_algorithm.d
and in the parallelArrayOp portion, he has presented a way to 
turn the D array expressions into code that is executed in 
parallel on multicore systems. That is something I'd want to use, 
but the manual lexing requirement is a bit clunky, and the rules 
are unclear to me,  so it seems to me a combination with the 
pegged parser could make that more accessible.
Thanks,
Jay

Mar 12 2012

"Jay Norwood" <jayn prismnet.com> writes:

On Tuesday, 13 March 2012 at 05:25:38 UTC, Jay Norwood wrote:

 Admittedly I have not heard of PEGs before, so I'm curious: Is 
 this powerful enough to parse a language such as C?

 I've just read a few articles referenced from this page, and 
 the second link was by someone who had done java 1.5, the 
 second link
 http://bford.info/packrat/
 http://www.romanredz.se/papers/FI2007.pdf

Also in the later paper he did a C parser, so I suppose that is 
the answer ...

http://www.romanredz.se/papers/FI2008.pdf

Mar 12 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, March 11, 2012 00:28:42 Philippe Sigaud wrote:
 Hello,
 
 I created a new Github project, Pegged, a Parsing Expression
 Grammar (PEG) generator in D.

Cool! PEGs aren't used all that much for language grammars - primarily because 
BNF and EBNF are more traditional I think - but they're definitely cool. And 
it's great to see a D implementation of a PEG parser.

- Jonathan M Davis

Mar 10 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Mar 11, 2012 at 08:30, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Sunday, March 11, 2012 00:28:42 Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression
 Grammar (PEG) generator in D.

 Cool! PEGs aren't used all that much for language grammars - primarily because
 BNF and EBNF are more traditional I think - but they're definitely cool. And
 it's great to see a D implementation of a PEG parser.

Is there an EBNF grammar for D somewhere, I mean outside the dlang docs?

I vaguely remember someone trying to write one.

Mar 10 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, March 11, 2012 08:39:47 Philippe Sigaud wrote:
 On Sun, Mar 11, 2012 at 08:30, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Sunday, March 11, 2012 00:28:42 Philippe Sigaud wrote:
 Hello,
 
 I created a new Github project, Pegged, a Parsing Expression
 Grammar (PEG) generator in D.

 
 Cool! PEGs aren't used all that much for language grammars - primarily
 because BNF and EBNF are more traditional I think - but they're
 definitely cool. And it's great to see a D implementation of a PEG
 parser.

 
 Is there an EBNF grammar for D somewhere, I mean outside the dlang docs?
 
 I vaguely remember someone trying to write one.

Someone may have been trying to write one, but I don't believe that there's an 
official one, and I recall someone complaining the the BNF grammar was 
inaccurate in a number of places, but that may have been fixed now, since 
Walter did a number of documentation fixes a few weeks back.

- Jonathan M Davis

Mar 10 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Mar 11, 2012 at 08:46, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 Is there an EBNF grammar for D somewhere, I mean outside the dlang docs?


 Someone may have been trying to write one, but I don't believe that there's an
 official one, and I recall someone complaining the the BNF grammar was
 inaccurate in a number of places, but that may have been fixed now, since
 Walter did a number of documentation fixes a few weeks back.

Hmm. I'll post on D.main and see.

Mar 10 2012

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:

On 11-03-2012 00:28, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax. From
 this grammar definition, a set of related parsers will be created, to be
 used at runtime or compile time.

 Usage
 -----

 To use Pegged, just call the `grammar` function with a PEG and mix it
 in. For example:


 import pegged.grammar;

 mixin(grammar("
 Expr <- Factor AddExpr*
 AddExpr <- ('+'/'-') Factor
 Factor <- Primary MulExpr*
 MulExpr <- ('*'/'/') Primary
 Primary <- Parens / Number / Variable / '-' Primary

 Parens <- '(' Expr ')'
 Number <~ [0-9]+
 Variable <- Identifier
 "));



 This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
 basic arithmetic expressions with operator precedence ('*' and '/' bind
 stronger than '+' or '-'). `Identifier` is a pre-defined parser
 recognizing your basic C-style identifier. Recursive or mutually
 recursive rules are OK (no left recursion for now).

 To use a parser, use the `.parse` method. It will return a parse tree
 containing the calls to the different rules:

 // Parsing at compile-time:
 enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

 pragma(msg, parseTree1.capture);
 writeln(parseTree1);

 // And at runtime too:
 auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
 assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



 Features
 --------

 * The complete set of PEG operators are implemented
 * Pegged can parse its input at compile time and generate a complete
 parse tree at compile time. In a word: compile-time string (read: D
 code) transformation and generation.
 * You can parse at runtime also, you lucky you.
 * Use a standard and readable PEG syntax as a DSL, not a bunch of
 templates that hide the parser in noise.
 * But you can use expression templates if you want, as parsers are all
 available as such. Pegged is implemented as an expression template, and
 what's good for the library writer is sure OK for the user too.
 * Some useful additional operators are there too: a way to discard
 matches (thus dumping them from the parse tree), to push captures on a
 stack, to accept matches that are equal to another match
 * Adding new parsers is easy.
 * Grammars are composable: you can put different
 `mixin(grammar(rules));` in a module and then grammars and rules can
 refer to one another. That way, you can have utility grammars providing
 their functionalities to other grammars.
 * That's why Pegged comes with some pre-defined grammars (JSON, etc).
 * Grammars can be dumped in a file to create a D module.

 More advanced features, outside the standard PEG perimeter are there to
 bring more power in the mix:

 * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
 previous rule defines a parametrized parser taking two other parsers
 (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
 * Named captures: any parser can be named with the `=` operator. The
 parse tree generated by the parser (so, also its matches) is delivered
 to the user in the output. Other parsers in the grammar see the named
 captures too.
 * Semantic actions can be added to any rule in a grammar. Once a rule
 has matched, its associated action is called on the rule output and
 passed as final result to other parsers further up the grammar. Do what
 you want to the parse tree. If the passed actions are delegates, they
 can access external variables.


 Philippe

Question: Are the generated parsers, AST nodes, etc classes or structs?

-- 
- Alex

Mar 11 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

alex:
 Question: Are the generated parsers, AST nodes, etc classes or structs?

They are structs. See:

https://github.com/PhilippeSigaud/Pegged/wiki/Parse-Trees

Mar 11 2012

"Tom Compton" <orgolone yahoo.com> writes:

On Sunday, 11 March 2012 at 12:40:34 UTC, Alex Rønne Petersen
wrote:

 Question: Are the generated parsers, AST nodes, etc classes or 
 structs?

Why is this important?

Aug 08 2013

kraybourne <stdin kraybourne.com> writes:

On 3/11/12 24:28 , Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.


 Philippe

Very cool!

Quick question, you mention the ability to opt-out of the 
space-insensitivity, where might one find this?

Thanks!

Mar 11 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 Quick question, you mention the ability to opt-out of the

space-insensitivity, where might one find this?

Yes, undocumented. Use the '>' operator.

You know, I introduced space-insensitivity recently, to simplify some rules
and it keeps biting me back.

For example

Line <- (!EOL .)* EOL

The catch is, the (!EOL .) sequence accepts spaces (so, line terminators)
between the !EOL and the .

Crap.

So, I keep writing

Line <- (!EOL > .)* EOL

And I'm more and more convinced that ws a bbad move on my part. Or, at
least, give the user a way to opt-out for an entire rule.

Mar 11 2012

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:

On 11-03-2012 00:28, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax. From
 this grammar definition, a set of related parsers will be created, to be
 used at runtime or compile time.

 Usage
 -----

 To use Pegged, just call the `grammar` function with a PEG and mix it
 in. For example:


 import pegged.grammar;

 mixin(grammar("
 Expr <- Factor AddExpr*
 AddExpr <- ('+'/'-') Factor
 Factor <- Primary MulExpr*
 MulExpr <- ('*'/'/') Primary
 Primary <- Parens / Number / Variable / '-' Primary

 Parens <- '(' Expr ')'
 Number <~ [0-9]+
 Variable <- Identifier
 "));



 This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
 basic arithmetic expressions with operator precedence ('*' and '/' bind
 stronger than '+' or '-'). `Identifier` is a pre-defined parser
 recognizing your basic C-style identifier. Recursive or mutually
 recursive rules are OK (no left recursion for now).

 To use a parser, use the `.parse` method. It will return a parse tree
 containing the calls to the different rules:

 // Parsing at compile-time:
 enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

 pragma(msg, parseTree1.capture);
 writeln(parseTree1);

 // And at runtime too:
 auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
 assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



 Features
 --------

 * The complete set of PEG operators are implemented
 * Pegged can parse its input at compile time and generate a complete
 parse tree at compile time. In a word: compile-time string (read: D
 code) transformation and generation.
 * You can parse at runtime also, you lucky you.
 * Use a standard and readable PEG syntax as a DSL, not a bunch of
 templates that hide the parser in noise.
 * But you can use expression templates if you want, as parsers are all
 available as such. Pegged is implemented as an expression template, and
 what's good for the library writer is sure OK for the user too.
 * Some useful additional operators are there too: a way to discard
 matches (thus dumping them from the parse tree), to push captures on a
 stack, to accept matches that are equal to another match
 * Adding new parsers is easy.
 * Grammars are composable: you can put different
 `mixin(grammar(rules));` in a module and then grammars and rules can
 refer to one another. That way, you can have utility grammars providing
 their functionalities to other grammars.
 * That's why Pegged comes with some pre-defined grammars (JSON, etc).
 * Grammars can be dumped in a file to create a D module.

 More advanced features, outside the standard PEG perimeter are there to
 bring more power in the mix:

 * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
 previous rule defines a parametrized parser taking two other parsers
 (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
 * Named captures: any parser can be named with the `=` operator. The
 parse tree generated by the parser (so, also its matches) is delivered
 to the user in the output. Other parsers in the grammar see the named
 captures too.
 * Semantic actions can be added to any rule in a grammar. Once a rule
 has matched, its associated action is called on the rule output and
 passed as final result to other parsers further up the grammar. Do what
 you want to the parse tree. If the passed actions are delegates, they
 can access external variables.


 Philippe

By the way, bootstrap.d seems to fail to build at the moment:

../pegged/utils/bootstrap.d(1433): found ':' when expecting ')' 
following template argument list
../pegged/utils/bootstrap.d(1433): members expected
../pegged/utils/bootstrap.d(1433): { } expected following aggregate 
declaration
../pegged/utils/bootstrap.d(1433): semicolon expected, not '!'
../pegged/utils/bootstrap.d(1433): Declaration expected, not '!'
../pegged/utils/bootstrap.d(1466): unrecognized declaration

-- 
- Alex

Mar 11 2012

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:

On 11-03-2012 16:02, Alex Rønne Petersen wrote:
 On 11-03-2012 00:28, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax. From
 this grammar definition, a set of related parsers will be created, to be
 used at runtime or compile time.

 Usage
 -----

 To use Pegged, just call the `grammar` function with a PEG and mix it
 in. For example:


 import pegged.grammar;

 mixin(grammar("
 Expr <- Factor AddExpr*
 AddExpr <- ('+'/'-') Factor
 Factor <- Primary MulExpr*
 MulExpr <- ('*'/'/') Primary
 Primary <- Parens / Number / Variable / '-' Primary

 Parens <- '(' Expr ')'
 Number <~ [0-9]+
 Variable <- Identifier
 "));



 This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
 basic arithmetic expressions with operator precedence ('*' and '/' bind
 stronger than '+' or '-'). `Identifier` is a pre-defined parser
 recognizing your basic C-style identifier. Recursive or mutually
 recursive rules are OK (no left recursion for now).

 To use a parser, use the `.parse` method. It will return a parse tree
 containing the calls to the different rules:

 // Parsing at compile-time:
 enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

 pragma(msg, parseTree1.capture);
 writeln(parseTree1);

 // And at runtime too:
 auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
 assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



 Features
 --------

 * The complete set of PEG operators are implemented
 * Pegged can parse its input at compile time and generate a complete
 parse tree at compile time. In a word: compile-time string (read: D
 code) transformation and generation.
 * You can parse at runtime also, you lucky you.
 * Use a standard and readable PEG syntax as a DSL, not a bunch of
 templates that hide the parser in noise.
 * But you can use expression templates if you want, as parsers are all
 available as such. Pegged is implemented as an expression template, and
 what's good for the library writer is sure OK for the user too.
 * Some useful additional operators are there too: a way to discard
 matches (thus dumping them from the parse tree), to push captures on a
 stack, to accept matches that are equal to another match
 * Adding new parsers is easy.
 * Grammars are composable: you can put different
 `mixin(grammar(rules));` in a module and then grammars and rules can
 refer to one another. That way, you can have utility grammars providing
 their functionalities to other grammars.
 * That's why Pegged comes with some pre-defined grammars (JSON, etc).
 * Grammars can be dumped in a file to create a D module.

 More advanced features, outside the standard PEG perimeter are there to
 bring more power in the mix:

 * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
 previous rule defines a parametrized parser taking two other parsers
 (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
 * Named captures: any parser can be named with the `=` operator. The
 parse tree generated by the parser (so, also its matches) is delivered
 to the user in the output. Other parsers in the grammar see the named
 captures too.
 * Semantic actions can be added to any rule in a grammar. Once a rule
 has matched, its associated action is called on the rule output and
 passed as final result to other parsers further up the grammar. Do what
 you want to the parse tree. If the passed actions are delegates, they
 can access external variables.


 Philippe

 By the way, bootstrap.d seems to fail to build at the moment:

 .../pegged/utils/bootstrap.d(1433): found ':' when expecting ')'
 following template argument list
 .../pegged/utils/bootstrap.d(1433): members expected
 .../pegged/utils/bootstrap.d(1433): { } expected following aggregate
 declaration
 .../pegged/utils/bootstrap.d(1433): semicolon expected, not '!'
 .../pegged/utils/bootstrap.d(1433): Declaration expected, not '!'
 .../pegged/utils/bootstrap.d(1466): unrecognized declaration

Also, I have sent a pull request to fix the build on 64-bit: 
https://github.com/PhilippeSigaud/Pegged/pull/1

-- 
- Alex

Mar 11 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 Also, I have sent a pull request to fix the build on 64-bit:

https://github.com/PhilippeSigaud/Pegged/pull/1

Merged, thanks!

Mar 11 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 By the way, bootstrap.d seems to fail to build at the moment:

 ../pegged/utils/bootstrap.d(1433): found ':' when expecting ')' following

template argument list
 ../pegged/utils/bootstrap.d(1433): members expected
 ../pegged/utils/bootstrap.d(1433): { } expected following aggregate

declaration
 ../pegged/utils/bootstrap.d(1433): semicolon expected, not '!'
 ../pegged/utils/bootstrap.d(1433): Declaration expected, not '!'
 ../pegged/utils/bootstrap.d(1466): unrecognized declaration

Hmm, it compiled for me a few hours ago. I'll see if I broke something
while pushing.

I'll also try to make the whole grammar-modification process easier. Since
users can modify Pegged own grammar, I might as well make that fluid and
easy to do.

I'll put the Pegged grammar as a string in a separate module and create a
function that does the rest: modify the string, it will recompile the
entire grammar for you.

Mar 11 2012

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:

On 11-03-2012 18:17, Philippe Sigaud wrote:
  > By the way, bootstrap.d seems to fail to build at the moment:
  >
  > ../pegged/utils/bootstrap.d(1433): found ':' when expecting ')'
 following template argument list
  > ../pegged/utils/bootstrap.d(1433): members expected
  > ../pegged/utils/bootstrap.d(1433): { } expected following aggregate
 declaration
  > ../pegged/utils/bootstrap.d(1433): semicolon expected, not '!'
  > ../pegged/utils/bootstrap.d(1433): Declaration expected, not '!'
  > ../pegged/utils/bootstrap.d(1466): unrecognized declaration

 Hmm, it compiled for me a few hours ago. I'll see if I broke something
 while pushing.

 I'll also try to make the whole grammar-modification process easier.
 Since users can modify Pegged own grammar, I might as well make that
 fluid and easy to do.

 I'll put the Pegged grammar as a string in a separate module and create
 a function that does the rest: modify the string, it will recompile the
 entire grammar for you.

Is bootstrap.d currently essential to actually use Pegged?

-- 
- Alex

Mar 11 2012

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:

On 11-03-2012 00:28, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax. From
 this grammar definition, a set of related parsers will be created, to be
 used at runtime or compile time.

 Usage
 -----

 To use Pegged, just call the `grammar` function with a PEG and mix it
 in. For example:


 import pegged.grammar;

 mixin(grammar("
 Expr <- Factor AddExpr*
 AddExpr <- ('+'/'-') Factor
 Factor <- Primary MulExpr*
 MulExpr <- ('*'/'/') Primary
 Primary <- Parens / Number / Variable / '-' Primary

 Parens <- '(' Expr ')'
 Number <~ [0-9]+
 Variable <- Identifier
 "));



 This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
 basic arithmetic expressions with operator precedence ('*' and '/' bind
 stronger than '+' or '-'). `Identifier` is a pre-defined parser
 recognizing your basic C-style identifier. Recursive or mutually
 recursive rules are OK (no left recursion for now).

 To use a parser, use the `.parse` method. It will return a parse tree
 containing the calls to the different rules:

 // Parsing at compile-time:
 enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

 pragma(msg, parseTree1.capture);
 writeln(parseTree1);

 // And at runtime too:
 auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
 assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



 Features
 --------

 * The complete set of PEG operators are implemented
 * Pegged can parse its input at compile time and generate a complete
 parse tree at compile time. In a word: compile-time string (read: D
 code) transformation and generation.
 * You can parse at runtime also, you lucky you.
 * Use a standard and readable PEG syntax as a DSL, not a bunch of
 templates that hide the parser in noise.
 * But you can use expression templates if you want, as parsers are all
 available as such. Pegged is implemented as an expression template, and
 what's good for the library writer is sure OK for the user too.
 * Some useful additional operators are there too: a way to discard
 matches (thus dumping them from the parse tree), to push captures on a
 stack, to accept matches that are equal to another match
 * Adding new parsers is easy.
 * Grammars are composable: you can put different
 `mixin(grammar(rules));` in a module and then grammars and rules can
 refer to one another. That way, you can have utility grammars providing
 their functionalities to other grammars.
 * That's why Pegged comes with some pre-defined grammars (JSON, etc).
 * Grammars can be dumped in a file to create a D module.

 More advanced features, outside the standard PEG perimeter are there to
 bring more power in the mix:

 * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
 previous rule defines a parametrized parser taking two other parsers
 (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
 * Named captures: any parser can be named with the `=` operator. The
 parse tree generated by the parser (so, also its matches) is delivered
 to the user in the output. Other parsers in the grammar see the named
 captures too.
 * Semantic actions can be added to any rule in a grammar. Once a rule
 has matched, its associated action is called on the rule output and
 passed as final result to other parsers further up the grammar. Do what
 you want to the parse tree. If the passed actions are delegates, they
 can access external variables.


 Philippe

Hm, since ' is used in the grammar of Pegged, how do I express it in my 
grammar spec? Is there a predefined rule for it, or is \' supposed to work?

-- 
- Alex

Mar 11 2012

=?UTF-8?B?QWxleCBSw7hubmUgUGV0ZXJzZW4=?= <xtzgzorex gmail.com> writes:

On 12-03-2012 02:27, Alex Rønne Petersen wrote:
 On 11-03-2012 00:28, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax. From
 this grammar definition, a set of related parsers will be created, to be
 used at runtime or compile time.

 Usage
 -----

 To use Pegged, just call the `grammar` function with a PEG and mix it
 in. For example:


 import pegged.grammar;

 mixin(grammar("
 Expr <- Factor AddExpr*
 AddExpr <- ('+'/'-') Factor
 Factor <- Primary MulExpr*
 MulExpr <- ('*'/'/') Primary
 Primary <- Parens / Number / Variable / '-' Primary

 Parens <- '(' Expr ')'
 Number <~ [0-9]+
 Variable <- Identifier
 "));



 This creates the `Expr`, `AddExpr`, `Factor` (and so on) parsers for
 basic arithmetic expressions with operator precedence ('*' and '/' bind
 stronger than '+' or '-'). `Identifier` is a pre-defined parser
 recognizing your basic C-style identifier. Recursive or mutually
 recursive rules are OK (no left recursion for now).

 To use a parser, use the `.parse` method. It will return a parse tree
 containing the calls to the different rules:

 // Parsing at compile-time:
 enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

 pragma(msg, parseTree1.capture);
 writeln(parseTree1);

 // And at runtime too:
 auto parseTree2 = Expr.parse(" 0 + 123 - 456 ");
 assert(parseTree2.capture == ["0", "+", "123", "-", "456"]);



 Features
 --------

 * The complete set of PEG operators are implemented
 * Pegged can parse its input at compile time and generate a complete
 parse tree at compile time. In a word: compile-time string (read: D
 code) transformation and generation.
 * You can parse at runtime also, you lucky you.
 * Use a standard and readable PEG syntax as a DSL, not a bunch of
 templates that hide the parser in noise.
 * But you can use expression templates if you want, as parsers are all
 available as such. Pegged is implemented as an expression template, and
 what's good for the library writer is sure OK for the user too.
 * Some useful additional operators are there too: a way to discard
 matches (thus dumping them from the parse tree), to push captures on a
 stack, to accept matches that are equal to another match
 * Adding new parsers is easy.
 * Grammars are composable: you can put different
 `mixin(grammar(rules));` in a module and then grammars and rules can
 refer to one another. That way, you can have utility grammars providing
 their functionalities to other grammars.
 * That's why Pegged comes with some pre-defined grammars (JSON, etc).
 * Grammars can be dumped in a file to create a D module.

 More advanced features, outside the standard PEG perimeter are there to
 bring more power in the mix:

 * Parametrized rules: `List(E, Sep) <- E (Sep E)*` is possible. The
 previous rule defines a parametrized parser taking two other parsers
 (namely, `E` and `Sep`) to match a `Sep`-separated list of `E`'s.
 * Named captures: any parser can be named with the `=` operator. The
 parse tree generated by the parser (so, also its matches) is delivered
 to the user in the output. Other parsers in the grammar see the named
 captures too.
 * Semantic actions can be added to any rule in a grammar. Once a rule
 has matched, its associated action is called on the rule output and
 passed as final result to other parsers further up the grammar. Do what
 you want to the parse tree. If the passed actions are delegates, they
 can access external variables.


 Philippe

 Hm, since ' is used in the grammar of Pegged, how do I express it in my
 grammar spec? Is there a predefined rule for it, or is \' supposed to work?

Ah, Quote it is. I also noticed there is DoubleQuote. Is " used for 
anything in the Pegged grammar?

-- 
- Alex

Mar 11 2012

bls <bizprac orange.fr> writes:

On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

Just WOW!

Nice to have on your WIKI would be a EBNF to PEG sheet.

Wirth EBNF     	Pegged
A = BC.		A <- B C
A = B|C.	A <- C / C
A = [B].	A <- B?
A = {B}.	A <- B*


Having EBNF expressed in Pegged ...

EBNF <- Procuction+
Production <- Identifier '=' Expression '.'
Expression <- Term ( '|' Term)*
Term <- Factor Factor*
Factor <- 	Identifier /
		Literal /
		'[' Expression ']'
		 / '{' Expression }'/
		'(' Expression ')'
lowerCase  <- [a-z]
upperCase  <- [A-Z]
Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*
Literal <- ("'" .+ "'" /  '"' .+ '"')

Due to the fact that EBNF can be expressed in EBNF it should be possible 
to parse an arbitrary EBNF file and generate PEG output.
What do you think ?

BTW.. Is my EBNF PEG description correct ? TIA Bjoern

Mar 12 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 12.03.2012 16:43, bls wrote:
 On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 Just WOW!

 Nice to have on your WIKI would be a EBNF to PEG sheet.

 Wirth EBNF Pegged
 A = BC. A <- B C
 A = B|C. A <- C / C

Maybe A <- B / C. And even then it's not exactly equivalent if the 
grammar was ambiguous.
Imagine: B <- a, C <- aa

-- 
Dmitry Olshansky

Mar 13 2012

bls <bizprac orange.fr> writes:

On 03/13/2012 04:28 AM, Dmitry Olshansky wrote:
 On 12.03.2012 16:43, bls wrote:
 On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 Just WOW!

 Nice to have on your WIKI would be a EBNF to PEG sheet.

 Wirth EBNF Pegged
 A = BC. A <- B C
 A = B|C. A <- C / C

 Maybe A <- B / C. And even then it's not exactly equivalent if the
 grammar was ambiguous.
 Imagine: B <- a, C <- aa

PEG is pretty new to me. Can you elaborate a bit ?

My mistake.. cleaned up stuff..

Pegged				Wirth EBNF

Sequence
A <- B C			A = BC.

B or C
A <- B / C			A = B|C.

Zero or one B
A <- B?				A = [B].

Zero or more Bs
A <- B*				A = {B}.

One or more Bs
A <- B+				Not available

PEG description of EBNF

EBNF <- Procuction+
Production <- Identifier '=' Expression '.'
Expression <- Term ( '|' Term)*
Term <- Factor Factor*
Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}' 
/ '(' Expression ')'
lowerCase  <- [a-z]
upperCase  <- [A-Z]
Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*
Literal <- ("'" .+ "'") /  ('"' .+ '"')

Still not sure if this is correct. Especially :
Term <- Factor Factor*


Another thing I never really understand is the "production" order, In 
other words : Why not top down ..
Start :
lowerCase  <- [a-z]
upperCase  <- [A-Z]
Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*
....

End :
EBNF <- Procuction+

where End is Root..

TIA, Bjoern

Mar 12 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 12.03.2012 17:45, bls wrote:
 On 03/13/2012 04:28 AM, Dmitry Olshansky wrote:
 On 12.03.2012 16:43, bls wrote:
 On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 Just WOW!

 Nice to have on your WIKI would be a EBNF to PEG sheet.

 Wirth EBNF Pegged
 A = BC. A <- B C
 A = B|C. A <- C / C

 Maybe A <- B / C. And even then it's not exactly equivalent if the
 grammar was ambiguous.
 Imagine: B <- a, C <- aa

 PEG is pretty new to me. Can you elaborate a bit ?

PEG defines order of alternatives, that is pretty much like a top-down 
recursive descent parser would parse it. Alternatives are tried from 
left to right, if first one fails, it tries next and so on.
In an example I give B is always picked first and so C is never ever 
looked at.

Somewhat less artificial example:
Literal <- IntL| FloatL
FloatL <- [0-9]+(.[0-9]+)?
IntL <- [0-9]+

If you change it to: Literal <- FloatL| IntL then integer literals would 
get parsed as floating point.


 My mistake.. cleaned up stuff..

 Pegged Wirth EBNF

 Sequence
 A <- B C A = BC.

 B or C
 A <- B / C A = B|C.

 Zero or one B
 A <- B? A = [B].

 Zero or more Bs
 A <- B* A = {B}.

 One or more Bs
 A <- B+ Not available

 PEG description of EBNF

 EBNF <- Procuction+
 Production <- Identifier '=' Expression '.'
 Expression <- Term ( '|' Term)*
 Term <- Factor Factor*
 Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}'
 / '(' Expression ')'
 lowerCase <- [a-z]
 upperCase <- [A-Z]
 Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

Why not:
Identifier <- [a-zA-Z]+

 Literal <- ("'" .+ "'") / ('"' .+ '"')

This needs escaping. Plain '.+' in pattern asks for trouble 99% of time.

 Still not sure if this is correct. Especially :
 Term <- Factor Factor*


 Another thing I never really understand is the "production" order, In
 other words : Why not top down ..
 Start :
 lowerCase <- [a-z]
 upperCase <- [A-Z]
 Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

 ....

 End :
 EBNF <- Procuction+

 where End is Root..

In fact grammars are usually devised the other way around, e.g.
Start:
  Program -> ...
Ehm... what the whole program is exactly ? Ok, let it be Declaration* 
for now. What kind of declarations do we have? ... and so on. Latter 
grammars get tweaked and extended numerous times.

At any rate production order has no effect on the grammar, it's still 
the same. The only thing of importance is what non-terminal considered 
final (or start if you are LL-centric).

 TIA, Bjoern


-- 
Dmitry Olshansky

Mar 13 2012

=?ISO-8859-1?Q?Alex_R=F8nne_Petersen?= <xtzgzorex gmail.com> writes:

On 13-03-2012 17:17, Dmitry Olshansky wrote:
 On 12.03.2012 17:45, bls wrote:
 On 03/13/2012 04:28 AM, Dmitry Olshansky wrote:
 On 12.03.2012 16:43, bls wrote:
 On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 Just WOW!

 Nice to have on your WIKI would be a EBNF to PEG sheet.

 Wirth EBNF Pegged
 A = BC. A <- B C
 A = B|C. A <- C / C

 Maybe A <- B / C. And even then it's not exactly equivalent if the
 grammar was ambiguous.
 Imagine: B <- a, C <- aa

 PEG is pretty new to me. Can you elaborate a bit ?

 PEG defines order of alternatives, that is pretty much like a top-down
 recursive descent parser would parse it. Alternatives are tried from
 left to right, if first one fails, it tries next and so on.
 In an example I give B is always picked first and so C is never ever
 looked at.

 Somewhat less artificial example:
 Literal <- IntL| FloatL
 FloatL <- [0-9]+(.[0-9]+)?
 IntL <- [0-9]+

 If you change it to: Literal <- FloatL| IntL then integer literals would
 get parsed as floating point.


 My mistake.. cleaned up stuff..

 Pegged Wirth EBNF

 Sequence
 A <- B C A = BC.

 B or C
 A <- B / C A = B|C.

 Zero or one B
 A <- B? A = [B].

 Zero or more Bs
 A <- B* A = {B}.

 One or more Bs
 A <- B+ Not available

 PEG description of EBNF

 EBNF <- Procuction+
 Production <- Identifier '=' Expression '.'
 Expression <- Term ( '|' Term)*
 Term <- Factor Factor*
 Factor <- Identifier / Literal / '[' Expression ']' / '{' Expression '}'
 / '(' Expression ')'
 lowerCase <- [a-z]
 upperCase <- [A-Z]
 Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

 Why not:
 Identifier <- [a-zA-Z]+

That was an illustrative example from the Pegged docs. But yeah, you 
should just use a range; reads nicer.

 Literal <- ("'" .+ "'") / ('"' .+ '"')

 This needs escaping. Plain '.+' in pattern asks for trouble 99% of time.

 Still not sure if this is correct. Especially :
 Term <- Factor Factor*


 Another thing I never really understand is the "production" order, In
 other words : Why not top down ..
 Start :
 lowerCase <- [a-z]
 upperCase <- [A-Z]
 Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*

 ....

 End :
 EBNF <- Procuction+

 where End is Root..

 In fact grammars are usually devised the other way around, e.g.
 Start:
 Program -> ...
 Ehm... what the whole program is exactly ? Ok, let it be Declaration*
 for now. What kind of declarations do we have? ... and so on. Latter
 grammars get tweaked and extended numerous times.

 At any rate production order has no effect on the grammar, it's still
 the same. The only thing of importance is what non-terminal considered
 final (or start if you are LL-centric).

 TIA, Bjoern



-- 
- Alex

Mar 13 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Tue, Mar 13, 2012 at 18:05, Alex R=C3=B8nne Petersen <xtzgzorex gmail.co=
m> wrote:

 lowerCase <- [a-z]
 upperCase <- [A-Z]
 Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*


 Why not:
 Identifier <- [a-zA-Z]+


 That was an illustrative example from the Pegged docs. But yeah, you shou=

ld
 just use a range; reads nicer.

The docs are for teaching PEG :) (btw, it's the docs describe C-like
identifiers, that's why I chose a longer approach)
It's always this 'tension', between inlining and refactoring.
[a-zA-Z]+ is shorter and more readable. But If you decide to extend
your grammar to UTF-32, it'd be easier to just change the 'letter'
rule.

Mar 13 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Tue, Mar 13, 2012 at 17:17, Dmitry Olshansky <dmitry.olsh gmail.com> wro=
te:

 PEG defines order of alternatives, that is pretty much like a top-down
 recursive descent parser would parse it. Alternatives are tried from left=

 to
 right, if first one fails, it tries next and so on.

Yes.

 In an example I give B is always picked first and so C is never ever look=

ed
 at.

That's one of the caveats on PEG. That and greedy operators.

'a*a' never succeeds because 'a*' consumes all the available a's.


 Literal <- ("'" .+ "'") / ('"' .+ '"')

 This needs escaping. Plain '.+' in pattern asks for trouble 99% of time.

This is an example were PEG would munch anything till the end, and
fail (since """ is not found in an empty string).
The standard PEG way to do that is

(!EndMarker .)+ EndMarker

It's a common enough pattern I tend to define a parameterized rule for that=
:

Until(Expr) <- (!Expr .)* Expr


Gotta love parametrized rules! 15' to code, a neverending stream of joy.


 In fact grammars are usually devised the other way around, e.g.
 Start:
 =C2=A0Program -> ...
 Ehm... what the whole program is exactly ? Ok, let it be Declaration* for
 now. What kind of declarations do we have? ... and so on. Latter grammars
 get tweaked and extended numerous times.

 At any rate production order has no effect on the grammar, it's still the
 same. The only thing of importance is what non-terminal considered final =

(or
 start if you are LL-centric).

Yes. The PEG standard seem to be that the first rule is the start
(root) rule. Pegged let you decide which rule you use for a start. The
rest is automatic.
I might change that.

Mar 13 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 14.03.2012 2:49, Philippe Sigaud wrote:
 On Tue, Mar 13, 2012 at 17:17, Dmitry Olshansky<dmitry.olsh gmail.com>  wrote:

 PEG defines order of alternatives, that is pretty much like a top-down
 recursive descent parser would parse it. Alternatives are tried from left to
 right, if first one fails, it tries next and so on.

 Yes.

 In an example I give B is always picked first and so C is never ever looked
 at.

 That's one of the caveats on PEG. That and greedy operators.

 'a*a' never succeeds because 'a*' consumes all the available a's.

Hey, wait, I thought there has to be backtrack here, i.e. when second 
'a' finally fails?

 Literal<- ("'" .+ "'") / ('"' .+ '"')

 This needs escaping. Plain '.+' in pattern asks for trouble 99% of time.

 This is an example were PEG would munch anything till the end, and
 fail (since """ is not found in an empty string).
 The standard PEG way to do that is

 (!EndMarker .)+ EndMarker

 It's a common enough pattern I tend to define a parameterized rule for that:

 Until(Expr)<- (!Expr .)* Expr


 Gotta love parametrized rules! 15' to code, a neverending stream of joy.

Nice!

 In fact grammars are usually devised the other way around, e.g.
 Start:
   Program ->  ...
 Ehm... what the whole program is exactly ? Ok, let it be Declaration* for
 now. What kind of declarations do we have? ... and so on. Latter grammars
 get tweaked and extended numerous times.

 At any rate production order has no effect on the grammar, it's still the
 same. The only thing of importance is what non-terminal considered final (or
 start if you are LL-centric).

 Yes. The PEG standard seem to be that the first rule is the start
 (root) rule. Pegged let you decide which rule you use for a start. The
 rest is automatic.
 I might change that.


-- 
Dmitry Olshansky

Mar 14 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Wed, Mar 14, 2012 at 10:48, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 That's one of the caveats on PEG. That and greedy operators.

 'a*a' never succeeds because 'a*' consumes all the available a's.

 Hey, wait, I thought there has to be backtrack here, i.e. when second 'a'
 finally fails?

PEG only do local backtracking in 'or' choices, not in sequences. I
think that's because the original author was dealing with packrat
parsing and its O(input-size) guarantee.

I remember trying compile-time backtracking in another similar library
in D 1-2 years ago and getting lots of pb. I might add that in Pegged,
but I don't know the consequences. How do you do that in std.regex?

(nice article btw, I learnt some about regexes)

Mar 16 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 17.03.2012 10:59, Philippe Sigaud wrote:
 On Wed, Mar 14, 2012 at 10:48, Dmitry Olshansky<dmitry.olsh gmail.com>  wrote:

 That's one of the caveats on PEG. That and greedy operators.

 'a*a' never succeeds because 'a*' consumes all the available a's.

 Hey, wait, I thought there has to be backtrack here, i.e. when second 'a'
 finally fails?

 PEG only do local backtracking in 'or' choices, not in sequences. I
 think that's because the original author was dealing with packrat
 parsing and its O(input-size) guarantee.


Ok, let's agree on fact that semantically
a* is :
As <- a As / a

and a*? is this:
As <- a / a As

Now that's local ;)

 I remember trying compile-time backtracking in another similar library
 in D 1-2 years ago and getting lots of pb. I might add that in Pegged,
 but I don't know the consequences. How do you do that in std.regex?


There are two fundamental ways to get around the problem, the easiest
to implement is to use a stack to record a position in input + number of 
alternative/production (I call it a PC - program counter) every time 
there is a "fork" like that. Then when the current path ultimately fails 
pop stack, reset input and go over again until you either match or empty 
the stack. That's how to avoid deeeep recursion that happens here.

The problematic thing now is combinatorial explosion of
a* a* a* .... //simplified, but you get the idea

The trick to avoid this sort of crap is to
1) agree that whatever matches first has higher priority (left-most 
match) that's a simple disambiguation rule.
2) record what positions + pc you place on stack e.g. use a set, or in 
fact, a bitmap to push every unique combination  (pc,index) only once.

Voila! Now you have a linear worst-case algorithm with (M*N) complexity 
where M is number of all possible PCs, and N is number of all possible 
indexes in input.

Now if we put aside all crap talk about mathematical purity and monads, 
and you name it, a packrat is essentially this, a dynamic programming 
trick applied to recursive top-down parsing. The trick could be extended 
even more to avoid extra checks on input characters, but the essence is 
this "memoization".

 (nice article btw, I learnt some about regexes)

Thanks, I hope it makes them more popular.
Might as well keep me busy fixing bugs :)

-- 
Dmitry Olshansky

Mar 17 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sat, Mar 17, 2012 at 10:09, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 Ok, let's agree on fact that semantically
 a* is :
 As <- a As / a

 and a*? is this:
 As <- a / a As

 Now that's local ;)

It's local, yes. But the pb is with   Expr <- A* B C D, when D fails.
PEG sequences don't backtrack.


 I remember trying compile-time backtracking in another similar library
 in D 1-2 years ago and getting lots of pb. I might add that in Pegged,
 but I don't know the consequences. How do you do that in std.regex?



 There are two fundamental ways to get around the problem

(snip)

Thanks for the explanations. Does std.regex implement this?


 Now if we put aside all crap talk about mathematical purity and monads, and
 you name it, a packrat is essentially this, a dynamic programming trick
 applied to recursive top-down parsing. The trick could be extended even more
 to avoid extra checks on input characters, but the essence is this
 "memoization".

I plan to store indices anyway. I might add this in the future. A read
something on using PEGs to parse Java and C and there was no need to
do a complete memoization: storing the last parse result as a
temporary seemed to be enough.

 (nice article btw, I learnt some about regexes)


 Thanks, I hope it makes them more popular.
 Might as well keep me busy fixing bugs :)

As people said on Reddit, you should make more whooping sounds about
CT-regex. That was what wowed me and pushed me to tackle CT-parsing
again.

Mar 17 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 17.03.2012 18:11, Philippe Sigaud wrote:
On Sat, Mar 17, 2012 at 10:09, Dmitry Olshansky<dmitry.olsh gmail.com> wrote:

Ok, let's agree on fact that semantically
a* is :
As<- a As / a

and a*? is this:
As<- a / a As

Now that's local ;)

It's local, yes. But the pb is with Expr<- A* B C D, when D fails.
PEG sequences don't backtrack.

I'd argue they do. As I see it as:
Expr <- As B C D / B C D
As <- A / A As
(or use an epsilon production for As, is it allowed in pegged ?)

I remember trying compile-time backtracking in another similar library
in D 1-2 years ago and getting lots of pb. I might add that in Pegged,
but I don't know the consequences. How do you do that in std.regex?

There are two fundamental ways to get around the problem

(snip)

Thanks for the explanations. Does std.regex implement this?

It does the second way that I haven't told about, and had hard time to
implement in C-T :)
And the simple version of what I presented (i.e. stack stuff) is used
in CT-regex and as fallback in R-T. The problem with doing a bitmap
memoization is that regex often used to _search_ things in large inputs.
Some compromise and/or thresholds have to be estimated and found.
Parsing on the contrary guaranties that the whole input is used or in
error, hence bitmap is not wasted.

Now if we put aside all crap talk about mathematical purity and monads, and
you name it, a packrat is essentially this, a dynamic programming trick
applied to recursive top-down parsing. The trick could be extended even more
to avoid extra checks on input characters, but the essence is this
"memoization".

I plan to store indices anyway. I might add this in the future. A read
something on using PEGs to parse Java and C and there was no need to
do a complete memoization: storing the last parse result as a
temporary seemed to be enough.

Well even straight no-memorization is somehow enough, it's just a way to
ensure no extra work is done. C & Java are not much of backtracky grammars.

(nice article btw, I learnt some about regexes)

Thanks, I hope it makes them more popular.
Might as well keep me busy fixing bugs :)

As people said on Reddit, you should make more whooping sounds about
CT-regex. That was what wowed me and pushed me to tackle CT-parsing
again.

It's just I'm also well aware of how much luck it (still) takes to fly ;)
The asserts that compare C-T vs R-T go off too often for my taste, other
then this the memory usage during compile is too high.
I recall once dmd had GC re-enabled for brief period, dmd used no more
then ~64Mb doing real crazy ct-regex stuff.

OK, back to C-T parsing, I have one crazy idea that I can't get away
from - add operator precedence grammar into the mix. From what I observe
it should integrate into PEG smoothly. Then it would make
"military-grade" hybrid that uses operator precedence parsing of
expressions and the like. Few more tricks and it may beat some
existing parser generators.

See this post, where I tried to describe that idea early on:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=159562

I might catch spare time to go about adding it myself, the only tricky
thing is to embed plain semantic actions, as AST generation would be
more or less the same.

--
Dmitry Olshansky

Mar 17 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sat, Mar 17, 2012 at 15:48, Dmitry Olshansky <dmitry.olsh gmail.com> wro=
te:

 PEG sequences don't backtrack.


 I'd argue they do. As I see it as:
 Expr <- As B C D / B C D
 As <- A / A As

That's what people doing Regex-to-PEG translations do, yes. But it's
not the spontaneous behavior of A* B C D in PEG.

But that means I could add a switch to transform the expressions that way.


 (or use an epsilon production for As, is it allowed in pegged ?)

I called it 'Eps', it's a predefined parser that always matches and
consumes nothing.
I used the greek epsilon at the beginning (=CE=B5) but thought that many
people would shout at this :)


 OK, back to C-T parsing, I have one crazy idea that I can't get away from=

 -
 add operator precedence grammar into the mix. From what I observe it shou=

ld
 integrate into PEG smoothly. Then it would make "military-grade" hybrid t=

hat
 uses operator precedence parsing of expressions and the like. Few more
 tricks and it may beat some
 existing parser generators.

 See this post, where I tried to describe that idea early on:
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=3Ddigitalmars=

.D&article_id=3D159562

I remember reading this. But I don't feel I'm up to it for now.


 I might catch spare time to go about adding it myself, the only tricky th=

ing
 is to embed plain semantic actions, as AST generation would be more or le=

ss
 the same.

Cool!

Mar 17 2012

=?ISO-8859-1?Q?Alex_R=F8nne_Petersen?= <xtzgzorex gmail.com> writes:

On 12-03-2012 13:43, bls wrote:
 On 03/10/2012 03:28 PM, Philippe Sigaud wrote:
 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 Just WOW!

 Nice to have on your WIKI would be a EBNF to PEG sheet.

 Wirth EBNF Pegged
 A = BC. A <- B C
 A = B|C. A <- C / C
 A = [B]. A <- B?
 A = {B}. A <- B*


 Having EBNF expressed in Pegged ...

 EBNF <- Procuction+
 Production <- Identifier '=' Expression '.'
 Expression <- Term ( '|' Term)*
 Term <- Factor Factor*
 Factor <- Identifier /
 Literal /
 '[' Expression ']'
 / '{' Expression }'/
 '(' Expression ')'
 lowerCase <- [a-z]
 upperCase <- [A-Z]
 Identifier <- (lowerCase / upperCase) (lowerCase / upperCase)*
 Literal <- ("'" .+ "'" / '"' .+ '"')

 Due to the fact that EBNF can be expressed in EBNF it should be possible
 to parse an arbitrary EBNF file and generate PEG output.
 What do you think ?

 BTW.. Is my EBNF PEG description correct ? TIA Bjoern

The thing is, PEG cannot represent an ambiguous grammar. It is fully 
ordered, so it's just plain impossible.

That's not to say you can't turn an EBNF grammar into PEG, but it won't 
necessarily be useful in all cases.

-- 
- Alex

Mar 13 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Mon, Mar 12, 2012 at 13:43, bls <bizprac orange.fr> wrote:

 Just WOW!

Thanks! Don't be too excited, it's still quite slow as a parser. But
that is a fun project :)

 Nice to have on your WIKI would be a EBNF to PEG sheet.

 Wirth EBNF =C2=A0 =C2=A0 =C2=A0Pegged
 A =3D BC. =C2=A0 =C2=A0 =C2=A0 =C2=A0 A <- B C
 A =3D B|C. =C2=A0 =C2=A0 =C2=A0 =C2=A0A <- C / C
 A =3D [B]. =C2=A0 =C2=A0 =C2=A0 =C2=A0A <- B?
 A =3D {B}. =C2=A0 =C2=A0 =C2=A0 =C2=A0A <- B*

fact is, I don't know EBNF that much. I basically learned everything I
know about parsing or grammars while coding Pegged in February :) I
probably made every mistakes in the book.

Hey, it's a github public wiki, I guess you can create a new page?

Mar 13 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

I am impressed. That's a really nice showcase for the D compile 
time features.

Can I use PEG to parse languages like python and haskell where 
indention matters without preprocessing?

Will you make it work with input ranges of dchar? So that I can 
easily plug in some preprocessing steps?

Mar 13 2012

"Jay Norwood" <jayn prismnet.com> writes:

On Saturday, 10 March 2012 at 23:28:42 UTC, Philippe Sigaud wrote:
 * Grammars can be dumped in a file to create a D module.


  In reading the D spec I've seen a few instance where there are 
infered items such as auto for variables and various parameters 
in templates.  I presume your D grammar will have to have some 
rules to be able to infer these as well.

It seems like it would not be a big step to output what are the 
infered proper statements as the .capture output.  Is that 
correct?  I think that would be helpful in some cases to be able 
to view the fully expanded expression.

Mar 14 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 03/15/2012 12:09 AM, Jay Norwood wrote:
 On Saturday, 10 March 2012 at 23:28:42 UTC, Philippe Sigaud wrote:
 * Grammars can be dumped in a file to create a D module.


 In reading the D spec I've seen a few instance where there are infered
 items such as auto for variables and various parameters in templates. I
 presume your D grammar will have to have some rules to be able to infer
 these as well.

 It seems like it would not be a big step to output what are the infered
 proper statements as the .capture output. Is that correct? I think that
 would be helpful in some cases to be able to view the fully expanded
 expression.

This is not what a parser does. Resolving symbols and assigning types to 
expressions is part of semantic analysis.

Mar 15 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Sun, 11 Mar 2012 00:28:42 +0100, Philippe Sigaud  
<philippe.sigaud gmail.com> wrote:

 Hello,

 I created a new Github project, Pegged, a Parsing Expression Grammar  
 (PEG) generator in D.

 https://github.com/PhilippeSigaud/Pegged

 docs: https://github.com/PhilippeSigaud/Pegged/wiki

 PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar

 The idea is to give the generator a PEG with the standard syntax.  From  
 this grammar definition, a set of related parsers will be created, to be  
 used at runtime or compile time.

Just wanted to draw your attention on a pull request of mine.
https://github.com/D-Programming-Language/dmd/pull/426

Using an added option (-M|Mf|Md) dmd will gather all mixin
strings into a file. This is very helpful for better error
messages and debugging.

Mar 22 2012

"Elie Morisse" <syniurge gmail.com> writes:

Hi Philippe,

I wrote a Kate(the official text editor of KDE, also used in 
KDevelop) syntax highlighting file for Pegged:

http://www.homo-nebulus.fr/dlang/pegged.xml

Screenshot:
http://imgur.com/sb3jFqs.png

Of course for the syntax highlighting to work the grammar must be 
in a separate file and included like this in a .d file:

   mixin(grammar(import("Tales.pgd")));

Aug 07 2013

D Programming

C/C++ Programming

Other

digitalmars.D.announce - Pegged, a Parsing Expression Grammar (PEG) generator in D