digitalmars.D.learn - Writing a Parser
- Dan <murpsoft hotmail.com> Jan 04 2008
- "Jarrett Billingsley" <kb3ctd2 yahoo.com> Jan 04 2008
- Alan Knowles <alan akbkhome.com> Jan 04 2008
- Jascha Wetzel <firstname mainia.de> Jan 04 2008
- 0ffh <frank youknow.what.todo.interNETz> Jan 05 2008
- Alan Knowles <alan akbkhome.com> Jan 08 2008
- Jascha Wetzel <firstname mainia.de> Jan 09 2008
- Dan <murpsoft hotmail.com> Jan 09 2008
- Jascha Wetzel <firstname mainia.de> Jan 10 2008
- Dan <murpsoft hotmail.com> Jan 12 2008
- Jascha Wetzel <firstname mainia.de> Jan 12 2008
- Dan <murpsoft hotmail.com> Jan 06 2008
- Ap Lee <aplee primus.ca> Jan 09 2008
- Ap Lee <aplee primus.ca> Jan 09 2008
- Frank Benoit <keinfarbton googlemail.com> Jan 09 2008
- Alan Knowles <alan akbkhome.com> Jan 09 2008
I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry.
At the moment, I'm trying recursive descent parsing.
The problem is that I've realized I'm duplicating huge volumes of code to cope
with the tristate decision of { unexpected, allow, require } for any given
token.
For example, to consume a for loop, you consume something similar to
/for\s*\((.*?)\)\s*\{(.*?)\}/
I have it doing that, but my soul feels heavy with the masses of looped
switches it's doing. Is there any way to ease the pain?
Regards,
Dan
Jan 04 2008
"Dan" <murpsoft hotmail.com> wrote in message news:flmtrv$2jrn$1 digitalmars.com...I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry. At the moment, I'm trying recursive descent parsing. The problem is that I've realized I'm duplicating huge volumes of code to cope with the tristate decision of { unexpected, allow, require } for any given token. For example, to consume a for loop, you consume something similar to /for\s*\((.*?)\)\s*\{(.*?)\}/ I have it doing that, but my soul feels heavy with the masses of looped switches it's doing. Is there any way to ease the pain?
Separate tokenization and syntax parsing? It makes things a hell of a lot easier. You don't even necessarily have to tokenize the source entirely before parsing; just have a lexer which lexes tokens out of the source on demand. The syntax parsing is then unencumbered from dealing with the raw source and just has to do stuff like "expect 'for', expect left-paren, expect (your condition), expect right-paren" etc.
Jan 04 2008
Yes, normally when doing OO patterns for this
class Parser extends Tokenizer {...}
either that or
class Parser {
Tokenizer reader;
parser code....
}
Recommended reading
- the parser in DMD's source (not DMDScript) is quite nice - it's a hand
coded one and is quite easy to understand.
This is interesting from the perspective of using function pointers to
do pattern testing (it's horribly borked from the perspective that it's
really the tokenizer..)
http://www.dsource.org/projects/leds/browser/trunk/src/language/php/Parser.d
If you dont want a hand coded parser, have a look at csLex - It's what
mono used, and is pretty trivial to modify, and retarget to generate
other language code (eg. D code...)
From someone who's written far too many parsers ;)
Regards
Alan
Jarrett Billingsley wrote:
"Dan" <murpsoft hotmail.com> wrote in message
news:flmtrv$2jrn$1 digitalmars.com...
I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry.
At the moment, I'm trying recursive descent parsing.
The problem is that I've realized I'm duplicating huge volumes of code to
cope with the tristate decision of { unexpected, allow, require } for any
given token.
For example, to consume a for loop, you consume something similar to
/for\s*\((.*?)\)\s*\{(.*?)\}/
I have it doing that, but my soul feels heavy with the masses of looped
switches it's doing. Is there any way to ease the pain?
Separate tokenization and syntax parsing? It makes things a hell of a lot
easier. You don't even necessarily have to tokenize the source entirely
before parsing; just have a lexer which lexes tokens out of the source on
demand. The syntax parsing is then unencumbered from dealing with the raw
source and just has to do stuff like "expect 'for', expect left-paren,
expect (your condition), expect right-paren" etc.
Jan 04 2008
Dan wrote:I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry. At the moment, I'm trying recursive descent parsing. The problem is that I've realized I'm duplicating huge volumes of code to cope with the tristate decision of { unexpected, allow, require } for any given token. For example, to consume a for loop, you consume something similar to /for\s*\((.*?)\)\s*\{(.*?)\}/ I have it doing that, but my soul feels heavy with the masses of looped switches it's doing. Is there any way to ease the pain?
a parser generator :) writing a parser or scanner manually is a bit like writing any program in assembler - tedious, error-prone and not well maintainable. there's a lot of stuff in a parser that can be automatically generated. even if you want to write the parser all by yourself, i'd rather suggest you write a simple parser generator to do that tedious part for you.
Jan 04 2008
Jascha Wetzel wrote:Dan wrote:I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry. [...] I have it doing that, but my soul feels heavy with the masses of looped switches it's doing. Is there any way to ease the pain?
a parser generator :) [...]
Or a parsing combinator library. :) And it's always nice to roll your own, if you have the extra time. Not only to have, but to know. regards, frank
Jan 05 2008
I did spend some time looking at aPaGeD for this yesteday - here's some feedback that may help (and if you can answer some of the questions that would help me alot as well) - It' works (yes, but you would be amazed, for the life of me I could not get antlr to do anything - java write once, run nowhere...) - so being able to generate code and working programs from the examples was great, and a real +100 for the project.. - Syntax - on the face of it looks reasonable, and easy to understand. - similar enough to antlr.. (That's the end of the really good stuff) - Documentation While I know it's a pain to write, the things you have already tend to focus on how the parser is built, and are biased to someone understanding the internals and phrase-ology involved in parsers, rather than an end user - who just knows if I'm looking for this.. - then put this, and the result is available in these variables: Specifically I've no idea what the meanings of these are, and they are rather critical to the docs....: "Terminal" "Non-Terminal" - Regex's While I can see the benefit's I'd much rather the compiler built them for me.. - part of the beauty of the BNF format is that it's easy to read, and explains regex style situations alot better.. - Otherwise (see below about explaining how they can deal with classic situations...) - How to handle classic situations This is the key to success for the Documentation. (and what is seriously missing) - as most people will probably have come from a lexx/yacc background... These are a few classic examples that the Documentation could do with. * Top level parser starts. Most grammers start with a top level statement, eg. Program: Statements; In which case the application should only start by solving Statements, - the biggest problem I found was that I had no idea how to stop it matching any of the condition rules (that were only relivant to a specific state - eg. see next example) * Parse a string This is a common pattern but it's quite difficult to see how to implement it. -- And as above, when I tried, the parser started matching DoubleQuotedStringChars at the start of the file (even though it's only used in DoubleQuotedString. DoubleQuotedString; QUOTE DoubleQuotedStringChars QUOTE DoubleQuotedStringChars: (DoubleQuotedStringChar)* DoubleQuotedStringChar: "\" ANYCHAR: ^QUOTE; * Classic groupings: (.....)* eg. many of these matches.. (.....)+ eg. one or more of these matches.. (.....)? eg. one or none of these matches.. (.....)=> ... if forward lookup succeeds on (...) try match next combo. Regards Alan Jascha Wetzel wrote:Dan wrote:I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry. At the moment, I'm trying recursive descent parsing. The problem is that I've realized I'm duplicating huge volumes of code to cope with the tristate decision of { unexpected, allow, require } for any given token. For example, to consume a for loop, you consume something similar to /for\s*\((.*?)\)\s*\{(.*?)\}/ I have it doing that, but my soul feels heavy with the masses of looped switches it's doing. Is there any way to ease the pain?
a parser generator :) writing a parser or scanner manually is a bit like writing any program in assembler - tedious, error-prone and not well maintainable. there's a lot of stuff in a parser that can be automatically generated. even if you want to write the parser all by yourself, i'd rather suggest you write a simple parser generator to do that tedious part for you.
Jan 08 2008
Alan Knowles wrote:- It' works (yes, but you would be amazed, for the life of me I could not get antlr to do anything - java write once, run nowhere...) - so being able to generate code and working programs from the examples was great, and a real +100 for the project..
I'm glad to hear that, although, i must admit, i was so bold to assume that already ;)- Documentation While I know it's a pain to write, the things you have already tend to focus on how the parser is built, and are biased to someone understanding the internals and phrase-ology involved in parsers, rather than an end user - who just knows if I'm looking for this.. - then put this, and the result is available in these variables:
Yeah, i became aware of that through the feedback. Without thinking too much, i assumed that using parsers would be something you'd do only if you dealt with parsers intimately. Therefore i felt it should be sufficient to describe everything that's specific to apaged. Once i find the time, i'll extend the documentation with a thorough hands-on guide.Specifically I've no idea what the meanings of these are, and they are rather critical to the docs....: "Terminal" "Non-Terminal"
A Terminal is a symbol that is "final" wrt. expansion, while a Non-Terminal can be expanded by some rules. Terminals, Tokens and Lexemes are *practically* more or less the same. In apaged, a Terminal is always a string or a regexp. A Non-Terminal is specified by a set of production rules that tell how to match a sequence of Terminals and Non-Terminals. In apaged, a Non-Terminal is always something looking like this: MyNT { "a" MyOtherNT MyNT; "b" Nt2 { /* D Code ... */ } } If you think of parse trees when dealing with your grammar (if not, ignore the rest), Non-Terminals are the inner nodes, and Terminals are the leaves.- Regex's While I can see the benefit's I'd much rather the compiler built them for me.. - part of the beauty of the BNF format is that it's easy to read, and explains regex style situations alot better.. - Otherwise (see below about explaining how they can deal with classic situations...)
Do you mean regexp-like syntax for rules (i.e. EBNF syntax)? This is a feature on my todo list, but i still have to find a nice way to merge this with the semantic code (which is non-trivial) - a thing for the future. If you mean Regex's as in "asdf[a-z0-9]*", then i don't understand what you mean, since apaged does compile them for you. You don't need to use them at all, though. They're merely a convenience feature.- How to handle classic situations This is the key to success for the Documentation. (and what is seriously missing) - as most people will probably have come from a lexx/yacc background... These are a few classic examples that the Documentation could do with. * Top level parser starts. Most grammers start with a top level statement, eg. Program: Statements; In which case the application should only start by solving Statements, - the biggest problem I found was that I had no idea how to stop it matching any of the condition rules (that were only relivant to a specific state - eg. see next example)
The docs state that "The first non-terminal in a grammar file will be treated as the start symbol for the grammar." I admit that it should be "...in the main grammar file..." since that doesn't apply for imported grammar files.* Parse a string This is a common pattern but it's quite difficult to see how to implement it. -- And as above, when I tried, the parser started matching DoubleQuotedStringChars at the start of the file (even though it's only used in DoubleQuotedString. DoubleQuotedString; QUOTE DoubleQuotedStringChars QUOTE DoubleQuotedStringChars: (DoubleQuotedStringChar)* DoubleQuotedStringChar: "\" ANYCHAR: ^QUOTE;
Hm, it shouldn't do that. I'll assume that's because of how QUOTE is defined. I'd need to see the whole grammar. Besides that, you use 2 unsupported EBNF features in the above grammar: ^QUOTE and (DoubleQuotedStringChar)* Apaged doesn't support full EBNF syntax for the reason mentioned above.* Classic groupings: (.....)* eg. many of these matches.. (.....)+ eg. one or more of these matches.. (.....)? eg. one or none of these matches.. (.....)=> ... if forward lookup succeeds on (...) try match next combo.
None of those are supported. You need to write them BNF style: A* becomes As { As A; epsilon; } A+ becomes Ap { Ap A; A; } A? becomes Aq { A; epsilon; } Lookahead isn't supported for rules either. You may lookahead a single Terminal symbol, though, using >"a" or >["a" "b"], as documented. Rule-level lookahead can also be rewritten BNF style, but it may affect more related rules and therefore depends on your instance of the problem. I hope that helps a bit. If you read up on BNF vs. EBNF and consider that apaged is BNF based, you should find solutions to all of these problems.
Jan 09 2008
: D So far I've got it doing an LL(0) predictive tokenizer which generates a [still pretty buggy] AST. I'm quite proud of it at the moment, as I'm certain now that I can accomplish LL(0) scanning for everything but binaryOperators; where it's LL(n) | n E expression. Jascha Wetzel Wrote:Alan Knowles wrote:- Documentation While I know it's a pain to write, the things you have already tend to focus on how the parser is built, and are biased to someone understanding the internals and phrase-ology involved in parsers, rather than an end user - who just knows if I'm looking for this.. - then put this, and the result is available in these variables:
Yeah, i became aware of that through the feedback. Without thinking too much, i assumed that using parsers would be something you'd do only if you dealt with parsers intimately.
I took linguistics, and I'm an interpreter writer, and I still haven't looked at BNF or YACC/BISON notation yet. To be honest, I'm not interested in it. Like MathML, it's way too far from the machine to generate an *efficient* parser. Mine might not be efficient, but that wouldn't be intrinsic to it being low-level.A Terminal is a symbol that is "final" wrt. expansion, while a Non-Terminal can be expanded by some rules. Terminals, Tokens and Lexemes are *practically* more or less the same.
Linguistics assigns them very different meanings.If you think of parse trees when dealing with your grammar
Are those like sentence trees, with noun phrase, verb phrase, etc?ignore the rest), Non-Terminals are the inner nodes, and Terminals are the leaves.
That makes sense.- How to handle classic situations
When I first started with Walnut 0.x, I was grinding my brain trying to figure out how to match these correctly: { { /* } */ } , { } } Another classical problem is JavaScript RegExp literals or divide: /bob/i can be "divide bob divide i", or a regexp, depending on whether we expect an operator or operand. How would you write that? How would the machine read that? Both are highly important, but I find most parser writers only care about the former, and that the product is a working AST. If I only wanted that, I could write the interpreter entirely in javascript regular expressions and it'll only be 40 lines of code. In fact, I think someone did that already, but I'm sure he wasn't that terse. Regards, Dan
Jan 09 2008
Dan wrote:Yeah, i became aware of that through the feedback. Without thinking too much, i assumed that using parsers would be something you'd do only if you dealt with parsers intimately.
I took linguistics, and I'm an interpreter writer, and I still haven't looked at BNF or YACC/BISON notation yet. To be honest, I'm not interested in it. Like MathML, it's way too far from the machine to generate an *efficient* parser. Mine might not be efficient, but that wouldn't be intrinsic to it being low-level.
That depends on the grammar that is being parsed. Simple grammars can often be parsed faster with hand-optimized parsers. The more complex the grammar is, the less impact the generated parsers' overhead has. Apaged was tailored for parsing D, and it's very fast for that. Last time i checked, it parsed the complete Tango package in less than 2 seconds (including disk io).A Terminal is a symbol that is "final" wrt. expansion, while a Non-Terminal can be expanded by some rules. Terminals, Tokens and Lexemes are *practically* more or less the same.
Linguistics assigns them very different meanings.
Formal languages do too, but for practical use it's not much of a difference.If you think of parse trees when dealing with your grammar
Are those like sentence trees, with noun phrase, verb phrase, etc?
Probably. I don't know linguistics. Example: S -> A A -> aBa B -> bCb C -> c C -> A parsing ababcbaba yields S | _____ A _____ / | \ a ___ B____ a / | \ b _ C__ b / | \ a B a /|\ b C b | c- How to handle classic situations
When I first started with Walnut 0.x, I was grinding my brain trying to figure out how to match these correctly: { { /* } */ } , { } }
it's a matter of what else is allowed in { }. besides, usually /* */ comments are handled as whitespace lexemes, solving the problem before parsing.Another classical problem is JavaScript RegExp literals or divide: /bob/i can be "divide bob divide i", or a regexp, depending on whether we expect an operator or operand. How would you write that? How would the machine read that?
the whole problem of parsing such a thing doesn't arise until embedded into the grammar. it therefore depends on what else interferes with this syntax. i don't know exactly what's allowed in JavaScript, but you can probably distinguish these expressions by the leading / - in a arithmetic expression that isn't allowed, therefore the parser tries to match a regexp expression. Very simplified: Expr -> ReEx | ArEx ArEx -> ArEx '/' ArEx | Identifier | NumericLiteral ReEx -> '/' StringLiteral '/' OptParameters Since neither Identifier nor NumericLiteral may start with '/' (i.e. '/' is not in the first-set of ArEx), the grammar unambiguous.
Jan 10 2008
Jascha Wetzel Wrote:Dan wrote:Like MathML, it's way too far from the machine to generate an *efficient* parser.
often be parsed faster with hand-optimized parsers. The more complex the grammar is, the less impact the generated parsers' overhead has.
Ah, but there's always overhead.Apaged was tailored for parsing D, and it's very fast for that. Last time i checked, it parsed the complete Tango package in less than 2 seconds (including disk io).
Eep. That would instantly make any JavaScript interpreter a failure; scripts need to *run* in the < 500ms (unnoticeable) range to even be considered.it's a matter of what else is allowed in { }. besides, usually /* */ comments are handled as whitespace lexemes, solving the problem before parsing.
Aha! Well then, the way I wrote my scanner/parser, a whole tree is built before parsing. It's not fully functional yet, but I'm not seeing any design failures.Another classical problem is JavaScript RegExp literals or divide: /bob/i can be "divide bob divide i", or a regexp, depending on whether we expect an operator or operand. How would you write that? How would the machine read that?
the whole problem of parsing such a thing doesn't arise until embedded into the grammar. it therefore depends on what else interferes with this syntax. i don't know exactly what's allowed in JavaScript, but you can probably distinguish these expressions by the leading / - in a arithmetic expression that isn't allowed, therefore the parser tries to match a regexp expression. Very simplified: Expr -> ReEx | ArEx ArEx -> ArEx '/' ArEx | Identifier | NumericLiteral ReEx -> '/' StringLiteral '/' OptParameters Since neither Identifier nor NumericLiteral may start with '/' (i.e. '/' is not in the first-set of ArEx), the grammar unambiguous.
So you mean to say, it does it by looking at the tokens before and after and inferring the value based on whether an operator or operand is expected? That makes sense. I did it similar to that.
Jan 12 2008
Dan wrote:That depends on the grammar that is being parsed. Simple grammars can often be parsed faster with hand-optimized parsers. The more complex the grammar is, the less impact the generated parsers' overhead has.
Ah, but there's always overhead.
What i meant was, that this overhead has a purpose. Once your grammar uses it, it's not overhead anymore. Automatically generated parsers usually facilitate bottom-up parsing algorithms (LALR, GLR) that have several advantages over top-down algorithms (LL(k), usually implemented as recursive descent) that are used in manually written parsers. LALR for example is completely iterative (i.e. no stack action, no calls, less code -> completely cached), making it more efficient than recursive descent. GLR is almost completely iterative, depending on the implementation. If an automatically generated parser is slower, it is usually due to features that make it more flexible wrt. to grammars. Therefore, as stated initially, once you need this flexibility, the parser performs nicely. Another thing is backtracking. RD can use lookahead, but it is more elaborate to implement it manually and generally, top-down lookahead isn't as effective as bottom-up lookahead. That is, it takes more complex grammars to create the need for backtracking in a bottom-up parser than in a top-down parser. That is where bottom-up parsing is categorically faster than top-down. All this makes me claim that it'll be hard to write for example a recursive descent D parser by hand, that performs better than it's automatically generated GLR counterpart. You might still succeed by a small margin, but you will have spent a *lot* more time to do so. And you will have a parser that is not near as easily maintainable as the grammar used for the automatically generated parser.Apaged was tailored for parsing D, and it's very fast for that. Last time i checked, it parsed the complete Tango package in less than 2 seconds (including disk io).
Eep. That would instantly make any JavaScript interpreter a failure; scripts need to *run* in the < 500ms (unnoticeable) range to even be considered.
The tango package has >400 files with multiple MB of code. I doubt there is any script or library written in JavaScript that is even close to that size. Parsing single files <1000 LoC is a matter of <1ms with apaged or any other LALR/GLR parser, no need to worry ;).So you mean to say, it does it by looking at the tokens before and after and inferring the value based on whether an operator or operand is expected? That makes sense. I did it similar to that.
In this case it even only needs to look at the next token. You're usually right with what you do to solve such a case if you don't need to backtrack. If so, you should double check that there is no other way.
Jan 12 2008
Dan Wrote:I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry. At the moment, I'm trying recursive descent parsing.
I've got it performing rather nicely now, for it's completeness. parseOperand() returns an { ident, int, double, string, regexp, expression, arrayliteral, objectliteral *value* } incomplete for e notation numbers incomplete for expression, arrayliteral, objectliteral parseBinaryOperator() returns one of the binary operator tokens parseOptionalWS() consumes whitespace and doesn't return anything I've got some more written, but they're not tested. Once I have numbers done, I'll try writing the arrayLiteral, which will of course, parseOperand() "," parseOperand() : )
Jan 06 2008
Dan Wrote:I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry. At the moment, I'm trying recursive descent parsing.
Have you looked at ANTLR? I have used it for two projects and I am quite impressed. It does not have a D backend, nor a D grammar AFAIK, but if you are writing a parser, you might find a lot of the work has been done for you already. If you go for antlr, I strongly recommend the book advertized on the antlr.org page, I liked it very much. Good luck.
Jan 09 2008
Dan Wrote:I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry. At the moment, I'm trying recursive descent parsing.
Have you looked at ANTLR? I have used it for two projects and I am quite impressed. It does not have a D backend, nor a D grammar AFAIK, but if you are writing a parser, you might find a lot of the work has been done for you already. If you go for antlr, I strongly recommend the book advertized on the antlr.org page, I liked it very much. Good luck.
Jan 09 2008
Ap Lee schrieb:Have you looked at ANTLR? I have used it for two projects and I am quite impressed. It does not have a D backend, nor a D grammar AFAIK, but if you are writing a parser, you might find a lot of the work has been done for you already. If you go for antlr, I strongly recommend the book advertized on the antlr.org page, I liked it very much. Good luck.
I haven't tested it, but there is a Antlr D backend: http://www.mbutscher.de/antlrd/
Jan 09 2008
I tried antlrd on debian, while I could get it to install (apt-get for pure antlr) and download antlrd, All the examples resulted in an error message: #runantlr /tmp/SimpleC.g /tmp/SimpleC.g:3:1: unexpected token: grammar Which did not have any google results for potential reasons.. Regards Alan Frank Benoit wrote:Ap Lee schrieb:Have you looked at ANTLR? I have used it for two projects and I am quite impressed. It does not have a D backend, nor a D grammar AFAIK, but if you are writing a parser, you might find a lot of the work has been done for you already. If you go for antlr, I strongly recommend the book advertized on the antlr.org page, I liked it very much. Good luck.
I haven't tested it, but there is a Antlr D backend: http://www.mbutscher.de/antlrd/
Jan 09 2008









Alan Knowles <alan akbkhome.com> 