digitalmars.D.learn - Writing a Parser

Dan (8/8) Jan 04 2008 I've been messing with how to write a parser, and so far I've played wit...

Jarrett Billingsley (8/18) Jan 04 2008 Separate tokenization and syntax parsing? It makes things a hell of a l...

Alan Knowles (21/46) Jan 04 2008 Yes, normally when doing OO patterns for this

Jascha Wetzel (7/17) Jan 04 2008 a parser generator :)

0ffh (5/14) Jan 05 2008 Or a parsing combinator library. :)
Alan Knowles (58/80) Jan 08 2008 I did spend some time looking at aPaGeD for this yesteday - here's some

Jascha Wetzel (66/121) Jan 09 2008 I'm glad to hear that, although, i must admit, i was so bold to assume

Dan (18/35) Jan 09 2008 : D

Jascha Wetzel (43/66) Jan 10 2008 That depends on the grammar that is being parsed. Simple grammars can

Dan (6/37) Jan 12 2008 Eep. That would instantly make any JavaScript interpreter a failure; sc...

Jascha Wetzel (34/47) Jan 12 2008 What i meant was, that this overhead has a purpose. Once your grammar

Dan (12/16) Jan 06 2008 I've got it performing rather nicely now, for it's completeness.

Ap Lee (6/11) Jan 09 2008 Have you looked at ANTLR? I have used it for two projects and I am quite...
Ap Lee (6/11) Jan 09 2008 Have you looked at ANTLR? I have used it for two projects and I am quite...

Frank Benoit (3/10) Jan 09 2008 I haven't tested it, but there is a Antlr D backend:

Alan Knowles (9/20) Jan 09 2008 I tried antlrd on debian, while I could get it to install (apt-get for

Christoph Singewald (6/20) Jan 10 2008 Here you can find some parsergenerators:

bearophile (6/8) Jan 10 2008 I have seen re2c and it looks nice. I think its source code can be modif...

Dan <murpsoft hotmail.com> writes:

I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry.

At the moment, I'm trying recursive descent parsing.

The problem is that I've realized I'm duplicating huge volumes of code to cope
with the tristate decision of { unexpected, allow, require } for any given
token.

For example, to consume a for loop, you consume something similar to
/for\s*\((.*?)\)\s*\{(.*?)\}/

I have it doing that, but my soul feels heavy with the masses of looped
switches it's doing.  Is there any way to ease the pain?

Regards,
Dan

Jan 04 2008

"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:

"Dan" <murpsoft hotmail.com> wrote in message 
news:flmtrv$2jrn$1 digitalmars.com...
 I've been messing with how to write a parser, and so far I've played with 
 numerous patterns before eventually wanting to cry.

 At the moment, I'm trying recursive descent parsing.

 The problem is that I've realized I'm duplicating huge volumes of code to 
 cope with the tristate decision of { unexpected, allow, require } for any 
 given token.

 For example, to consume a for loop, you consume something similar to
 /for\s*\((.*?)\)\s*\{(.*?)\}/

 I have it doing that, but my soul feels heavy with the masses of looped 
 switches it's doing.  Is there any way to ease the pain?

Separate tokenization and syntax parsing?  It makes things a hell of a lot 
easier.  You don't even necessarily have to tokenize the source entirely 
before parsing; just have a lexer which lexes tokens out of the source on 
demand.  The syntax parsing is then unencumbered from dealing with the raw 
source and just has to do stuff like "expect 'for', expect left-paren, 
expect (your condition), expect right-paren" etc.

Jan 04 2008

Alan Knowles <alan akbkhome.com> writes:

Yes, normally when doing OO patterns for this

class Parser extends Tokenizer {...}

either that or
class Parser {
	Tokenizer reader;
	parser code....
}

Recommended reading
- the parser in DMD's source (not DMDScript) is quite nice - it's a hand 
coded one and is quite easy to understand.

This is interesting from the perspective of using function pointers to 
do pattern testing (it's horribly borked from the perspective that it's 
really the tokenizer..)
http://www.dsource.org/projects/leds/browser/trunk/src/language/php/Parser.d

If you dont want a hand coded parser, have a look at csLex - It's what 
mono used, and is pretty trivial to modify, and retarget to generate 
other language code (eg. D code...)

 From someone who's written far too many parsers ;)
Regards
Alan




Jarrett Billingsley wrote:
 "Dan" <murpsoft hotmail.com> wrote in message 
 news:flmtrv$2jrn$1 digitalmars.com...
 I've been messing with how to write a parser, and so far I've played with 
 numerous patterns before eventually wanting to cry.

 At the moment, I'm trying recursive descent parsing.

 The problem is that I've realized I'm duplicating huge volumes of code to 
 cope with the tristate decision of { unexpected, allow, require } for any 
 given token.

 For example, to consume a for loop, you consume something similar to
 /for\s*\((.*?)\)\s*\{(.*?)\}/

 I have it doing that, but my soul feels heavy with the masses of looped 
 switches it's doing.  Is there any way to ease the pain?

 
 Separate tokenization and syntax parsing?  It makes things a hell of a lot 
 easier.  You don't even necessarily have to tokenize the source entirely 
 before parsing; just have a lexer which lexes tokens out of the source on 
 demand.  The syntax parsing is then unencumbered from dealing with the raw 
 source and just has to do stuff like "expect 'for', expect left-paren, 
 expect (your condition), expect right-paren" etc.

Jan 04 2008

Jascha Wetzel <firstname mainia.de> writes:

Dan wrote:
 I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry.
 
 At the moment, I'm trying recursive descent parsing.
 
 The problem is that I've realized I'm duplicating huge volumes of code to cope
with the tristate decision of { unexpected, allow, require } for any given
token.
 
 For example, to consume a for loop, you consume something similar to
 /for\s*\((.*?)\)\s*\{(.*?)\}/
 
 I have it doing that, but my soul feels heavy with the masses of looped
switches it's doing.  Is there any way to ease the pain?

a parser generator :)
writing a parser or scanner manually is a bit like writing any program 
in assembler - tedious, error-prone and not well maintainable. there's a 
lot of stuff in a parser that can be automatically generated.
even if you want to write the parser all by yourself, i'd rather suggest 
you write a simple parser generator to do that tedious part for you.

Jan 04 2008

0ffh <frank youknow.what.todo.interNETz> writes:

Jascha Wetzel wrote:
 Dan wrote:
 I've been messing with how to write a parser, and so far I've played 
 with numerous patterns before eventually wanting to cry.
  [...]
 I have it doing that, but my soul feels heavy with the masses of 
 looped switches it's doing.  Is there any way to ease the pain?

 
 a parser generator :)
 [...]

Or a parsing combinator library. :)

And it's always nice to roll your own, if you have the extra time.
Not only to have, but to know.

regards, frank

Jan 05 2008

Alan Knowles <alan akbkhome.com> writes:

I did spend some time looking at aPaGeD for this yesteday - here's some 
feedback that may help (and if you can answer some of the questions that 
would help me alot as well)

- It' works (yes, but you would be amazed, for the life of me I could 
not get antlr to do anything - java write once, run nowhere...) - so 
being able to generate code and working programs from the examples was 
great, and a real +100 for the project..

- Syntax - on the face of it looks reasonable, and easy to understand. - 
similar enough to antlr..
(That's the end of the really good stuff)


- Documentation
While I know it's a pain to write, the things you have already tend to 
focus on how the parser is built, and are biased to someone 
understanding the internals and phrase-ology involved in parsers, rather 
than an end user - who just knows if I'm looking for this.. - then put 
this, and the result is available in these variables:

Specifically I've no idea what the meanings of these are, and they are 
rather critical to the docs....:
"Terminal" "Non-Terminal"

- Regex's
While I can see the benefit's I'd much rather the compiler built them 
for me.. - part of the beauty of the BNF format is that it's easy to 
read, and explains regex style situations alot better.. - Otherwise (see 
below about explaining how they can deal with classic situations...)

- How to handle classic situations
This is the key to success for the Documentation. (and what is seriously 
missing) - as most people will probably have come from a lexx/yacc 
background...

These are a few classic examples that the Documentation could do with.

* Top level parser starts.
Most grammers start with a top level statement, eg.
Program:
	Statements;

In which case the application should only start by solving Statements, - 
the biggest problem I found was that I had no idea how to stop it 
matching any of the condition rules (that were only relivant to a 
specific state - eg. see next example)


* Parse a string
This is a common pattern but it's quite difficult to see how to 
implement it. -- And as above, when I tried, the parser started matching 
DoubleQuotedStringChars at the start of the file (even though it's only 
used in  DoubleQuotedString.

DoubleQuotedString;
	QUOTE DoubleQuotedStringChars QUOTE

DoubleQuotedStringChars:
	(DoubleQuotedStringChar)*

DoubleQuotedStringChar:
	"\" ANYCHAR:
	^QUOTE;
	
* Classic groupings:
	(.....)*    eg. many of these matches..
	(.....)+    eg. one or more of these matches..
	(.....)?    eg. one or none of these matches..
	(.....)=> ...    if forward lookup succeeds on (...) try match next combo.


Regards
Alan









Jascha Wetzel wrote:
 Dan wrote:
 I've been messing with how to write a parser, and so far I've played 
 with numerous patterns before eventually wanting to cry.

 At the moment, I'm trying recursive descent parsing.

 The problem is that I've realized I'm duplicating huge volumes of code 
 to cope with the tristate decision of { unexpected, allow, require } 
 for any given token.

 For example, to consume a for loop, you consume something similar to
 /for\s*\((.*?)\)\s*\{(.*?)\}/

 I have it doing that, but my soul feels heavy with the masses of 
 looped switches it's doing.  Is there any way to ease the pain?

 
 a parser generator :)
 writing a parser or scanner manually is a bit like writing any program 
 in assembler - tedious, error-prone and not well maintainable. there's a 
 lot of stuff in a parser that can be automatically generated.
 even if you want to write the parser all by yourself, i'd rather suggest 
 you write a simple parser generator to do that tedious part for you.

Jan 08 2008

Jascha Wetzel <firstname mainia.de> writes:

Alan Knowles wrote:
 - It' works (yes, but you would be amazed, for the life of me I could 
 not get antlr to do anything - java write once, run nowhere...) - so 
 being able to generate code and working programs from the examples was 
 great, and a real +100 for the project..

I'm glad to hear that, although, i must admit, i was so bold to assume 
that already ;)

 - Documentation
 While I know it's a pain to write, the things you have already tend to 
 focus on how the parser is built, and are biased to someone 
 understanding the internals and phrase-ology involved in parsers, rather 
 than an end user - who just knows if I'm looking for this.. - then put 
 this, and the result is available in these variables:

Yeah, i became aware of that through the feedback. Without thinking too 
much, i assumed that using parsers would be something you'd do only if 
you dealt with parsers intimately. Therefore i felt it should be 
sufficient to describe everything that's specific to apaged.
Once i find the time, i'll extend the documentation with a thorough 
hands-on guide.

 Specifically I've no idea what the meanings of these are, and they are 
 rather critical to the docs....:
 "Terminal" "Non-Terminal"

A Terminal is a symbol that is "final" wrt. expansion, while a 
Non-Terminal can be expanded by some rules. Terminals, Tokens and 
Lexemes are *practically* more or less the same. In apaged, a Terminal 
is always a string or a regexp.
A Non-Terminal is specified by a set of production rules that tell how 
to match a sequence of Terminals and Non-Terminals. In apaged, a 
Non-Terminal is always something looking like this:

MyNT
{
   "a" MyOtherNT MyNT;
   "b" Nt2
   { /* D Code ... */ }
}

If you think of parse trees when dealing with your grammar (if not, 
ignore the rest), Non-Terminals are the inner nodes, and Terminals are 
the leaves.

 - Regex's
 While I can see the benefit's I'd much rather the compiler built them 
 for me.. - part of the beauty of the BNF format is that it's easy to 
 read, and explains regex style situations alot better.. - Otherwise (see 
 below about explaining how they can deal with classic situations...)

Do you mean regexp-like syntax for rules (i.e. EBNF syntax)?
This is a feature on my todo list, but i still have to find a nice way 
to merge this with the semantic code (which is non-trivial) - a thing 
for the future.

If you mean Regex's as in "asdf[a-z0-9]*", then i don't understand what 
you mean, since apaged does compile them for you. You don't need to use 
them at all, though. They're merely a convenience feature.

 - How to handle classic situations
 This is the key to success for the Documentation. (and what is seriously 
 missing) - as most people will probably have come from a lexx/yacc 
 background...
 
 These are a few classic examples that the Documentation could do with.
 
 * Top level parser starts.
 Most grammers start with a top level statement, eg.
 Program:
     Statements;
 
 In which case the application should only start by solving Statements, - 
 the biggest problem I found was that I had no idea how to stop it 
 matching any of the condition rules (that were only relivant to a 
 specific state - eg. see next example)

The docs state that
"The first non-terminal in a grammar file will be treated as the start 
symbol for the grammar."

I admit that it should be "...in the main grammar file..." since that 
doesn't apply for imported grammar files.

 * Parse a string
 This is a common pattern but it's quite difficult to see how to 
 implement it. -- And as above, when I tried, the parser started matching 
 DoubleQuotedStringChars at the start of the file (even though it's only 
 used in  DoubleQuotedString.
 
 DoubleQuotedString;
     QUOTE DoubleQuotedStringChars QUOTE
 
 DoubleQuotedStringChars:
     (DoubleQuotedStringChar)*
 
 DoubleQuotedStringChar:
     "\" ANYCHAR:
     ^QUOTE;

Hm, it shouldn't do that. I'll assume that's because of how QUOTE is 
defined. I'd need to see the whole grammar.
Besides that, you use 2 unsupported EBNF features in the above grammar:
^QUOTE and
(DoubleQuotedStringChar)*
Apaged doesn't support full EBNF syntax for the reason mentioned above.

 * Classic groupings:
     (.....)*    eg. many of these matches..
     (.....)+    eg. one or more of these matches..
     (.....)?    eg. one or none of these matches..
     (.....)=> ...    if forward lookup succeeds on (...) try match next 
 combo.

None of those are supported. You need to write them BNF style:
A* becomes

As {
   As A;
   epsilon;
}

A+ becomes

Ap {
   Ap A;
   A;
}

A? becomes

Aq {
   A;
   epsilon;
}

Lookahead isn't supported for rules either. You may lookahead a single 
Terminal symbol, though, using >"a" or >["a" "b"], as documented.
Rule-level lookahead can also be rewritten BNF style, but it may affect 
more related rules and therefore depends on your instance of the problem.


I hope that helps a bit. If you read up on BNF vs. EBNF and consider 
that apaged is BNF based, you should find solutions to all of these 
problems.

Jan 09 2008

Dan <murpsoft hotmail.com> writes:

: D

So far I've got it doing an LL(0) predictive tokenizer which generates a [still
pretty buggy] AST.

I'm quite proud of it at the moment, as I'm certain now that I can accomplish
LL(0) scanning for everything but binaryOperators; where it's LL(n) | n E
expression.


Jascha Wetzel Wrote:
 Alan Knowles wrote:
 - Documentation
 While I know it's a pain to write, the things you have already tend to 
 focus on how the parser is built, and are biased to someone 
 understanding the internals and phrase-ology involved in parsers, rather 
 than an end user - who just knows if I'm looking for this.. - then put 
 this, and the result is available in these variables:

 
 Yeah, i became aware of that through the feedback. Without thinking too 
 much, i assumed that using parsers would be something you'd do only if 
 you dealt with parsers intimately. 

I took linguistics, and I'm an interpreter writer, and I still haven't looked
at BNF or YACC/BISON notation yet.  To be honest, I'm not interested in it. 
Like MathML, it's way too far from the machine to generate an *efficient*
parser.  Mine might not be efficient, but that wouldn't be intrinsic to it
being low-level.

 A Terminal is a symbol that is "final" wrt. expansion, while a 
 Non-Terminal can be expanded by some rules. Terminals, Tokens and Lexemes are
*practically* more or less the same. 

Linguistics assigns them very different meanings.

 If you think of parse trees when dealing with your grammar

Are those like sentence trees, with noun phrase, verb phrase, etc?

 ignore the rest), Non-Terminals are the inner nodes, and Terminals are 
 the leaves.

That makes sense.

 - How to handle classic situations


When I first started with Walnut 0.x, I was grinding my brain trying to figure
out how to match these correctly:

{  { /* } */ }  , { } }

Another classical problem is JavaScript RegExp literals or divide:

/bob/i  can be "divide bob divide i", or a regexp, depending on whether we
expect an operator or operand.

How would you write that?
How would the machine read that?

Both are highly important, but I find most parser writers only care about the
former, and that the product is a working AST.

If I only wanted that, I could write the interpreter entirely in javascript
regular expressions and it'll only be 40 lines of code.  In fact, I think
someone did that already, but I'm sure he wasn't that terse.

Regards,
Dan

Jan 09 2008

Jascha Wetzel <firstname mainia.de> writes:

Dan wrote:
 Yeah, i became aware of that through the feedback. Without thinking too 
 much, i assumed that using parsers would be something you'd do only if 
 you dealt with parsers intimately. 

 
 I took linguistics, and I'm an interpreter writer, and I still haven't looked
at BNF or YACC/BISON notation yet.  To be honest, I'm not interested in it. 
Like MathML, it's way too far from the machine to generate an *efficient*
parser.  Mine might not be efficient, but that wouldn't be intrinsic to it
being low-level.

That depends on the grammar that is being parsed. Simple grammars can 
often be parsed faster with hand-optimized parsers. The more complex the 
grammar is, the less impact the generated parsers' overhead has.
Apaged was tailored for parsing D, and it's very fast for that. Last 
time i checked, it parsed the complete Tango package in less than 2 
seconds (including disk io).


 A Terminal is a symbol that is "final" wrt. expansion, while a 
 Non-Terminal can be expanded by some rules. Terminals, Tokens and Lexemes are
*practically* more or less the same. 

 
 Linguistics assigns them very different meanings.

Formal languages do too, but for practical use it's not much of a 
difference.

 If you think of parse trees when dealing with your grammar

 
 Are those like sentence trees, with noun phrase, verb phrase, etc?

Probably. I don't know linguistics. Example:
S -> A
A -> aBa
B -> bCb
C -> c
C -> A

parsing ababcbaba yields
         S
         |
   _____ A _____
  /      |      \
a   ___ B____   a
    /    |    \
   b   _ C__   b
      /  |  \
     a   B   a
        /|\
       b C b
         |
         c


 - How to handle classic situations


 
 When I first started with Walnut 0.x, I was grinding my brain trying to figure
out how to match these correctly:
 
 {  { /* } */ }  , { } }

it's a matter of what else is allowed in { }. besides, usually /* */ 
comments are handled as whitespace lexemes, solving the problem before 
parsing.

 Another classical problem is JavaScript RegExp literals or divide:
 
 /bob/i  can be "divide bob divide i", or a regexp, depending on whether we
expect an operator or operand.
 
 How would you write that?
 How would the machine read that?

the whole problem of parsing such a thing doesn't arise until embedded 
into the grammar. it therefore depends on what else interferes with this 
syntax. i don't know exactly what's allowed in JavaScript, but you can 
probably distinguish these expressions by the leading / - in a 
arithmetic expression that isn't allowed, therefore the parser tries to 
match a regexp expression. Very simplified:
Expr -> ReEx | ArEx
ArEx -> ArEx '/' ArEx | Identifier | NumericLiteral
ReEx -> '/' StringLiteral '/' OptParameters

Since neither Identifier nor NumericLiteral may start with '/' (i.e. '/' 
is not in the first-set of ArEx), the grammar unambiguous.

Jan 10 2008

Dan <murpsoft hotmail.com> writes:

Jascha Wetzel Wrote:

 Dan wrote:
Like MathML, it's way too far from the machine to generate an *efficient*
parser.  

 That depends on the grammar that is being parsed. Simple grammars can 
 often be parsed faster with hand-optimized parsers. The more complex the 
 grammar is, the less impact the generated parsers' overhead has.

Ah, but there's always overhead.

 Apaged was tailored for parsing D, and it's very fast for that. Last 
 time i checked, it parsed the complete Tango package in less than 2 
 seconds (including disk io).

Eep.  That would instantly make any JavaScript interpreter a failure; scripts
need to *run* in the < 500ms (unnoticeable) range to even be considered.

 it's a matter of what else is allowed in { }. besides, usually /* */ 
 comments are handled as whitespace lexemes, solving the problem before 
 parsing.

Aha!  Well then, the way I wrote my scanner/parser, a whole tree is built
before parsing.  It's not fully functional yet, but I'm not seeing any design
failures.

 
 Another classical problem is JavaScript RegExp literals or divide:
 
 /bob/i  can be "divide bob divide i", or a regexp, depending on whether we
expect an operator or operand.
 
 How would you write that?
 How would the machine read that?

 
 the whole problem of parsing such a thing doesn't arise until embedded 
 into the grammar. it therefore depends on what else interferes with this 
 syntax. i don't know exactly what's allowed in JavaScript, but you can 
 probably distinguish these expressions by the leading / - in a 
 arithmetic expression that isn't allowed, therefore the parser tries to 
 match a regexp expression. Very simplified:
 Expr -> ReEx | ArEx
 ArEx -> ArEx '/' ArEx | Identifier | NumericLiteral
 ReEx -> '/' StringLiteral '/' OptParameters
 
 Since neither Identifier nor NumericLiteral may start with '/' (i.e. '/' 
 is not in the first-set of ArEx), the grammar unambiguous.

So you mean to say, it does it by looking at the tokens before and after and
inferring the value based on whether an operator or operand is expected?

That makes sense.  I did it similar to that.

Jan 12 2008

Jascha Wetzel <firstname mainia.de> writes:

Dan wrote:
 That depends on the grammar that is being parsed. Simple grammars can 
 often be parsed faster with hand-optimized parsers. The more complex the 
 grammar is, the less impact the generated parsers' overhead has.

 
 Ah, but there's always overhead.

What i meant was, that this overhead has a purpose. Once your grammar 
uses it, it's not overhead anymore.
Automatically generated parsers usually facilitate bottom-up parsing 
algorithms (LALR, GLR) that have several advantages over top-down 
algorithms (LL(k), usually implemented as recursive descent) that are 
used in manually written parsers.
LALR for example is completely iterative (i.e. no stack action, no 
calls, less code -> completely cached), making it more efficient than 
recursive descent. GLR is almost completely iterative, depending on the 
implementation.

If an automatically generated parser is slower, it is usually due to 
features that make it more flexible wrt. to grammars. Therefore, as 
stated initially, once you need this flexibility, the parser performs 
nicely.

Another thing is backtracking. RD can use lookahead, but it is more 
elaborate to implement it manually and generally, top-down lookahead 
isn't as effective as bottom-up lookahead. That is, it takes more 
complex grammars to create the need for backtracking in a bottom-up 
parser than in a top-down parser. That is where bottom-up parsing is 
categorically faster than top-down.

All this makes me claim that it'll be hard to write for example a 
recursive descent D parser by hand, that performs better than it's 
automatically generated GLR counterpart. You might still succeed by a 
small margin, but you will have spent a *lot* more time to do so. And 
you will have a parser that is not near as easily maintainable as the 
grammar used for the automatically generated parser.

 Apaged was tailored for parsing D, and it's very fast for that. Last 
 time i checked, it parsed the complete Tango package in less than 2 
 seconds (including disk io).

 
 Eep.  That would instantly make any JavaScript interpreter a failure; scripts
need to *run* in the < 500ms (unnoticeable) range to even be considered.

The tango package has >400 files with multiple MB of code. I doubt there 
is any script or library written in JavaScript that is even close to 
that size. Parsing single files <1000 LoC is a matter of <1ms with 
apaged or any other LALR/GLR parser, no need to worry ;).

 So you mean to say, it does it by looking at the tokens before and after and
inferring the value based on whether an operator or operand is expected?
 
 That makes sense.  I did it similar to that.

In this case it even only needs to look at the next token. You're 
usually right with what you do to solve such a case if you don't need to 
backtrack. If so, you should double check that there is no other way.

Jan 12 2008

Dan <murpsoft hotmail.com> writes:

Dan Wrote:

 
 I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry.
 
 At the moment, I'm trying recursive descent parsing.

I've got it performing rather nicely now, for it's completeness.

parseOperand()
  returns an { ident, int, double, string, regexp, expression, arrayliteral,
objectliteral *value* }
  incomplete for e notation numbers
  incomplete for expression, arrayliteral, objectliteral

parseBinaryOperator()
  returns one of the binary operator tokens

parseOptionalWS()
  consumes whitespace and doesn't return anything

I've got some more written, but they're not tested.  Once I have numbers done,
I'll try writing the arrayLiteral, which will of course, parseOperand() ","
parseOperand()

: )

Jan 06 2008

Ap Lee <aplee primus.ca> writes:

 Dan Wrote:

 I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry.

 At the moment, I'm trying recursive descent parsing.


Have you looked at ANTLR? I have used it for two projects and I am quite 
impressed. It does not have a D backend, nor a D grammar AFAIK, but if you 
are writing a parser, you might find a lot of the work has been done for 
you already. If you go for antlr, I strongly recommend the book advertized 
on the antlr.org page, I liked it very much.

Good luck.

Jan 09 2008

Ap Lee <aplee primus.ca> writes:

 Dan Wrote:

 I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry.

 At the moment, I'm trying recursive descent parsing.


Have you looked at ANTLR? I have used it for two projects and I am quite 
impressed. It does not have a D backend, nor a D grammar AFAIK, but if you 
are writing a parser, you might find a lot of the work has been done for 
you already. If you go for antlr, I strongly recommend the book advertized 
on the antlr.org page, I liked it very much.

Good luck.

Jan 09 2008

Frank Benoit <keinfarbton googlemail.com> writes:

Ap Lee schrieb:
 Have you looked at ANTLR? I have used it for two projects and I am quite 
 impressed. It does not have a D backend, nor a D grammar AFAIK, but if 
 you are writing a parser, you might find a lot of the work has been done 
 for you already. If you go for antlr, I strongly recommend the book 
 advertized on the antlr.org page, I liked it very much.
 
 Good luck.

I haven't tested it, but there is a Antlr D backend:
http://www.mbutscher.de/antlrd/

Jan 09 2008

Alan Knowles <alan akbkhome.com> writes:

I tried antlrd on debian, while I could get it to install (apt-get for 
pure antlr) and download antlrd, All the examples resulted in an error 
message:
#runantlr /tmp/SimpleC.g
/tmp/SimpleC.g:3:1: unexpected token: grammar

Which did not have any google results for potential reasons..

Regards
Alan


Frank Benoit wrote:
 Ap Lee schrieb:
 Have you looked at ANTLR? I have used it for two projects and I am 
 quite impressed. It does not have a D backend, nor a D grammar AFAIK, 
 but if you are writing a parser, you might find a lot of the work has 
 been done for you already. If you go for antlr, I strongly recommend 
 the book advertized on the antlr.org page, I liked it very much.

 Good luck.

 
 I haven't tested it, but there is a Antlr D backend:
 http://www.mbutscher.de/antlrd/

Jan 09 2008

Christoph Singewald <chrisrtoph singewald.at> writes:

Here you can find some parsergenerators:

http://www.prowiki.org/wiki4d/wiki.cgi?GrammarParsers

I use lemonde as parsergenerator and re2c as lexer. Using re2c in most cases
you have only to replace 'unsigned int' with 'uint' in the resulting code.


regards
cs

Dan Wrote:

 
 I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry.
 
 At the moment, I'm trying recursive descent parsing.
 
 The problem is that I've realized I'm duplicating huge volumes of code to cope
with the tristate decision of { unexpected, allow, require } for any given
token.
 
 For example, to consume a for loop, you consume something similar to
 /for\s*\((.*?)\)\s*\{(.*?)\}/
 
 I have it doing that, but my soul feels heavy with the masses of looped
switches it's doing.  Is there any way to ease the pain?
 
 Regards,
 Dan

Jan 10 2008

bearophile <bearophileHUGS lycos.com> writes:

Christoph Singewald:
 I use lemonde as parsergenerator and re2c as lexer. Using re2c in most
 cases you have only to replace 'unsigned int' with 'uint' in the resulting
code.

I have seen re2c and it looks nice. I think its source code can be modified
with not that much efforts to make it produce D code instead of C.

I think C isn't the right language to write such tool: its sources are about
200 KB of C code (plus some code generated by itself), they can probably be
replaced by a 50 (or less) KB Python module (that generates the C/D code).

Something like a tiny but really fast C compiler like TinyCC
(http://fabrice.bellard.free.fr/tcc/) can be used to compile the C code on the
fly in memory and execute it. This may become the starting point to give
run-time compiled Regular Expressions to D :-) Probably there are other smarter
ways to do similar things.

Bye,
bearophile

Jan 10 2008

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Writing a Parser