www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - How do I write a "lexer or parser generators"

reply "OP" <749691a80dee 52222a85af35.anonbox.net> writes:
I'd like Walter to reply to this. In his article here 
http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488

Walter says

 Somewhat more controversial, I wouldn't bother wasting time 
 with lexer or parser generators and other so-called "compiler 
 compilers." They're a waste of time. Writing a lexer and parser 
 is a tiny percentage of the job of writing a compiler. Using a 
 generator will take up about as much time as writing one by 
 hand, and it will marry you to the generator (which matters 
 when porting the compiler to a new platform). And generators 
 also have the unfortunate reputation of emitting lousy error 
 messages.

I have no idea how to write one. First off when I originally tried before using anything I wrote nonsense code. Second is I couldn't tell if I had a conflict (either shift reduce or reduce reduce) or if I wrote it wrong/badly. Third is I'm not sure how a tiny one should look like and my language is fairly large. I already understand flex and bison. I'm planning to rewrite from scratch so I could attempt doing so on my own but I don't know how to deal with the above. I just looked at parse.c and I had no idea there is a precedence table. Why is there one rather than it being embedded like a switch statement which tries to handle all the higher precedence operations calling a function running the next set of precedence? I guess that does sound like a loop/table but I imagined it in switch statements. Using bison I can write complex statements which are easy for me to grok and change. I wouldn't know how to change complex statements if I hand wrote the parser. I don't know all the places something like that would affect. Taking the syntax below I'm not sure how to fork a state and discard the invalid one.
 foo[5] = var2
 foo[5] foo = var2

Here when I see foo[5] I'm either accessing an array (first statement) or declaring variables as an array of 5 elements (second statement). Just like `Foo&foo` could be a reference or could be an AND statement. Foo is definitely processed on its own, I don't know how to process it as both (a fork) and continue on the parser to find a valid path. My bison file has >2000 states. I'm not sure how much work it may mean if I handwrote my parser.
Jan 23 2014
next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Thursday, 23 January 2014 at 08:12:08 UTC, OP wrote:
 I'd like Walter to reply to this. In his article here 
 http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488

 Walter says

 Somewhat more controversial, I wouldn't bother wasting time 
 with lexer or parser generators and other so-called "compiler 
 compilers." They're a waste of time. Writing a lexer and 
 parser is a tiny percentage of the job of writing a compiler. 
 Using a generator will take up about as much time as writing 
 one by hand, and it will marry you to the generator (which 
 matters when porting the compiler to a new platform). And 
 generators also have the unfortunate reputation of emitting 
 lousy error messages.

I have no idea how to write one. First off when I originally tried before using anything I wrote nonsense code. Second is I couldn't tell if I had a conflict (either shift reduce or reduce reduce) or if I wrote it wrong/badly. Third is I'm not sure how a tiny one should look like and my language is fairly large. I already understand flex and bison. I'm planning to rewrite from scratch so I could attempt doing so on my own but I don't know how to deal with the above. I just looked at parse.c and I had no idea there is a precedence table. Why is there one rather than it being embedded like a switch statement which tries to handle all the higher precedence operations calling a function running the next set of precedence? I guess that does sound like a loop/table but I imagined it in switch statements. Using bison I can write complex statements which are easy for me to grok and change. I wouldn't know how to change complex statements if I hand wrote the parser. I don't know all the places something like that would affect. Taking the syntax below I'm not sure how to fork a state and discard the invalid one.
 foo[5] = var2
 foo[5] foo = var2

Here when I see foo[5] I'm either accessing an array (first statement) or declaring variables as an array of 5 elements (second statement). Just like `Foo&foo` could be a reference or could be an AND statement. Foo is definitely processed on its own, I don't know how to process it as both (a fork) and continue on the parser to find a valid path. My bison file has >2000 states. I'm not sure how much work it may mean if I handwrote my parser.

Hand written parsers are usually LL(K), you just need to write the grammar in a way that is LL(K) compliant. Granted there are a few cases where only a LALR(k) parser will do (yacc and friends), but most context free grammars tend to be LL(K) as well. There are lots of tutorials how to write such types of compilers, for example, http://compilers.iecc.com/crenshaw/ http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf http://llvm.org/docs/tutorial/ -- Paulo
Jan 23 2014
prev sibling next sibling parent reply "Daniel Murphy" <yebbliesnospam gmail.com> writes:
 "OP"  wrote in message news:qtjpbqxlabjerowdlkfo forum.dlang.org... I just 
 looked at parse.c and I had no idea there is a precedence table. Why is 
 there one rather than it being embedded like a switch statement which 
 tries to handle all the higher precedence operations calling a function 
 running the next set of precedence? I guess that does sound like a 
 loop/table but I imagined it in switch statements.

The precedence table is mostly just used for leaving off redundant parens when printing expressions, not for parsing.
Jan 23 2014
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01/23/2014 10:06 AM, Daniel Murphy wrote:
 "OP"  wrote in message news:qtjpbqxlabjerowdlkfo forum.dlang.org... I
 just looked at parse.c and I had no idea there is a precedence table.
 Why is there one rather than it being embedded like a switch statement
 which tries to handle all the higher precedence operations calling a
 function running the next set of precedence? I guess that does sound
 like a loop/table but I imagined it in switch statements.

The precedence table is mostly just used for leaving off redundant parens when printing expressions, not for parsing.

Encoding precedence by function calls is unnecessarily wasteful though. (e.g. "a.b" requires roughly 14 recursive function calls to be parsed by such a scheme.)
Jan 23 2014
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01/23/2014 09:12 AM, OP wrote:
 Taking the syntax below I'm not sure how to fork a state and discard the
 invalid one.

 foo[5] = var2
 foo[5] foo = var2

Here when I see foo[5] I'm either accessing an array (first statement) or declaring variables as an array of 5 elements (second statement). Just like `Foo&foo` could be a reference or could be an AND statement. Foo is definitely processed on its own, I don't know how to process it as both (a fork) and continue on the parser to find a valid path.

A simple way to deal with this is to determine whether the parser is at the beginning of a declaration or an expression statement, eg. save the current position in the input stream, and try to skip (parse without building an AST) something looking like a declaration, remember if it worked and reset the position. If this succeeds, parse the declaration, otherwise parse an expression statement.
Jan 23 2014
prev sibling parent "Boyd" <gaboonviper gmx.net> writes:
On Thursday, 23 January 2014 at 08:12:08 UTC, OP wrote:
 I'd like Walter to reply to this. In his article here 
 http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488

 Walter says

 Somewhat more controversial, I wouldn't bother wasting time 
 with lexer or parser generators and other so-called "compiler 
 compilers." They're a waste of time. Writing a lexer and 
 parser is a tiny percentage of the job of writing a compiler. 
 Using a generator will take up about as much time as writing 
 one by hand, and it will marry you to the generator (which 
 matters when porting the compiler to a new platform). And 
 generators also have the unfortunate reputation of emitting 
 lousy error messages.

Using bison I can write complex statements which are easy for me to grok and change. I wouldn't know how to change complex statements if I hand wrote the parser. I don't know all the places something like that would affect. Taking the syntax below I'm not sure how to fork a state and discard the invalid one.
 foo[5] = var2
 foo[5] foo = var2

Here when I see foo[5] I'm either accessing an array (first statement) or declaring variables as an array of 5 elements (second statement). Just like `Foo&foo` could be a reference or could be an AND statement. Foo is definitely processed on its own, I don't know how to process it as both (a fork) and continue on the parser to find a valid path.

A simple solution to this problem would be to try and parse the longer statement first. If that doesn't work, then go for the shorter one. In your first case you have two possible declarations: - the 'variable definition': [Type] [Name] ('=' [R-Expression])? - the 'value assignment'. [L-Expression] '=' [R-Expression] When you can't parse a variable definition, try the value definition. I found that writing a parser in general is not that hard. The difficult part is figuring out what the syntax and the AST should look like. I recommend just trying to create something for a limited set of your language and work your way up from there.
Jan 23 2014