
digitalmars.D - Writing a JFlex lexer for D - have an issue with cycles

reply FatalCatharsis <wykennedy gmail.com> writes:
I'm writing a flex lexer for D and I've hit a roadblock. It is 
almost working EXCEPT for one specific production.

StringLiteral is cyclic and I don't know how to approach it. It 
is cyclic because:

      Token -> StringLiteral -> TokenString -> Token

To break the cycle, I was thinking I could just make a production 
which is Token minus StringLiteral, substituting a production 
for StringLiteral that does not contain TokenString, 
but that fundamentally changes the language. Should the lexer 
really handle something like:

     q{blah1q{20q{"meh"q{20.1q{blah}}}}}

Lexically I don't know how this makes sense. To be clear, I'm 
wondering if this is acceptable:

     Token:
         Identifier
         StringLiteral
         CharacterLiteral
         IntegerLiteral
         FloatLiteral
         Keyword
         Operator

      StringLiteral:
         WysiwygString
         AlternateWysiwygString
         DoubleQuotedString
         HexString
         DelimitedString
         TokenString

      TokenString:
         q{ TokenNonNestedTokenStrings }


      TokenNonNestedTokenStrings:
         TokenNonNestedTokenString
         TokenNonNestedTokenString TokenNonNestedTokenStrings

      TokenNonNestedTokenString:
         Identifier
         StringLiteralNonNestedTokenString
         CharacterLiteral
         IntegerLiteral
         FloatLiteral
         Keyword
         Operator

      StringLiteralNonNestedTokenString:
         WysiwygString
         AlternateWysiwygString
         DoubleQuotedString
         HexString
         DelimitedString

Which basically disables nested token strings. Has anyone else 
run into this issue?
Jan 22
next sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Sunday, 22 January 2017 at 22:11:08 UTC, FatalCatharsis wrote:
 I'm writing a flex lexer for D and I've hit a roadblock. It is 
 almost working EXCEPT for one specific production.

 [...]
One way to do this is not to do anything special for q{. Just add a token for q{ and continue normal lexing. The token string content must be valid tokens, so it should work.

In fact, it depends on what the scanner will be used for: for serious compiler work this is not acceptable, but for highlighting or tools it might be okay. The real problem is that a token string should be printable... so it exists as a whole before being tokenized.
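If it helps, here's roughly what that looks like as a hand-rolled sketch in Java (not actual JFlex output; the token names and the crude identifier/number rules are just illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the suggestion above: treat "q{" as an ordinary token and keep
// lexing normally. The lexer stays non-recursive; matching the closing brace
// is left to whoever consumes the token stream. All names are hypothetical.
public class QTokenLexer {
    public static List<String> lex(String src) {
        List<String> toks = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == 'q' && i + 1 < src.length() && src.charAt(i + 1) == '{') {
                toks.add("TOKEN_STRING_OPEN");   // the "q{" token
                i += 2;
            } else if (c == '{') { toks.add("LCURLY"); i++; }
            else if (c == '}') { toks.add("RCURLY"); i++; }
            else if (Character.isLetter(c)) {    // crude identifier rule
                int j = i;
                while (j < src.length() && Character.isLetterOrDigit(src.charAt(j))) j++;
                toks.add("IDENT(" + src.substring(i, j) + ")");
                i = j;
            } else if (Character.isDigit(c)) {   // crude number rule
                int j = i;
                while (j < src.length() && (Character.isDigit(src.charAt(j)) || src.charAt(j) == '.')) j++;
                toks.add("NUM(" + src.substring(i, j) + ")");
                i = j;
            } else i++;                          // skip anything else
        }
        return toks;
    }

    public static void main(String[] args) {
        System.out.println(lex("q{foo q{42}}"));
    }
}
```

Note the token stream for nested input is flat; the nesting only reappears when something downstream pairs up TOKEN_STRING_OPEN with RCURLY.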
Jan 22
parent reply FatalCatharsis <wykennedy gmail.com> writes:
On Sunday, 22 January 2017 at 23:20:27 UTC, Basile B. wrote:
 One way to do this is not to do anything special for q{. Just 
 add a token for q{ and continue normal lexing. The token string 
 content must be valid tokens so it should work.

 In facts it depends on what the scanner just be used for...for 
 serious compiler things this is not acceptable. For 
 highlighting or tools this might be okay. The real problem is 
 that a token string should be printable... so it exists as a 
 whole before being tokenized.
This is for tooling, but I still want it to be accurate. Are you suggesting I could move handling of token strings to the syntax side of things? I could remove it from the lexer entirely and do like you said: have a token q{ and a syntax production that uses it, like so:

     Token ::= IDENTIFIER | STRING_LITERAL | TokenString | CHARACTER_LITERAL | INTEGER_LITERAL | FLOAT_LITERAL | KEYWORD | OPERATOR

     TokenString ::= TOKEN_STRING_OP Token RIGHT_CURLY_OP

Then anywhere there is a StringLiteral (e.g. StringLiterals, AsmInstruction, TemplateSingleArgument), provide an equivalent production that uses TokenString in place of StringLiteral. Would this be equivalent? (Just want another set of eyes before I call it good.)
Jan 22
parent Basile B. <b2.temp gmx.com> writes:
On Monday, 23 January 2017 at 00:03:44 UTC, FatalCatharsis wrote:
 On Sunday, 22 January 2017 at 23:20:27 UTC, Basile B. wrote:
 One way to do this is not to do anything special for q{. [...]
 This is for tooling, but I still want it to be accurate. Are you suggesting I could move handling of token strings to the syntax side of things? [...]
Yes, this was the idea, but "Profile Anaysis"'s first answer is better. You really should have a single string for any token string, even if it's nested. Until the token string is used, it doesn't need to be tokenized. The solution that tokenizes the string is only useful for highlighting.
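For what it's worth, capturing the whole token string as a single lexeme only needs a brace counter, no recursion. A rough Java sketch (a hypothetical helper; a real D scanner would also have to skip string literals and comments so braces inside them don't affect the count):

```java
// Scan a q{...} literal as ONE lexeme by counting brace depth, so nesting
// needs no recursion at all. Returns the full literal starting at 'start',
// or null if the literal is unterminated.
public class TokenStringScanner {
    public static String scanTokenString(String src, int start) {
        if (!src.startsWith("q{", start)) return null;
        int depth = 0;
        for (int i = start + 1; i < src.length(); i++) {  // i = index of first '{'
            char c = src.charAt(i);
            if (c == '{') depth++;
            else if (c == '}') {
                depth--;
                if (depth == 0) return src.substring(start, i + 1);
            }
        }
        return null;  // ran out of input: unterminated literal
    }

    public static void main(String[] args) {
        System.out.println(scanTokenString("q{blah1q{20q{x}}}", 0));
    }
}
```

The nested q{'s simply bump the same counter as plain {'s, which is why the literal comes back as one undivided string.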
Jan 22
prev sibling parent reply Profile Anaysis <PA gotacha.com> writes:
On Sunday, 22 January 2017 at 22:11:08 UTC, FatalCatharsis wrote:
 I'm writing a flex lexer for D and I've hit a roadblock. It is 
 almost working EXCEPT for one specific production.

 [...]
This is not really any different than any nesting problem. First, you must realize that it is more like a "spiral" than a circle. It is recursive in this sense:

     Token_n -> StringLiteral_n -> TokenString_n -> Token_(n-1)

and eventually Token_(n-1) must terminate the chain (essentially, the rule must fail for some n). But this is ok, because parsers are recursive in nature, so you just have to have your rule be able to terminate in a logical way.

What we can say about the q{} string literal is that it either contains other string literals or it doesn't. If it doesn't, that is all that is required, because a nested set of string literals has to contain at least one that is not nested. That collapses the whole chain.

So, the way I see it, you really need your TokenString to have a nested and a non-nested version. The nested version will cycle but also has an out. e.g.,
     Token:
         Identifier
         StringLiteral
         CharacterLiteral
         IntegerLiteral
         FloatLiteral
         Keyword
         Operator

      StringLiteral:
         WysiwygString
         AlternateWysiwygString
         DoubleQuotedString
         HexString
         DelimitedString
         TokenString

      TokenString:
         q{ StringLiteralNonNestedTokenString }  <- Terminal
         q{ NestedTokenString }

      NestedTokenString:
         Tokens TokenString Tokens
      StringLiteralNonNestedTokenString:
         WysiwygString
         AlternateWysiwygString
         DoubleQuotedString
         HexString
         DelimitedString
         !q{ !}

So, a TokenString can be a "simple string" that doesn't nest, or it can have nesting. That nesting can go on forever if the source has it, and this grammar can handle it. But if a token string, regardless of whether it is inside another token string, terminates / is not nesting another token string, then you are fine too, because that catches the eventual termination and will back-propagate through the nesting to determine the other tokens.

The real issue is ambiguity. Any time you have a cycle you must be able to get out of it, so your rules must be organized so that one always checks whether termination has occurred before checking for nesting. If you allow, say, } as an element of a string literal, then it will be ambiguous, as the grammar will try to match it when it is meant to be a literal.

I'm not 100% sure about your grammar expression, but basically:

     q{blah1  <- q{NestedTokenString}/nonterminal
     q{20     <- q{NestedTokenString}/nonterminal
     q{"meh"  <- q{NestedTokenString}/nonterminal
     q{20.1   <- q{NestedTokenString}/nonterminal
     q{blah}  <- q{StringLiteralNonNestedTokenString}/terminal
     }
     }
     }
     }

Basically, as we parse the string, we encounter q{'s, and these cause a "cycle" (but it really is a spiral ;). When we get to the final one, we see that it is a terminal because it doesn't contain any q{ inside. Hence, at that point the grammar unwinds and builds the tree quite easily.

Again, it is really no different than ()'s and such. One just has to make sure that, say, ( and ) are not used for anything else in an ambiguous way. e.g., what does (3+)) mean? Suppose ) were also the same as 4: someone thinks (3+4) = 7, but the compiler cannot distinguish between them. (It could be made to in some ways, but the grammar is then not context free.) You shouldn't have to worry too much about those cases, but you do have to make sure your grammar understands that the tokens it uses to determine a rule cannot be interpreted in any other way along that parsed branch.
Else you end up with an ambiguous or context sensitive grammar.
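The spiral can be written directly as recursive descent. A toy Java sketch (deliberately simplified: the only "tokens" inside a token string here are single characters or nested q{...}, and all names are made up):

```java
// The recursion bottoms out at a token string whose body contains no
// nested q{ at all -- the terminal case that collapses the chain.
public class SpiralParser {
    private final String src;
    private int pos;
    private int maxDepth;

    SpiralParser(String src) { this.src = src; }

    // TokenString: q{ Tokens } -- true if a complete token string was consumed
    boolean tokenString(int depth) {
        if (!src.startsWith("q{", pos)) return false;
        pos += 2;
        maxDepth = Math.max(maxDepth, depth);
        while (pos < src.length() && src.charAt(pos) != '}') {
            if (src.startsWith("q{", pos)) {
                if (!tokenString(depth + 1)) return false;  // recursive case
            } else {
                pos++;  // any other token: just consume (no recursion)
            }
        }
        if (pos >= src.length()) return false;  // unterminated
        pos++;  // consume '}'
        return true;
    }

    // Deepest nesting level reached, or -1 if the input is not one token string.
    public static int nestingDepth(String s) {
        SpiralParser p = new SpiralParser(s);
        return (p.tokenString(1) && p.pos == s.length()) ? p.maxDepth : -1;
    }

    public static void main(String[] args) {
        System.out.println(nestingDepth("q{blah1q{20q{x q{20.1q{blah}}}}}"));
    }
}
```

Each nested q{ pushes one stack frame; the innermost literal with no q{ inside returns immediately and the whole chain unwinds, exactly as described above.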
Jan 22
next sibling parent reply Chris Wright <dhasenan gmail.com> writes:
On Mon, 23 Jan 2017 00:46:30 +0000, Profile Anaysis wrote:
 But this is ok because parsers are recursive in nature, so you just have
 to have your rule be able to terminate in a logical way.
Except it's part of the lexer, and most people (such as those creating the JFlex lexer generator) assume that no lexer needs to be recursive.
Jan 22
parent FatalCatharsis <wykennedy gmail.com> writes:
On Monday, 23 January 2017 at 01:40:28 UTC, Chris Wright wrote:
 On Mon, 23 Jan 2017 00:46:30 +0000, Profile Anaysis wrote:
 But this is ok because parsers are recursive in nature, so you 
 just have to have your rule be able to terminate in a logical 
 way.
Except it's part of the lexer, and most people (such as those creating the JFlex lexer generator) assume that no lexer needs to be recursive.
I was under the impression that a lexer could not be recursive by nature, that it was up to the parser to handle recursion and nesting productions. Can you provide articles/examples showing how a lexer can handle recursive cases?
Jan 22
prev sibling parent reply FatalCatharsis <wykennedy gmail.com> writes:
On Monday, 23 January 2017 at 00:46:30 UTC, Profile Anaysis wrote:
 The real issue is ambiguity. Any time you have a cycle you must 
 be able to get out of it and so your rules must be organized so 
 that one always checks to see if termination has occurred 
 before checking for nesting. If you allow, say } as an element 
 of a stringliteral then it will be ambiguous as the grammar 
 will try to match it, when it is meant to be a literal.
This brings up another question for me. Isn't the token string production ambiguous by itself?

     q{ Tokens }

How does it know whether the input "}" is another token or the terminator of a token string?
Jan 22
parent Profile Anaysis <PA gotacha.com> writes:
On Monday, 23 January 2017 at 01:46:58 UTC, FatalCatharsis wrote:
 On Monday, 23 January 2017 at 00:46:30 UTC, Profile Anaysis 
 wrote:
 The real issue is ambiguity. Any time you have a cycle you 
 must be able to get out of it and so your rules must be 
 organized so that one always checks to see if termination has 
 occurred before checking for nesting. If you allow, say } as 
 an element of a stringliteral then it will be ambiguous as the 
 grammar will try to match it, when it is meant to be a literal.
 This brings up another question for me. Isn't the token string production ambiguous by itself?

     q{ Tokens }

 How does it know whether the input "}" is another token or the terminator of a token string?
You know how an if statement in many languages has the feature of short-circuit evaluation?

     if (X | Y)

X is tested first, and if it is true, there is no need to test Y. Now let's say that X is true at some point but Y is "recursive" in some sense: while X is false, Y is evaluated, but once X becomes true there is no need for Y to be evaluated.

This "analogy" is basically what is happening in the grammar for nested rules. As long as we have a terminating rule that is checked first and bypasses the recursive rules, we will terminate. So, to understand this easily, we think about it this way for any type of nested thing:

1. Create the terminating instance (e.g., the X above) that has no recursive properties/non-terminating behavior.
2. Create the recursive instance (the Y).

Then effectively do the if (X | Y). This can generally be mapped directly into a grammar, because the if statements are implicit in the rules (since it is a decision tree).

So, to "create X", which is simply a string literal, it has the structure q{...}. This structure must be unambiguous for the ..., which means the ... must not have q{ or } in it unless they are escaped in some way (which essentially just changes them into something else; e.g., \n is not \ and n but a completely different character as far as the grammar is concerned). Now, that is the terminal rule. It can't recurse, because the ... will be parsed as "whatever", and that whatever can't, by design (we hope), be re-parsed as itself in any way.

Once we have that, we create the non-terminal recursive rule: q{...} in which the ... can be itself (or something else which then is this, or whatever). You might have to make sure the first case actually is a terminal and can't recurse on itself. Once you have that, it is easy to create a terminating grammar:

     Y = q{X | Y}

First X is checked; if it is found, then Y terminates, else we try Y. Such a rule as the above can be expanded.
     Y = q{X | q{X | q{X | q{X | q{X | q{X | q{X | q{X | q{X | ... Y}

so this only allows strings like:

     q{X}
     q{q{X}}
     q{q{q{X}}}
     q{q{q{q{X}}}}
     q{q{q{q{q{X}}}}}
     etc...

But they always terminate as long as X terminates. The compiler's logic is this: check if the tokens look like q{, then some form of X; if there is no X, then check if they match Y (which requires reapplying the whole check); then a }.

Suppose our rule is Y = q{1X | 2Y}. Then

     q{1X}
     q{2q{1X}}
     q{2q{2q{1X}}}
     q{2q{2q{2q{1X}}}}
     q{2q{2q{2q{2q{1X}}}}}

would be the valid matches. If we allow other alternates like

     Y = q{X | (A | B)}
     A = l{Y}
     B = ($|%)A

then things like

     q{X}
     q{l{q{X}}}
     q{l{q{l{q{X}}}}}

are valid (basically A is Y as before but with an l), while

     q{l{q{l{X}}}}

is not valid, because l{X} is not in the rules above; it must be of the form l{Y}, and Y always starts with a q{. The rule actually fails at the innermost l{X} because l{X} is not matched (no rule). And

     q{l{q{$l{q{X}}}}}

is valid; this is the rules applied in the form Y -> A -> B -> Y.

You can see that, with the proper rules, we can generate just about any form we want. The things to keep track of are: X must never contain tokens/matches that allow the other non-terminals to fail (unless that is the goal), and X must be a terminal. In the above example, if X could contain l, {, }, $, and/or %, then we have no way of knowing whether our tokens are really X or Y. e.g., with q{l{q{l{q{X}}}}}, X could just be l{q{l{q{X}}}} and we would terminate immediately. This would make parsing the nesting impossible. We can say that X must be "disjoint" from Y for the grammar to function as intended.

Second, you must list your rules in short-circuit order so that the terminal is always checked first. This is because the Y rules may actually contain X (they do in fact, and must, if the grammar is recursive). Hence, if we check the Y rules first, we will end up in a cycle, because we will never be able to determine that the Y rule is really an X, and X is what allows us to terminate the cycle in the first place.
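To make the short-circuit ordering concrete, here's a toy recursive matcher in Java for Y = q{ X | Y }, with the terminal X tried first. X here is a nonempty digit run, deliberately disjoint from Y (all of this is illustrative, not the actual D grammar):

```java
// Ordered choice, short-circuit style: try the terminal X first, and only
// fall through to the recursive Y when X does not match.
public class OrderedChoice {
    // returns the index just past the match, or -1 on failure
    static int matchY(String s, int pos) {
        if (!s.startsWith("q{", pos)) return -1;
        int p = pos + 2;
        int afterX = matchX(s, p);
        int afterInner = (afterX >= 0) ? afterX : matchY(s, p);  // X first, then Y
        if (afterInner < 0) return -1;
        if (afterInner < s.length() && s.charAt(afterInner) == '}') return afterInner + 1;
        return -1;
    }

    static int matchX(String s, int pos) {  // terminal: one or more digits
        int p = pos;
        while (p < s.length() && Character.isDigit(s.charAt(p))) p++;
        return (p > pos) ? p : -1;
    }

    public static boolean matches(String s) { return matchY(s, 0) == s.length(); }

    public static void main(String[] args) {
        System.out.println(matches("q{q{q{7}}}"));
    }
}
```

If X were instead allowed to match the letter q, the first alternative would swallow the start of a nested Y and the parse would fail; that is the disjointness requirement above in action.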
It is not really as difficult as it seems. It is just a way of thinking and expressing things that makes the difference. Remember that a grammar is effectively just a giant "if statement machine", and so similar logic holds. Just as we can do invalid things with if statements, we can do similar things with grammars. The grammars do exactly as we tell them. Maybe we want infinite cycles! Maybe we want them to terminate only after 150 loops, e.g.,

     Y = Y_(n<150) | X

which could be read: use the rule Y 150 times and then switch to rule X. (This is just a sort of spin-wait in the grammar.) e.g., if our grammar were expressing how enemies think in a video game, this could be seen as a sort of pause in their thinking. It could be used to draw a fractal, where we could have several parallel grammars working at the same time and this burns up cycles while other grammars do things. It is a contrived example, of course...

For programming languages, we do not want complexities that we can't wrap our minds around (since programming languages are the basis for building complex programs, that would make the complexity too difficult to handle in the long run). The rule to solve these problems is to use the short-circuit concept, as then you can list your rules in order from least complex to most. The least complex, if a terminal, will then always allow your rules to terminate once it is properly matched.

So your job is to construct your terminating string literal by knowing what makes the tokens not terminate... then simply make sure they can't be in your terminating string literal. Second, construct your non-terminating rules to be however you want the nesting to be.

This is a general procedure. Any "sequence" of things has these properties. A literal in a grammar is simply a terminal rule that allows all the other rules to terminate at some point. Without them, we can't build a structure that can terminate. e.g., lexing an integer:

     integer = Digit | (Digit (Digit | (Digit ...)))
     = Digit*

It is a recursive rule; it's just that the recursion has 0 depth in some sense, or we use meta-rules to keep it succinct (e.g., the *). Digit, a terminal, terminates the recursion. We could write it as

     integer  = Digit | DigitSeq
     DigitSeq = Digit | Digit DigitSeq

or

     integer  = Digit | DigitSeq
     DigitSeq = Digit | Digit integer

or

     integer = Digit | integer integer

(which, in this case, is simple because any integer can be seen as two integers concatenated, except for a single digit), but, of course,

     integer = integer | Digit

makes no sense, because it is effectively an infinite loop (we have to try to match some terminal to get somewhere).

Obviously things can get tricky, but programming languages are designed so that ambiguity and needless complexity are avoided. Programmers are in the business of programming to get a job done; if they burn needless cycles (like the spin loop) for no benefit, they are less efficient at their job. Hopefully that helps some.
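The integer example, with the recursion written out explicitly in Java (a hypothetical sketch; a Digit is consumed before every recursive step, so it must terminate):

```java
// DigitSeq = Digit | Digit DigitSeq, as explicit recursion. Because each
// call consumes one Digit before recursing, the rule always terminates;
// a rule like "integer = integer | Digit" would recurse before consuming
// anything and never make progress.
public class DigitSeq {
    static int matchDigitSeq(String s, int pos) {  // returns end index or -1
        if (pos >= s.length() || !Character.isDigit(s.charAt(pos))) return -1;
        int rest = matchDigitSeq(s, pos + 1);      // Digit DigitSeq
        return (rest >= 0) ? rest : pos + 1;       // or just Digit
    }

    public static boolean isInteger(String s) {
        return matchDigitSeq(s, 0) == s.length();
    }

    public static void main(String[] args) {
        System.out.println(isInteger("12345"));
        System.out.println(isInteger("12a45"));
    }
}
```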
Jan 23