
digitalmars.D - Writing a JFlex lexer for D - have an issue with cycles

reply FatalCatharsis <wykennedy gmail.com> writes:
I'm writing a flex lexer for D and I've hit a roadblock. It is 
almost working EXCEPT for one specific production.

StringLiteral is cyclic and I don't know how to approach it. It 
is cyclic because:

      Token -> StringLiteral -> TokenString -> Token

To break the cycle, I was thinking I could just make a production 
which is Token minus StringLiteral, substituting a production 
for StringLiteral that does not contain TokenString, 
but that fundamentally changes the language. Should the lexer 
really handle something like:

     q{blah1q{20q{"meh"q{20.1q{blah}}}}}

Lexically I don't know how this makes sense. To be clear, I'm 
wondering if this is acceptable:

     Token:
         Identifier
         StringLiteral
         CharacterLiteral
         IntegerLiteral
         FloatLiteral
         Keyword
         Operator

      StringLiteral:
         WysiwygString
         AlternateWysiwygString
         DoubleQuotedString
         HexString
         DelimitedString
         TokenString

      TokenString:
         q{ TokenNonNestedTokenStrings }


      TokenNonNestedTokenStrings:
         TokenNonNestedTokenString
         TokenNonNestedTokenString TokenNonNestedTokenStrings

      TokenNonNestedTokenString:
         Identifier
         StringLiteralNonNestedTokenString
         CharacterLiteral
         IntegerLiteral
         FloatLiteral
         Keyword
         Operator

      StringLiteralNonNestedTokenString:
         WysiwygString
         AlternateWysiwygString
         DoubleQuotedString
         HexString
         DelimitedString

Which basically disables nested token strings. Has anyone else 
run into this issue?
Jan 22
next sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Sunday, 22 January 2017 at 22:11:08 UTC, FatalCatharsis wrote:
 I'm writing a flex lexer for D and I've hit a roadblock. It is 
 almost working EXCEPT for one specific production.

 [...]
One way to do this is not to do anything special for q{. Just add a token for q{ and continue normal lexing. The token string content must be valid tokens, so it should work.

In fact, it depends on what the scanner will be used for: for serious compiler work this is not acceptable, but for highlighting or tools it might be okay. The real problem is that a token string should be printable... so it exists as a whole before being tokenized.
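If it helps, here's roughly what that looks like as a hand-rolled sketch in Java (not actual JFlex output; the token names and the crude identifier/number rules are just illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the suggestion above: treat "q{" as an ordinary token and keep
// lexing normally. The lexer stays non-recursive; matching the closing brace
// is left to whoever consumes the token stream. All names are hypothetical.
public class QTokenLexer {
    public static List<String> lex(String src) {
        List<String> toks = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == 'q' && i + 1 < src.length() && src.charAt(i + 1) == '{') {
                toks.add("TOKEN_STRING_OPEN");   // the "q{" token
                i += 2;
            } else if (c == '{') { toks.add("LCURLY"); i++; }
            else if (c == '}') { toks.add("RCURLY"); i++; }
            else if (Character.isLetter(c)) {    // crude identifier rule
                int j = i;
                while (j < src.length() && Character.isLetterOrDigit(src.charAt(j))) j++;
                toks.add("IDENT(" + src.substring(i, j) + ")");
                i = j;
            } else if (Character.isDigit(c)) {   // crude number rule
                int j = i;
                while (j < src.length() && (Character.isDigit(src.charAt(j)) || src.charAt(j) == '.')) j++;
                toks.add("NUM(" + src.substring(i, j) + ")");
                i = j;
            } else i++;                          // skip anything else
        }
        return toks;
    }

    public static void main(String[] args) {
        System.out.println(lex("q{foo q{42}}"));
    }
}
```

Note the token stream for nested input is flat; the nesting only reappears when something downstream pairs up TOKEN_STRING_OPEN with RCURLY.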
Jan 22
parent reply FatalCatharsis <wykennedy gmail.com> writes:
On Sunday, 22 January 2017 at 23:20:27 UTC, Basile B. wrote:
 One way to do this is not to do anything special for q{. Just 
 add a token for q{ and continue normal lexing. The token string 
 content must be valid tokens so it should work.

 In facts it depends on what the scanner just be used for...for 
 serious compiler things this is not acceptable. For 
 highlighting or tools this might be okay. The real problem is 
 that a token string should be printable... so it exists as a 
 whole before being tokenized.
This is for tooling, but I still want it to be accurate. Are you suggesting I could move handling of token strings to the syntax side of things? I could remove it from the lexer entirely and do like you said: have a token q{ and a syntax production that uses it, like so:

     Token ::= IDENTIFIER | STRING_LITERAL | TokenString | CHARACTER_LITERAL | INTEGER_LITERAL | FLOAT_LITERAL | KEYWORD | OPERATOR

     TokenString ::= TOKEN_STRING_OP Token RIGHT_CURLY_OP

Then anywhere there is a StringLiteral (e.g. StringLiterals, AsmInstruction, TemplateSingleArgument), provide an equivalent production that uses TokenString in place of StringLiteral. Would this be equivalent? (Just want another set of eyes before I call it good.)
Jan 22
parent Basile B. <b2.temp gmx.com> writes:
On Monday, 23 January 2017 at 00:03:44 UTC, FatalCatharsis wrote:
 On Sunday, 22 January 2017 at 23:20:27 UTC, Basile B. wrote:
 One way to do this is not to do anything special for q{. [...]
 This is for tooling, but I still want it to be accurate. Are you suggesting I could move handling of token strings to the syntax side of things? [...]
Yes, this was the idea, but "Profile Anaysis"'s first answer is better. You really should have a single string for any token string, even if it's nested. Until the token string is used, it doesn't need to be tokenized. The solution that tokenizes the string is only useful for highlighting.
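For what it's worth, capturing the whole token string as a single lexeme only needs a brace counter, no recursion. A rough Java sketch (a hypothetical helper; a real D scanner would also have to skip string literals and comments so braces inside them don't affect the count):

```java
// Scan a q{...} literal as ONE lexeme by counting brace depth, so nesting
// needs no recursion at all. Returns the full literal starting at 'start',
// or null if the literal is unterminated.
public class TokenStringScanner {
    public static String scanTokenString(String src, int start) {
        if (!src.startsWith("q{", start)) return null;
        int depth = 0;
        for (int i = start + 1; i < src.length(); i++) {  // i = index of first '{'
            char c = src.charAt(i);
            if (c == '{') depth++;
            else if (c == '}') {
                depth--;
                if (depth == 0) return src.substring(start, i + 1);
            }
        }
        return null;  // ran out of input: unterminated literal
    }

    public static void main(String[] args) {
        System.out.println(scanTokenString("q{blah1q{20q{x}}}", 0));
    }
}
```

The nested q{'s simply bump the same counter as plain {'s, which is why the literal comes back as one undivided string.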
Jan 22
prev sibling parent reply Profile Anaysis <PA gotacha.com> writes:
On Sunday, 22 January 2017 at 22:11:08 UTC, FatalCatharsis wrote:
 I'm writing a flex lexer for D and I've hit a roadblock. It is 
 almost working EXCEPT for one specific production.

 [...]
This is not really any different than any nesting problem. First, you must realize that it is more like a "spiral" than a circle. It is recursive in this sense:

     Token_n -> StringLiteral_n -> TokenString_n -> Token_(n-1)

and eventually Token_(n-1) must terminate the chain (essentially, the rule must fail for some n). But this is ok, because parsers are recursive in nature, so you just have to have your rule be able to terminate in a logical way.

What we can say about the q{} string literal is that it either contains other string literals or it doesn't. If it doesn't, that is all that is required, because a nested set of string literals has to contain at least one that is not nested. That collapses the whole chain.

So, the way I see it, you really need your TokenString to have a nested and a non-nested version. The nested version will cycle but also has an out. e.g.,
     Token:
         Identifier
         StringLiteral
         CharacterLiteral
         IntegerLiteral
         FloatLiteral
         Keyword
         Operator

      StringLiteral:
         WysiwygString
         AlternateWysiwygString
         DoubleQuotedString
         HexString
         DelimitedString
         TokenString

      TokenString:
         q{ StringLiteralNonNestedTokenString }  <- Terminal
         q{ NestedTokenString }

      NestedTokenString:
         Tokens TokenString Tokens
      StringLiteralNonNestedTokenString:
         WysiwygString
         AlternateWysiwygString
         DoubleQuotedString
         HexString
         DelimitedString
         !q{ !}

So, a TokenString can be a "simple string" that doesn't nest, or it can have nesting. That nesting can go on forever if the source has it, and this grammar can handle it. But if a token string, regardless of whether it is inside another token string, terminates / is not nesting another token string, then you are fine too, because that catches the eventual termination and will back-propagate through the nesting to determine the other tokens.

The real issue is ambiguity. Any time you have a cycle you must be able to get out of it, so your rules must be organized so that one always checks whether termination has occurred before checking for nesting. If you allow, say, } as an element of a string literal, then it will be ambiguous, as the grammar will try to match it when it is meant to be a literal.

I'm not 100% sure about your grammar expression, but basically:

     q{blah1  <- q{NestedTokenString}/nonterminal
     q{20     <- q{NestedTokenString}/nonterminal
     q{"meh"  <- q{NestedTokenString}/nonterminal
     q{20.1   <- q{NestedTokenString}/nonterminal
     q{blah}  <- q{StringLiteralNonNestedTokenString}/terminal
     }
     }
     }
     }

Basically, as we parse the string, we encounter q{'s, and these cause a "cycle" (but it really is a spiral ;). When we get to the final one, we see that it is a terminal because it doesn't contain any q{ inside. Hence, at that point the grammar unwinds and builds the tree quite easily.

Again, it is really no different than ()'s and such. One just has to make sure that, say, ( and ) are not used for anything else in an ambiguous way. e.g., what does (3+)) mean? Suppose ) were also the same as 4: someone thinks (3+4) = 7, but the compiler cannot distinguish between them. (It could be made to in some ways, but the grammar is then not context free.) You shouldn't have to worry too much about those cases, but you do have to make sure your grammar understands that the tokens it uses to determine a rule cannot be interpreted in any other way along that parsed branch.
Else you end up with an ambiguous or context sensitive grammar.
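The spiral can be written directly as recursive descent. A toy Java sketch (deliberately simplified: the only "tokens" inside a token string here are single characters or nested q{...}, and all names are made up):

```java
// The recursion bottoms out at a token string whose body contains no
// nested q{ at all -- the terminal case that collapses the chain.
public class SpiralParser {
    private final String src;
    private int pos;
    private int maxDepth;

    SpiralParser(String src) { this.src = src; }

    // TokenString: q{ Tokens } -- true if a complete token string was consumed
    boolean tokenString(int depth) {
        if (!src.startsWith("q{", pos)) return false;
        pos += 2;
        maxDepth = Math.max(maxDepth, depth);
        while (pos < src.length() && src.charAt(pos) != '}') {
            if (src.startsWith("q{", pos)) {
                if (!tokenString(depth + 1)) return false;  // recursive case
            } else {
                pos++;  // any other token: just consume (no recursion)
            }
        }
        if (pos >= src.length()) return false;  // unterminated
        pos++;  // consume '}'
        return true;
    }

    // Deepest nesting level reached, or -1 if the input is not one token string.
    public static int nestingDepth(String s) {
        SpiralParser p = new SpiralParser(s);
        return (p.tokenString(1) && p.pos == s.length()) ? p.maxDepth : -1;
    }

    public static void main(String[] args) {
        System.out.println(nestingDepth("q{blah1q{20q{x q{20.1q{blah}}}}}"));
    }
}
```

Each nested q{ pushes one stack frame; the innermost literal with no q{ inside returns immediately and the whole chain unwinds, exactly as described above.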
Jan 22
next sibling parent reply Chris Wright <dhasenan gmail.com> writes:
On Mon, 23 Jan 2017 00:46:30 +0000, Profile Anaysis wrote:
 But this is ok because parsers are recursive in nature, so you just have
 to have your rule be able to terminate in a logical way.
Except it's part of the lexer, and most people (such as those creating the JFlex lexer generator) assume that no lexer needs to be recursive.
Jan 22
parent FatalCatharsis <wykennedy gmail.com> writes:
On Monday, 23 January 2017 at 01:40:28 UTC, Chris Wright wrote:
 On Mon, 23 Jan 2017 00:46:30 +0000, Profile Anaysis wrote:
 But this is ok because parsers are recursive in nature, so you 
 just have to have your rule be able to terminate in a logical 
 way.
Except it's part of the lexer, and most people (such as those creating the JFlex lexer generator) assume that no lexer needs to be recursive.
I was under the impression that a lexer could not be recursive by nature, that it was up to the parser to handle recursion and nesting productions. Can you provide articles/examples showing how a lexer can handle recursive cases?
Jan 22
prev sibling parent reply FatalCatharsis <wykennedy gmail.com> writes:
On Monday, 23 January 2017 at 00:46:30 UTC, Profile Anaysis wrote:
 The real issue is ambiguity. Any time you have a cycle you must 
 be able to get out of it and so your rules must be organized so 
 that one always checks to see if termination has occurred 
 before checking for nesting. If you allow, say } as an element 
 of a stringliteral then it will be ambiguous as the grammar 
 will try to match it, when it is meant to be a literal.
This brings up another question for me. Isn't the token string production ambiguous by itself?

     q{ Tokens }

How does it know whether the input "}" is another token or the terminator of a token string?
Jan 22
parent Profile Anaysis <PA gotacha.com> writes:
On Monday, 23 January 2017 at 01:46:58 UTC, FatalCatharsis wrote:
 On Monday, 23 January 2017 at 00:46:30 UTC, Profile Anaysis 
 wrote:
 The real issue is ambiguity. Any time you have a cycle you 
 must be able to get out of it and so your rules must be 
 organized so that one always checks to see if termination has 
 occurred before checking for nesting. If you allow, say } as 
 an element of a stringliteral then it will be ambiguous as the 
 grammar will try to match it, when it is meant to be a literal.
 This brings up another question for me. Isn't the token string production ambiguous by itself?

     q{ Tokens }

 How does it know whether the input "}" is another token or the terminator of a token string?
You know how an if statement in many languages has the feature of short-circuit evaluation?

     if (X | Y)

X is tested first, and if it is true, there is no need to test Y. Now let's say that X is true at some point but Y is "recursive" in some sense: while X is false, Y is evaluated, but once X becomes true there is no need for Y to be evaluated.

This "analogy" is basically what is happening in the grammar for nested rules. As long as we have a terminating rule that is checked first and bypasses the recursive rules, we will terminate. So, to understand this easily, we think about it this way for any type of nested thing:

1. Create the terminating instance (e.g., the X above) that has no recursive properties/non-terminating behavior.
2. Create the recursive instance (the Y).

Then effectively do the if (X | Y). This can generally be mapped directly into a grammar, because the if statements are implicit in the rules (since it is a decision tree).

So, to "create X", which is simply a string literal, it has the structure q{...}. This structure must be unambiguous for the ..., which means the ... must not have q{ or } in it unless they are escaped in some way (which essentially just changes them into something else; e.g., \n is not \ and n but a completely different character as far as the grammar is concerned). Now, that is the terminal rule. It can't recurse, because the ... will be parsed as "whatever", and that whatever can't, by design (we hope), be re-parsed as itself in any way.

Once we have that, we create the non-terminal recursive rule: q{...} in which the ... can be itself (or something else which then is this, or whatever). You might have to make sure the first case actually is a terminal and can't recurse on itself. Once you have that, it is easy to create a terminating grammar:

     Y = q{X | Y}

First X is checked; if it is found, then Y terminates, else we try Y. Such a rule as the above can be expanded.
     Y = q{X | q{X | q{X | q{X | q{X | q{X | q{X | q{X | q{X | ... Y}

so this only allows strings like:

     q{X}
     q{q{X}}
     q{q{q{X}}}
     q{q{q{q{X}}}}
     q{q{q{q{q{X}}}}}
     etc...

But they always terminate as long as X terminates. The compiler's logic is this: check if the tokens look like q{, then some form of X; if there is no X, then check if they match Y (which requires reapplying the whole check); then a }.

Suppose our rule is Y = q{1X | 2Y}. Then

     q{1X}
     q{2q{1X}}
     q{2q{2q{1X}}}
     q{2q{2q{2q{1X}}}}
     q{2q{2q{2q{2q{1X}}}}}

would be the valid matches. If we allow other alternates like

     Y = q{X | (A | B)}
     A = l{Y}
     B = ($|%)A

then things like

     q{X}
     q{l{q{X}}}
     q{l{q{l{q{X}}}}}

are valid (basically A is Y as before but with an l), while

     q{l{q{l{X}}}}

is not valid, because l{X} is not in the rules above; it must be of the form l{Y}, and Y always starts with a q{. The rule actually fails at the innermost l{X} because l{X} is not matched (no rule). And

     q{l{q{$l{q{X}}}}}

is valid; this is the rules applied in the form Y -> A -> B -> Y.

You can see that, with the proper rules, we can generate just about any form we want. The things to keep track of are: X must never contain tokens/matches that allow the other non-terminals to fail (unless that is the goal), and X must be a terminal. In the above example, if X could contain l, {, }, $, and/or %, then we have no way of knowing whether our tokens are really X or Y. e.g., with q{l{q{l{q{X}}}}}, X could just be l{q{l{q{X}}}} and we would terminate immediately. This would make parsing the nesting impossible. We can say that X must be "disjoint" from Y for the grammar to function as intended.

Second, you must list your rules in short-circuit order so that the terminal is always checked first. This is because the Y rules may actually contain X (they do in fact, and must, if the grammar is recursive). Hence, if we check the Y rules first, we will end up in a cycle, because we will never be able to determine that the Y rule is really an X, and X is what allows us to terminate the cycle in the first place.
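To make the short-circuit ordering concrete, here's a toy recursive matcher in Java for Y = q{ X | Y }, with the terminal X tried first. X here is a nonempty digit run, deliberately disjoint from Y (all of this is illustrative, not the actual D grammar):

```java
// Ordered choice, short-circuit style: try the terminal X first, and only
// fall through to the recursive Y when X does not match.
public class OrderedChoice {
    // returns the index just past the match, or -1 on failure
    static int matchY(String s, int pos) {
        if (!s.startsWith("q{", pos)) return -1;
        int p = pos + 2;
        int afterX = matchX(s, p);
        int afterInner = (afterX >= 0) ? afterX : matchY(s, p);  // X first, then Y
        if (afterInner < 0) return -1;
        if (afterInner < s.length() && s.charAt(afterInner) == '}') return afterInner + 1;
        return -1;
    }

    static int matchX(String s, int pos) {  // terminal: one or more digits
        int p = pos;
        while (p < s.length() && Character.isDigit(s.charAt(p))) p++;
        return (p > pos) ? p : -1;
    }

    public static boolean matches(String s) { return matchY(s, 0) == s.length(); }

    public static void main(String[] args) {
        System.out.println(matches("q{q{q{7}}}"));
    }
}
```

If X were instead allowed to match the letter q, the first alternative would swallow the start of a nested Y and the parse would fail; that is the disjointness requirement above in action.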
It is not really as difficult as it seems. It is just a way of thinking and expressing things that makes the difference. Remember that a grammar is effectively just a giant "if statement machine", and so similar logic holds. Just as we can do invalid things with if statements, we can do similar things with grammars. The grammars do exactly as we tell them. Maybe we want infinite cycles! Maybe we want them to terminate only after 150 loops, e.g.,

     Y = Y_(n<150) | X

which could be read: use the rule Y 150 times and then switch to rule X. (This is just a sort of spin-wait in the grammar.) e.g., if our grammar were expressing how enemies think in a video game, this could be seen as a sort of pause in their thinking. It could be used to draw a fractal, where we could have several parallel grammars working at the same time and this burns up cycles while other grammars do things. It is a contrived example, of course...

For programming languages, we do not want complexities that we can't wrap our minds around (since programming languages are the basis for building complex programs, that would make the complexity too difficult to handle in the long run). The rule to solve these problems is to use the short-circuit concept, as then you can list your rules in order from least complex to most. The least complex, if a terminal, will then always allow your rules to terminate once it is properly matched.

So your job is to construct your terminating string literal by knowing what makes the tokens not terminate... then simply make sure they can't be in your terminating string literal. Second, construct your non-terminating rules to be however you want the nesting to be.

This is a general procedure. Any "sequence" of things has these properties. A literal in a grammar is simply a terminal rule that allows all the other rules to terminate at some point. Without them, we can't build a structure that can terminate. e.g., lexing an integer:

     integer = Digit | (Digit (Digit | (Digit ...)))
     = Digit*

It is a recursive rule; it's just that the recursion has 0 depth in some sense, or we use meta-rules to keep it succinct (e.g., the *). Digit, a terminal, terminates the recursion. We could write it as

     integer  = Digit | DigitSeq
     DigitSeq = Digit | Digit DigitSeq

or

     integer  = Digit | DigitSeq
     DigitSeq = Digit | Digit integer

or

     integer = Digit | integer integer

(which, in this case, is simple because any integer can be seen as two integers concatenated, except for a single digit), but, of course,

     integer = integer | Digit

makes no sense, because it is effectively an infinite loop (we have to try to match some terminal to get somewhere).

Obviously things can get tricky, but programming languages are designed so that ambiguity and needless complexity are avoided. Programmers are in the business of programming to get a job done; if they burn needless cycles (like the spin loop) for no benefit, they are less efficient at their job. Hopefully that helps some.
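The integer example, with the recursion written out explicitly in Java (a hypothetical sketch; a Digit is consumed before every recursive step, so it must terminate):

```java
// DigitSeq = Digit | Digit DigitSeq, as explicit recursion. Because each
// call consumes one Digit before recursing, the rule always terminates;
// a rule like "integer = integer | Digit" would recurse before consuming
// anything and never make progress.
public class DigitSeq {
    static int matchDigitSeq(String s, int pos) {  // returns end index or -1
        if (pos >= s.length() || !Character.isDigit(s.charAt(pos))) return -1;
        int rest = matchDigitSeq(s, pos + 1);      // Digit DigitSeq
        return (rest >= 0) ? rest : pos + 1;       // or just Digit
    }

    public static boolean isInteger(String s) {
        return matchDigitSeq(s, 0) == s.length();
    }

    public static void main(String[] args) {
        System.out.println(isInteger("12345"));
        System.out.println(isInteger("12a45"));
    }
}
```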
Jan 23