
digitalmars.D - [OT] Partial-reparsing theory?

reply "Nick Sabalausky" <a a.a> writes:
Is anyone aware of any (formal or informal) 
theory/information/research/articles/etc. that's already been done on the 
idea of reparsing *part* of an already-parsed source? Either lexing, parsing 
or both together. Preferably LR, but LL might be helpful, too. (I'd imagine 
it may be notably more difficult with LR. May need to keep track of all the 
parsing steps taken the first time around, and may need to continue 
re-parsing to the end of the source.)

I know code editors deal with that sort of thing a lot, but my understanding 
is they often accept a certain amount of inaccuracy (for the sake of being 
more line-oriented) and also tend to be more lexing-oriented and very light 
on parsing.
Apr 14 2011
parent reply spir <denis.spir gmail.com> writes:
On 04/15/2011 12:51 AM, Nick Sabalausky wrote:
 Is anyone aware of any (formal or informal)
 theory/information/research/articles/etc. that's already been done on the
 idea of reparsing *part* of an already-parsed source? Either lexing, parsing
 or both together. Preferably LR, but LL might be helpful, too. (I'd imagine
 it may be notably more difficult with LR. May need to keep track of all the
 parsing steps taken the first time around, and may need to continue
 re-parsing to the end of the source.)

 I know code editors deal with that sort of thing a lot, but my understanding
 is they often accept a certain amount of inaccuracy (for the sake of being
 more line-oriented) and also tend to be more lexing-oriented and very light
 on parsing.
This is a very interesting question. I guess it may have to play with "diffs", as a required tool to identify differences relative to the original source. The result may be a set of intervals (each a pair of indices) into both the original and modified sources. Then, if nodes keep track of indices into the original source, one may be able to identify the one node covering (all of) a given piece of diff. This would be what needs to be reparsed. If this succeeds, then one can pass to the next diff, with an adjustment/offset of indices which should normally already be given by the diff tool.

Well, this superficial view may be totally wrong, and/or biased toward PEG parsing, which I'm most familiar with. I guess things may well become more complicated with the common approach of two separate passes of lexing + parsing; at the least, you'd need offset adjustment successively on character- and then on lexeme-indices.

There may also be some similarity with the (apparently) simpler problem of error recovery: restart parsing "somewhere" after a match failure.

Denis
--
_________________
vita es estrany
spir.wikidot.com
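The interval lookup described above could be sketched like this (a toy Python sketch rather than D; the Node type, spans, and tree are invented for illustration — a real implementation would hang these indices off the actual parse tree):

```python
# Sketch: given per-node source spans, find the smallest parse-tree node
# that covers an edited interval [lo, hi).

class Node:
    def __init__(self, kind, start, end, children=()):
        self.kind = kind          # nonterminal name
        self.start = start        # index of first covered character
        self.end = end            # one past the last covered character
        self.children = list(children)

def covering_node(node, lo, hi):
    """Descend to the smallest node whose [start, end) span contains [lo, hi)."""
    for child in node.children:
        if child.start <= lo and hi <= child.end:
            return covering_node(child, lo, hi)
    return node

# Tiny tree for "n = i;":  assignment -> name, expression
tree = Node("assignment", 0, 6, [
    Node("name", 0, 1),
    Node("expression", 4, 5),
])

# An edit inside the right-hand side lands on the expression node:
print(covering_node(tree, 4, 5).kind)   # expression
```

If the edit straddles two siblings, the descent simply stops at their common ancestor, which is then the unit to reparse.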
Apr 14 2011
parent reply "Nick Sabalausky" <a a.a> writes:
"spir" <denis.spir gmail.com> wrote in message 
news:mailman.3527.1302824019.4748.digitalmars-d puremagic.com...
 On 04/15/2011 12:51 AM, Nick Sabalausky wrote:
 Is anyone aware of any (formal or informal)
 theory/information/research/articles/etc. that's already been done on the
 idea of reparsing *part* of an already-parsed source? Either lexing, 
 parsing
 or both together. Preferably LR, but LL might be helpful, too. (I'd 
 imagine
 it may be notably more difficult with LR. May need to keep track of all 
 the
 parsing steps taken the first time around, and may need to continue
 re-parsing to the end of the source.)

 I know code editors deal with that sort of thing a lot, but my 
 understanding
 is they often accept a certain amount of inaccuracy (for the sake of 
 being
 more line-oriented) and also tend to be more lexing-oriented and very 
 light
 on parsing.
This is a very interesting question. I guess it may have to play with "diffs", as a required tool to identify differences relative to the original source. The result may be a set of intervals (each a pair of indices) into both the original and modified sources. Then, if nodes keep track of indices into the original source, one may be able to identify the one node covering (all of) a given piece of diff. This would be what needs to be reparsed. If this succeeds, then one can pass to the next diff, with an adjustment/offset of indices which should normally already be given by the diff tool.

Well, this superficial view may be totally wrong, and/or biased toward PEG parsing, which I'm most familiar with. I guess things may well become more complicated with the common approach of two separate passes of lexing + parsing; at the least, you'd need offset adjustment successively on character- and then on lexeme-indices.

There may also be some similarity with the (apparently) simpler problem of error recovery: restart parsing "somewhere" after a match failure.
Yea. My thought was to keep note, for each node, of the start/end indices into the source. When a chunk of text is changed, the parse tree (or list of tokens if it's lex-only) gets notified about what got changed and where (maybe the parse tree would even *be* the primary internal representation of the source). Then it can check through the tree and know what needs to be re-lexed/re-parsed.

But I guess what I'm more interested in is the mechanics of actually doing that re-parsing once you know what sub-tree (or sub-trees) you need to re-parse. It may seem simple: just run the lexer/parser on that "dirty" subset of the code (using the root of the affected sub-tree as the "starting nonterminal" instead of the grammar's usual one) and replace the "dirty" subtrees with the fresh new ones (and then update the start/end indices of the rest of the tree, perhaps lazily).

But I suspect there may be some notable gotchas, especially relating to either LR parsing (since it's bottom-up, but you're trying to fit it into a known "top" - then again, I guess that's normally true of LR anyway) or just what to do with the nodes that come from after the affected part of the source (it could probably cause some sort of cascade, or at least a partial cascade). And maybe certain sub-trees of the affected sub-trees could be skipped. Etc.

I'd imagine lex-only would be a lot easier.
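The "reparse from the subtree's nonterminal, splice, then shift later spans" step could be sketched like this (hypothetical Python; the grammar `expr ::= atom ('+' atom)*`, the Node type, and all names are invented stand-ins for a real parser run with an alternate start symbol):

```python
# Sketch: once the "dirty" subtree is known, run the parser with that node's
# nonterminal as the start symbol, splice the fresh subtree in, and shift the
# spans of everything after the edit point by the size delta.

import re

class Node:
    def __init__(self, kind, start, end, children=()):
        self.kind, self.start, self.end = kind, start, end
        self.children = list(children)

def parse_atom(src, pos):
    """atom ::= identifier | number, skipping leading whitespace."""
    m = re.compile(r"\s*(\w+)").match(src, pos)
    return Node("atom", m.start(1), m.end(1))

def parse_expr(src, pos):
    """Recursive descent for expr ::= atom ('+' atom)*, starting at pos."""
    left = parse_atom(src, pos)
    while src[left.end:left.end + 3].strip().startswith("+"):
        plus = src.index("+", left.end)
        right = parse_atom(src, plus + 1)
        left = Node("add", left.start, right.end, [left, right])
    return left

def shift(node, delta):
    """Lazily-applicable span adjustment for nodes after the edit."""
    node.start += delta
    node.end += delta
    for c in node.children:
        shift(c, delta)

# "n = i;" edited to "n = i + 1;": re-parse only the right-hand side,
# using the dirty node's own nonterminal (expr) as the start symbol.
new_src = "n = i + 1;"
rhs = parse_expr(new_src, 4)
print(rhs.kind, rhs.start, rhs.end)     # add 4 9

# Nodes after the edit point just get their spans shifted by the size delta:
semicolon = Node("punct", 5, 6)         # the ';' in the old source "n = i;"
shift(semicolon, len("i + 1") - len("i"))
print(semicolon.start, semicolon.end)   # 9 10
```

This sidesteps the LR question entirely (it is top-down); for an LR parser the analogous trick is usually to seed the parse stack with the state the original parse was in at the subtree's left edge.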
Apr 14 2011
next sibling parent spir <denis.spir gmail.com> writes:
On 04/15/2011 02:40 AM, Nick Sabalausky wrote:
 "spir"<denis.spir gmail.com>  wrote in message
 news:mailman.3527.1302824019.4748.digitalmars-d puremagic.com...
 On 04/15/2011 12:51 AM, Nick Sabalausky wrote:
 Is anyone aware of any (formal or informal)
 theory/information/research/articles/etc. that's already been done on the
 idea of reparsing *part* of an already-parsed source? Either lexing,
 parsing
 or both together. Preferably LR, but LL might be helpful, too. (I'd
 imagine
 it may be notably more difficult with LR. May need to keep track of all
 the
 parsing steps taken the first time around, and may need to continue
 re-parsing to the end of the source.)

 I know code editors deal with that sort of thing a lot, but my
 understanding
 is they often accept a certain amount of inaccuracy (for the sake of
 being
 more line-oriented) and also tend to be more lexing-oriented and very
 light
 on parsing.
This is a very interesting question. I guess it may have to play with "diffs", as a required tool to identify differences relative to the original source. The result may be a set of intervals (each a pair of indices) into both the original and modified sources. Then, if nodes keep track of indices into the original source, one may be able to identify the one node covering (all of) a given piece of diff. This would be what needs to be reparsed. If this succeeds, then one can pass to the next diff, with an adjustment/offset of indices which should normally already be given by the diff tool.

Well, this superficial view may be totally wrong, and/or biased toward PEG parsing, which I'm most familiar with. I guess things may well become more complicated with the common approach of two separate passes of lexing + parsing; at the least, you'd need offset adjustment successively on character- and then on lexeme-indices.

There may also be some similarity with the (apparently) simpler problem of error recovery: restart parsing "somewhere" after a match failure.
Yea. My thought was to keep note, for each node, of the start/end indices into the source. When a chunk of text is changed, the parse tree (or list of tokens if it's lex-only) gets notified about what got changed and where (maybe the parse tree would even *be* the primary internal representation of the source). Then it can check through the tree and know what needs to be re-lexed/re-parsed.

But I guess what I'm more interested in is the mechanics of actually doing that re-parsing once you know what sub-tree (or sub-trees) you need to re-parse. It may seem simple: just run the lexer/parser on that "dirty" subset of the code (using the root of the affected sub-tree as the "starting nonterminal" instead of the grammar's usual one) and replace the "dirty" subtrees with the fresh new ones (and then update the start/end indices of the rest of the tree, perhaps lazily).

But I suspect there may be some notable gotchas, especially relating to either LR parsing (since it's bottom-up, but you're trying to fit it into a known "top" - then again, I guess that's normally true of LR anyway) or just what to do with the nodes that come from after the affected part of the source (it could probably cause some sort of cascade, or at least a partial cascade). And maybe certain sub-trees of the affected sub-trees could be skipped. Etc.
I guess it's indeed a bit more complicated in practice:

    n = i;   ==>   n = i + 1;

Here the affected node is the one representing 'i', which would be something like SymbolLookup("i"). You cannot reparse it as is, indeed. You'd have to restart "viewing" the string from a higher level, namely the assignment, whose pattern may be like:

    assignment ::= name '=' expression

The old node to be replaced starts where an expression is expected, and that is the pattern to be re-matched, not the pattern for a symbol. This time you'll get e.g. Addition(SymbolLookup("i"), IntegerLiteral("1")).
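Concretely: re-matching the edited text at the *expression* rule, rather than at the old SymbolLookup rule, produces the new, larger node. A toy Python stand-in for a PEG-style expression rule (node names follow the post; the matcher itself is invented):

```python
# Sketch: re-match "i + 1" with the expression rule, not the symbol rule.
# Toy rule: expression ::= symbol (('+') (symbol | integer))*

import re

def make_leaf(tok):
    if tok.isdigit():
        return ("IntegerLiteral", tok)
    return ("SymbolLookup", tok)

def match_expression(text):
    """Fold a token stream like [i, +, 1] into left-nested Addition nodes."""
    tokens = re.findall(r"\w+|\+", text)
    node = make_leaf(tokens[0])
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        assert op == "+"
        node = ("Addition", node, make_leaf(operand))
    return node

# Old right-hand side "i", re-matched as an expression after the edit:
print(match_expression("i + 1"))
# ('Addition', ('SymbolLookup', 'i'), ('IntegerLiteral', '1'))
```

The same rule applied to the unedited text just reproduces the old node, which is what makes the "re-match at the expected rule" strategy safe.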
 I'd imagine lex-only would be a lot easier.
Sure. But I guess from the point of view of plain reparsing mechanics, as you say, a single-pass parsing engine like PEG is not much more complicated. You may study & prototype (1) lex-only, (2) single-pass parsing, (3) two-pass parsing.

If ever you're interested, I have a PEG engine in D and for D (working, not polished). I have added to its TODO list:

STUDY:
* diff parsing !!! (would use node start/end indices)

Denis
--
_________________
vita es estrany
spir.wikidot.com
Apr 15 2011
prev sibling parent Rainer Schuetze <r.sagitario gmx.de> writes:
Nick Sabalausky wrote:
 "spir" <denis.spir gmail.com> wrote in message 
 news:mailman.3527.1302824019.4748.digitalmars-d puremagic.com...
 On 04/15/2011 12:51 AM, Nick Sabalausky wrote:
 Is anyone aware of any (formal or informal)
 theory/information/research/articles/etc. that's already been done on the
 idea of reparsing *part* of an already-parsed source? Either lexing, 
 parsing
 or both together. Preferably LR, but LL might be helpful, too. (I'd 
 imagine
 it may be notably more difficult with LR. May need to keep track of all 
 the
 parsing steps taken the first time around, and may need to continue
 re-parsing to the end of the source.)
I'd imagine lex-only would be a lot easier.
Sure it is. Visual D uses the Visual Studio approach for highlighting, keeping a 32-bit integer per line that records the state of the lexer. This state keeps track of token type, nesting level, etc. The state is calculated lazily, so you only have to scan the file up to the last visible line.

In addition, Visual D has a very simple parser to highlight version/debug conditionals, mostly just meant to figure out which conditional an "else" belongs to (simpleparser.d). It keeps track of the text span covered by each parse tree node. If the text is modified, all parse tree nodes after the editing point are discarded, while the nodes spanning the edit point are restored to a state containing only the nodes before the span. The parser stack is reconstructed from the remaining parse tree, and parsing can restart from a point very close to the actual editing point. As this parser only runs up to the last visible line, it has only a very small impact on performance and runs together with the lexer while the lines are drawn.

The next version of Visual D will have a D2 parser (currently only used to highlight syntax errors), and I was hoping that a similar approach would be possible, but the state to keep is way more complicated. So for now it runs in a background thread, parsing the complete source.
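The per-line lexer-state scheme described above can be sketched as follows (hypothetical Python, not the actual Visual D code; here the "state" is just the D-style `/+ ... +/` comment nesting depth, whereas the real integer also packs token type, string state, etc.):

```python
# Sketch: keep one lexer state per line; after an edit, re-scan from the
# edited line until the newly computed end-of-line state matches the stored
# one, at which point every later line is known to be unaffected.

def lex_line(line, depth):
    """Scan one line for nestable /+ +/ comments; return end-of-line depth."""
    i = 0
    while i < len(line) - 1:
        pair = line[i:i + 2]
        if pair == "/+":
            depth += 1; i += 2
        elif pair == "+/" and depth > 0:
            depth -= 1; i += 2
        else:
            i += 1
    return depth

def relex_from(lines, states, first_dirty):
    """Re-scan from the edited line; stop as soon as states converge."""
    depth = states[first_dirty - 1] if first_dirty > 0 else 0
    for i in range(first_dirty, len(lines)):
        depth = lex_line(lines[i], depth)
        if i > first_dirty and states[i] == depth:
            break                     # state converged; the rest is still valid
        states[i] = depth
    return states

lines = ["int x;", "/+ note", "still comment +/", "int y;"]
states = [0, 1, 0, 0]                 # stored end-of-line comment depth
lines[1] = "// note"                  # edit removes the comment opener
print(relex_from(lines, states, 1))   # [0, 0, 0, 0]
```

The convergence test is what keeps the cost proportional to the size of the edit's effect rather than to the file length, and the same trick works with any state that fits in the per-line integer.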
Apr 15 2011