www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - exercise on range: lexeme stream

[sorry for double posting, first sent this to D's general list by error]


I have a working lexing toolkit that allows defining a lexer from a language's 
"morphology", then let it scan sources. The result is a 'LexemeStream' struct, 
which just wraps an array of Lexeme's; for simplicity, each lexeme holds a 
string tag identifying the kind of lexeme, and the matched slice. Simple

         Morphology morphology = [
             [ "SPACING" ,        `[\ \t]*` ],
             [ "OPEN_GROUP" ,     `\(` ],
             [ "CLOSE_GROUP" ,    `\)` ],
             [ "number" ,         `[\+\-]?[0-9]+(\.[0-9]+)?` ],
             [ "symbol" ,         `[a-zA-A][a-zA-A0-9]*` ],
             [ "operator" ,       `[\+\-\*\/]` ],
         auto lexer = new Lexer(morphology);
         // LexemeStream struct:
         auto lexemes = lexer.lexemes(source);

LexemeStream mainly provides a single operational method 'match' that works 
more or less like D's builtin 'in' operator:
     Lexeme* match (string tag)
It returns a pointer to the current lexeme if said lexeme is of the right kind, 
else null. LexemeStream maintains its own cursor (an ordinal index), so that a 
client parser does not need to update and pass around indices. Very practical. 
It also exposes an (automagically maintained) 'empty' boolean member. For 
instance, matching a number may look like:
     NumberNode readNumber (LexemeStream lexemes) { // no index!
         auto lexeme = lexemes.match("number");
         if (!lexeme)
             return null;
         return new NumberNode(lexeme.slice);
         // cursor has advanced for further parsing

 From this point of view, LexemeStream behaves similarly to an input range. But 
match does both popFront and front, and even also maintains empty in the 
background. I wonder about the best way to rewrite LexemeStream into a regular 
range. (*) (**)

Random thoughts:
I wonder in fact whether LexemeStream matches (no pun intended ;-) the general 
notion of range, or instead is a different sort of beast; if the latter, then
The key point, maybe, is that the stream advances /on conditon/. In other 
words, we do not just traverse it like for a regular iteration. We may need a 
parameter to popFront holding said conditon (here, the 'tag'). Then, front 
could return null (or whatever flag value) when the condition is not fulfilled, 
in which case the range does not move forward at all.
The above func would then look like:

     NumberNode readNumber (LexemeStream lexemes) { // no index!
         auto lexeme = lexemes.front();
         if (!lexeme)
             return null;
         return new NumberNode(lexeme.slice);

Another solution may be for popFront to return a boolean flag; then, a 
gentleman agreement is to read front only in positive (true) case:

     NumberNode readNumber (LexemeStream lexemes) { // no index!
         match = lexemes.popFront("number");
         if (!match)
             return null;
         auto lexeme = lexemes.front();
         return new NumberNode(lexeme.slice);


(*) A side issue (easily solvable) is that there must be a cursor. Meaning, 
LexemeStream cannot just shrink its internal array when matching. The reason is 
combinations of composite and choice patterns may require backtracking in 
lexeme stream:
     T readAssignment (LexemeStream lexemes) {
         auto cursor = lexemes.cursor    // store initial cursor
         // ... try & match target name ...
         // ... try & match '=' ...
         if (!lexeme) {
             lexemes.cursor = cursor0;   // restore intiial cursor
             return null;
         // ... try & match value expression ...
     T readStatement (LexemeStream lexemes) {
         // read assignment|func-call|control-statement|...
A solution is to make popFront advance the cursor instead of eating the array, 
front return array[cursor] instead of array[0], and empty == (cursor >= 
array.length) instead of (array.length == 0).

(**) I cannot really do it as of now anyway, because of the bug (in formatValue 
template constraints) preventing a range to define toString, but am still 
highly interested in the exercise --if only for the future.
vita es estrany

vita es estrany
Feb 04 2011