digitalmars.D.learn - two design questions

Hello D-istos,


I am currently implementing a kind of lexing toolkit; it is the first time I do 
that. Below are design questions on the topic. I would also like to know whether 
you think such a module would be useful for the community of D programmers, and 
what advantages it would have, given that D can link directly to C lexers like 
flex (I have some ideas on the question, indeed).


1. Lexeme types

Lexeme types defined by client code need to carry at least 2 pieces of
information:
* a code representing the type
* a regex format (string)

If I decide type codes to be strings, then we get a very nice format in source 
for "morphologies":
     string[2][] morphology = [
         [ "SPC" ,       `[\ \t\n]*` ],
         [ "ASSIGN" ,    `=` ],
         [ "integer" ,   `[\+\-]?[0-9]+` ],
         ...
     ];
A side advantage being that writing out a morphology or a single lexeme type 
brings a meaningful name (instead of a clueless nominal number: 
http://en.wikipedia.org/wiki/Nominal_number).
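
A minimal sketch of how such a string-coded morphology might drive a lexer, assuming std.regex and a first-match-wins policy. The names Lexeme and tokenize are hypothetical and not part of the toolkit described here:

```d
import std.conv : to;
import std.regex : Regex, matchFirst, regex;

// Hypothetical result type: the type code plus the matched slice.
struct Lexeme
{
    string type;   // code taken from the morphology table
    string slice;  // slice of the source text, not a copy
}

// Hypothetical helper: tries each format in table order at the current
// position and keeps the first one that matches a non-empty slice.
Lexeme[] tokenize(string source, string[2][] morphology)
{
    // Compile each format once, anchored at the start of the remaining input.
    Regex!char[] patterns;
    foreach (entry; morphology)
        patterns ~= regex("^(?:" ~ entry[1] ~ ")");

    Lexeme[] lexemes;
    size_t pos = 0;
    while (pos < source.length)
    {
        bool matched = false;
        foreach (i, pattern; patterns)
        {
            auto m = matchFirst(source[pos .. $], pattern);
            // Empty matches are ignored so the loop always advances.
            if (!m.empty && m.hit.length > 0)
            {
                lexemes ~= Lexeme(morphology[i][0], m.hit);
                pos += m.hit.length;
                matched = true;
                break;
            }
        }
        if (!matched)
            throw new Exception("no lexeme type matches at offset " ~ to!string(pos));
    }
    return lexemes;
}
```

Since the type code is already a string, printing a Lexeme shows a readable name with no extra lookup, which is the side advantage mentioned above.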

But using strings as type codes is obviously needless overhead from the 
strict point of view of functionality; codes just need to be unique, thus a 
plain enum of uints or even ubytes used as nominals is a correct choice.
If I choose uint codes, then lexeme types must be structs (or else tuples, but 
they're worse). In this case, I can take the opportunity to add a mode 
field, which would give e.g.:
     LexemeType[] morphology = [
         LexemeType( "SPC" ,       `[\ \t\n]*` ,         SKIP ),
         LexemeType( "ASSIGN" ,    `=` ,                 MARK ),
         LexemeType( "integer" ,   `[\+\-]?[0-9]+` ,     DATA ),
         ...
     ];
Far more annoying to write, ain't it?
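
For comparison, a minimal sketch of the struct variant with enum codes, assuming ubyte-sized nominals. The names Code, Mode, sampleMorphology and nameOf are hypothetical. The enum member name can still be recovered with std.conv.to, so the readable name is not entirely lost:

```d
import std.conv : to;

// Hypothetical modes, named after SKIP / MARK / DATA above.
enum Mode : ubyte { SKIP, MARK, DATA }

// Hypothetical nominal codes: unique small integers instead of strings.
enum Code : ubyte { SPC, ASSIGN, integer }

struct LexemeType
{
    Code   code;    // nominal type code
    string format;  // regex format
    Mode   mode;    // what the lexer should do with a match
}

// A sample table; `with` lets the rows use unqualified enum members.
LexemeType[] sampleMorphology()
{
    with (Code) with (Mode)
    {
        return [
            LexemeType(SPC,     `[ \t\n]+`,      SKIP),
            LexemeType(ASSIGN,  `=`,             MARK),
            LexemeType(integer, `[\+\-]?[0-9]+`, DATA),
        ];
    }
}

// Recover a printable name from the nominal code.
string nameOf(LexemeType type)
{
    return to!string(type.code);
}
```

The cost is that every code has to be declared in the enum before it can appear in the table, which the string[2] form avoids.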

Also, a 'mode' field is nearly useless as of now:
(1) for MARKs, I cannot avoid reading the slice yet anyway (see above), so 
why not store it, since there is no (additional) copy?
(2) for SKIP'ped lexemes, I have a practical alternative allowing the parser to 
skip optional and non-significant tokens (still a bit stupid to record tokens 
just to ignore them later, but...)


2. Match actions

I do not have any match action system yet. Actually, a 'mode' field would 
implement a kind of very special predefined action. Is more really needed? 
Typically, in my experience of parsing, useful match actions happen at a higher 
level, namely at parsing rather than lexing time:
* Structure the AST, e.g. discard MARK tokens or flatten lists.
* Handle data, e.g. convert numbers or drop '"' from strings.
Structural actions can only be handled by the parser, I guess, while operations 
on data are nicely placed in dedicated Node type constructors.
What kinds of typical actions would really be useful for client code at lexing 
time, especially ones allowing parser simplification (other than handling SKIP 
tokens)?
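
To illustrate the point about data handling in Node constructors, a minimal sketch assuming each node receives the matched slice. The class names IntegerNode and StringNode are hypothetical:

```d
import std.conv : to;

// Hypothetical node for integer lexemes: conversion happens on construction.
class IntegerNode
{
    long value;

    this(string slice)
    {
        // std.conv.to throws a ConvException if the slice is not a valid integer.
        value = to!long(slice);
    }
}

// Hypothetical node for string lexemes: the delimiters are dropped on construction.
class StringNode
{
    string value;

    this(string slice)
    {
        // Assumes the lexer only hands over slices of the form "...".
        assert(slice.length >= 2 && slice[0] == '"' && slice[$ - 1] == '"');
        value = slice[1 .. $ - 1];  // still a slice of the source, no copy
    }
}
```

With nodes like these, the lexer itself needs no per-match action for data; it only has to hand the slice on.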


External points of view warmly welcome :-)

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Feb 06 2011