digitalmars.D.learn - How to select the regex that matches the first token of a string?

vnr <cfcr gmail.com> writes:
Hello,

I am trying to make a small generic lexer that bases its token 
analysis on regular expressions. The principle I have in mind is 
to define a table of token types, each with its corresponding 
regular expression. Here is the code I currently have:

```d
import std.regex;

/// ditto
struct Token
{
	/// The token type
	string type;
	/// The regex to match the token
	Regex!char re;
	/// The matched string
	string matched = null;
}

/// Function to find the right token in the given table
Token find(Token[] table,
	const(Captures!string delegate(Token) pure @safe) fn)
{
	foreach (token; table)
		if (fn(token)) return token;
	return Token("", regex(r""));
}

/// The lexer class
class Lexer
{
	private Token[] tokens;

     /// ditto
	this(Token[] tkns = [])
	{
		this.tokens = tkns;
	}


	override string toString() const
	{
		import std.algorithm : map;
		import std.conv : to;
		import std.format : format;

		return to!string(this.tokens.map!(tok =>
			format("(%s, %s)", tok.type, tok.matched)));
	}

	// Other useful methods ...
}

/// My token table
static Token[] table =
     [ Token("NUMBER", regex(r"(?:\d+(?:\.\d*)?|\.\d+)"))
     , Token("MINUS", regex(r"\-"))
     , Token("PLUS", regex(r"\+")) ];

/// Build a new lexer
Lexer lex(string text)
{
	Token[] result = [];

	while (text.length > 0)
	{
		Token token = table.find((Token t) => matchFirst(text, t.re));
		const string tmatch = matchFirst(text, token.re)[0];

		result ~= Token(token.type, token.re, tmatch);
		text = text[tmatch.length .. $];
	}
	return new Lexer(result);
}

void main()
{
     import std.stdio : writeln;

	const auto l = lex("3+2");
	writeln(l);
}

```

When I run this program, it gives the following sequence:

```
["(NUMBER, 3)", "(NUMBER, 2)", "(NUMBER, 2)"]
```

While I want this:

```
["(NUMBER, 3)", "(PLUS, +)", "(NUMBER, 2)"]
```

The problem seems to come from the `find` function, which returns 
the first regex in the table that matches anywhere in the text, 
and not the regex that matches at the very start of the text 
(I hope I am clear enough 😅).
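
The behaviour described above can be reproduced in isolation. The following is a minimal sketch, assuming the second loop iteration of `lex("3+2")`, where the remaining text is `"+2"`. `matchOffset` is a hypothetical helper introduced only for this illustration; it relies on `Captures.pre`, the slice of input that precedes the match.

```d
import std.regex;

/// Hypothetical helper: the offset at which `re` first matches in `text`,
/// or -1 when it does not match at all.
long matchOffset(string text, Regex!char re)
{
    auto m = matchFirst(text, re);
    return m.empty ? -1 : cast(long) m.pre.length;
}

void main()
{
    auto number = regex(r"(?:\d+(?:\.\d*)?|\.\d+)");
    auto plus = regex(r"\+");

    // matchFirst searches the whole input, so NUMBER still matches "+2",
    // but one character into the text ...
    assert(matchOffset("+2", number) == 1);
    assert(matchFirst("+2", number).hit == "2");
    // ... while PLUS is the pattern that matches at offset 0.
    assert(matchOffset("+2", plus) == 0);
}
```

Because `find` tests NUMBER first, it returns NUMBER for `"+2"`, and the loop then drops one character from the front of the text even though the matched string `"2"` did not start there.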

I'm not used to working with regexes, especially in D, so I'm not 
sure how to approach a solution to this problem.

I thank you in advance for your help.
Jul 03
user1234 <user1234 12.de> writes:
On Saturday, 3 July 2021 at 09:05:28 UTC, vnr wrote:
> Hello,
>
> I am trying to make a small generic lexer that bases its token 
> analysis on regular expressions. The principle I have in mind 
> is to define a token type table with its corresponding regular 
> expression, here is the code I currently have:
>
> [...]

Storing the regex in a token is an antipattern.
Jul 03
vnr <cfcr gmail.com> writes:
On Saturday, 3 July 2021 at 09:28:32 UTC, user1234 wrote:
> On Saturday, 3 July 2021 at 09:05:28 UTC, vnr wrote:
>> Hello,
>>
>> I am trying to make a small generic lexer that bases its token 
>> analysis on regular expressions. The principle I have in mind 
>> is to define a token type table with its corresponding regular 
>> expression, here is the code I currently have:
>>
>> [...]
>
> Storing the regex in a token is an antipattern.

Thank you for the answer. I know it's not clean. I'll modify my code to define one table of token types with their regular expressions, and a separate list of tokens holding what was matched; the former defines the lexer, the latter is the result of running it. But for now, to keep it simple, I did everything in one.
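
The split described above could look roughly like the following. This is a minimal sketch and not a solution confirmed in the thread: `TokenType`, the `tokenType` helper, and the `^(?:...)` anchoring are assumptions introduced for illustration. Anchoring each pattern with `^` makes `matchFirst` accept only a match at the start of the remaining text, and the explicit error avoids looping forever when nothing matches.

```d
import std.regex;

/// Lexer definition: a token type and the pattern that recognises it.
struct TokenType
{
    string name;
    Regex!char re;
}

/// Lexer result: which type matched and the text it consumed.
struct Token
{
    string type;
    string matched;
}

/// Hypothetical helper: wrap the pattern so it only matches at the
/// start of the input.
TokenType tokenType(string name, string pattern)
{
    return TokenType(name, regex("^(?:" ~ pattern ~ ")"));
}

Token[] lex(string text, TokenType[] table)
{
    Token[] result;
    while (text.length > 0)
    {
        bool found = false;
        foreach (tt; table)
        {
            auto m = matchFirst(text, tt.re);
            // Require a non-empty match so the loop always makes progress.
            if (!m.empty && m.hit.length > 0)
            {
                result ~= Token(tt.name, m.hit);
                text = m.post; // continue after the consumed text
                found = true;
                break;
            }
        }
        if (!found)
            throw new Exception("no token matches at: " ~ text);
    }
    return result;
}

void main()
{
    auto table = [ tokenType("NUMBER", r"\d+(?:\.\d*)?|\.\d+"),
                   tokenType("MINUS", r"-"),
                   tokenType("PLUS", r"\+") ];

    assert(lex("3+2", table) ==
        [ Token("NUMBER", "3"), Token("PLUS", "+"), Token("NUMBER", "2") ]);
}
```

With this layout the order of the table only matters when two patterns both match at the start of the text, in which case the first entry wins.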
Jul 03