digitalmars.D.learn - How to select the regex that matches the first token of a string?

vnr <cfcr gmail.com> writes:

Hello,

I am trying to write a small generic lexer that bases its token 
analysis on regular expressions. The idea is to define a table of 
token types, each with its corresponding regular expression. Here 
is the code I currently have:

```d
import std.regex;

/// A token: its type, the regex that recognizes it, and the matched text
struct Token
{
	/// The token type
	string type;
	/// The regex to match the token
	Regex!char re;
	/// The matched string
	string matched = null;
}

/// Function to find the right token in the given table
Token find(Token[] table, const(Captures!string delegate(Token) pure @safe) fn)
{
	foreach (token; table)
		if (fn(token)) return token;
	return Token("", regex(r""));
}

/// The lexer class
class Lexer
{
	private Token[] tokens;

	/// Builds a lexer from a list of already matched tokens
	this(Token[] tkns = [])
	{
		this.tokens = tkns;
	}


	override string toString() const
	{
		import std.algorithm : map;
		import std.conv : to;
		import std.format : format;

		return to!string(this.tokens.map!(tok =>
			format("(%s, %s)", tok.type, tok.matched)));
	}

	// Other useful methods ...
}

/// My token table
static Token[] table =
     [ Token("NUMBER", regex(r"(?:\d+(?:\.\d*)?|\.\d+)"))
     , Token("MINUS", regex(r"\-"))
     , Token("PLUS", regex(r"\+")) ];

/// Build a new lexer
Lexer lex(string text)
{
	Token[] result = [];

	while (text.length > 0)
	{
		Token token = table.find((Token t) => matchFirst(text, t.re));
		const string tmatch = matchFirst(text, token.re)[0];

		result ~= Token(token.type, token.re, tmatch);
		text = text[tmatch.length .. $];
	}
	return new Lexer(result);
}

void main()
{
     import std.stdio : writeln;

	const auto l = lex("3+2");
	writeln(l);
}

```

When I run this program, it gives the following sequence:

```
["(NUMBER, 3)", "(NUMBER, 2)", "(NUMBER, 2)"]
```

While I want this:

```
["(NUMBER, 3)", "(PLUS, +)", "(NUMBER, 2)"]
```

The problem seems to come from the `find` function, which returns 
the first regex in the table that matches anywhere in the text, 
not the regex that matches at the start of the text (I hope I am 
clear enough 😅).
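
The behavior described above can be reproduced in isolation. The following is a minimal sketch, not part of the program above; the names `number` and `anchored` are only illustrative. It relies on `matchFirst` searching the whole input for a match rather than only its beginning, and on the `pre`, `hit`, and `empty` properties of `Captures`:

```d
import std.regex;
import std.stdio : writeln;

void main()
{
	auto number = regex(r"\d+");    // unanchored pattern
	auto anchored = regex(r"^\d+"); // same pattern, anchored at the start

	// Unanchored: the match is found after skipping the leading "+".
	auto m = matchFirst("+2", number);
	writeln(m.hit); // prints 2 (the matched text)
	writeln(m.pre); // prints + (the text skipped before the match)

	// Anchored: no match, because the input does not start with a digit.
	writeln(matchFirst("+2", anchored).empty); // prints true
}
```

This is why, once the remaining text is `"+2"`, the NUMBER pattern still reports a match: it matches the `2` and the `+` is silently skipped.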

I'm not used to working with regexes, especially in D, so I'm not 
sure how to approach a solution to this problem.

Thank you in advance for your help.

Jul 03 2021

user1234 <user1234 12.de> writes:

On Saturday, 3 July 2021 at 09:05:28 UTC, vnr wrote:
> Hello,
>
> I am trying to make a small generic lexer that bases its token 
> analysis on regular expressions. The principle I have in mind 
> is to define a token type table with its corresponding regular 
> expression, here is the code I currently have:
>
> [...]

Storing the regex in a token is an antipattern.

Jul 03 2021

vnr <cfcr gmail.com> writes:

On Saturday, 3 July 2021 at 09:28:32 UTC, user1234 wrote:
> On Saturday, 3 July 2021 at 09:05:28 UTC, vnr wrote:
>> Hello,
>>
>> I am trying to make a small generic lexer that bases its token 
>> analysis on regular expressions. The principle I have in mind 
>> is to define a token type table with its corresponding regular 
>> expression, here is the code I currently have:
>>
>> [...]
>
> storing the regex in a token is an antipattern.

Thank you for the answer,

I know it's not clean. I'll modify my code to define one table of 
token types with their regular expressions, and a second table of 
tokens holding what was matched; the former defines the lexer, and 
the latter is the result of running it.

But for now, to keep it simple, I did everything in one struct.
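
One possible shape for the split described above, offered as a sketch under assumptions rather than a definitive implementation: `Rule`, `Lexeme`, and `tokenize` are hypothetical names that do not appear in the original code. Each pattern is anchored with `^`, and a match is accepted only when it is non-empty and starts at the beginning of the remaining text:

```d
import std.regex;
import std.stdio : writeln;

/// A lexer rule (hypothetical name): a token type and its pattern.
struct Rule
{
	string type;
	Regex!char re;
}

/// A lexed token (hypothetical name): its type and the matched text.
struct Lexeme
{
	string type;
	string text;
}

/// Splits `text` into lexemes, trying the rules in table order.
Lexeme[] tokenize(string text, Rule[] rules)
{
	Lexeme[] result;
	while (text.length > 0)
	{
		bool found = false;
		foreach (rule; rules)
		{
			auto m = matchFirst(text, rule.re);
			// Accept only a non-empty match that begins at offset 0.
			if (!m.empty && m.pre.length == 0 && m.hit.length > 0)
			{
				result ~= Lexeme(rule.type, m.hit);
				text = m.post; // continue after the matched text
				found = true;
				break;
			}
		}
		// No rule applies: stop instead of looping forever.
		if (!found)
			throw new Exception("Unexpected input: " ~ text);
	}
	return result;
}

void main()
{
	auto rules = [
		Rule("NUMBER", regex(r"^(?:\d+(?:\.\d*)?|\.\d+)")),
		Rule("MINUS", regex(r"^-")),
		Rule("PLUS", regex(r"^\+")),
	];
	foreach (lexeme; tokenize("3+2", rules))
		writeln(lexeme.type, " ", lexeme.text);
}
```

With the input `"3+2"`, this should produce a NUMBER, a PLUS, and a NUMBER lexeme, in that order. Checking `m.pre.length == 0` is redundant once every pattern starts with `^`, but it guards against a rule that was added without the anchor.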

Jul 03 2021
