digitalmars.D.learn - Writing a really fast lexer

vnr (15/15) Dec 11 2020 For a project with good performance, I would need to be able to

H. S. Teoh (37/52) Dec 11 2020 If you want a *really* fast lexer, I recommend using GNU Flex

vnr (4/13) Dec 12 2020 Thank you for this answer :)

Bastiaan Veelo (6/17) Dec 12 2020 Have you looked at Pegged [1]? It will give you the lexer and

vnr (6/23) Dec 12 2020 Yes, I know Pegged, it's a really interesting parser generator

Bastiaan Veelo (6/19) Dec 12 2020 Are you able to share the grammar? Since you plan to parse using
Sebastiaan Koppe (6/10) Dec 12 2020 The performance of pegged isn't great. Neither memory nor cpu

vnr <cfcr gmail.com> writes:

For a project with good performance, I would need to be able to 
analyse text. To do so, I would write a parser by hand using the 
recursive descent algorithm, based on a stream of tokens. I 
started writing a lexer with the d-lex package 
(https://code.dlang.org/packages/d-lex), it works really well, 
unfortunately, it's quite slow for the number of lines I'm aiming 
to analyse (I did a test, for a million lines, it lasted about 3 
minutes). As the parser will only have to manipulate tokens, I 
think that the performance of the lexer will be more important to 
consider. Therefore, I wonder what resources there are, in D, for 
writing an efficient lexer. I could of course write it by hand, 
but I would be interested to know what already exists, so as not 
to reinvent the wheel. Of course, if the standard library (or its 
derivatives, such as mir) has features that could be of interest 
to me for this lexer, I am interested. Thanks to you :)

Dec 11 2020

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Dec 11, 2020 at 07:49:12PM +0000, vnr via Digitalmars-d-learn wrote:
 For a project with good performance, I would need to be able to
 analyse text. To do so, I would write a parser by hand using the
 recursive descent algorithm, based on a stream of tokens. I started
 writing a lexer with the d-lex package
 (https://code.dlang.org/packages/d-lex), it works really well,
 unfortunately, it's quite slow for the number of lines I'm aiming to
 analyse (I did a test, for a million lines, it lasted about 3
 minutes). As the parser will only have to manipulate tokens, I think
 that the performance of the lexer will be more important to consider.
 Therefore, I wonder what resources there are, in D, for writing an
 efficient lexer. I could of course write it by hand, but I would be
 interested to know what already exists, so as not to reinvent the
 wheel. Of course, if the standard library (or its derivatives, such as
 mir) has features that could be of interest to me for this lexer, I am
 interested. Thanks to you :)

If you want a *really* fast lexer, I recommend using GNU Flex
(https://github.com/westes/flex/).  Unfortunately, AFAIK it does not
support D directly; it generates a lexer in C that you then compile.
Fortunately, you can interface with the generated C code quite easily
from D.

I took a quick glance at d-lex, and immediately noticed that every rule
match incurs a GC allocation, which is sure to cause a performance
slowdown.  It does have a nice API, but for a truly fast lexer at the
very least you need to compile the lexing rules down into a fast IR,
which d-lex does not do.  So it's unsurprising that performance isn't
the best.

Some things to watch out for when writing high-performance
string-processing in D:

1) Avoid GC allocations as much as possible.  If some object needs to be
created frequently (e.g., tokens), try your best to make it a struct
rather than a class, to avoid incurring allocation overhead and to
decrease GC pressure. (This is a common mistake by people who write
lexers/parsers in D.)

2) Use slices of the input (e.g. for returning tokens) as much as
possible, avoid .dup/.idup as much as you can.  This may not always be
possible if you're reading from a file; if that's the case, consider
using std.mmfile to memory-map the input into memory so that you can
freely slice over the entire input without needing to handle
caching/paging yourself.  Ideally, your lexer should simply return
slices over the input, wrapped in a struct with an enum tag to indicate
what type of token it is.  Don't bother with things like converting
integers/floats in the input until semantic analysis, or when your
parser needs to create AST nodes that store the actual values. This
eliminates the need for polymorphic token types, which tends to slow
lexers down.

3) Avoid autodecoding unless you need it. Use .byCodeUnit when
processing strings with Phobos algorithms, unless the correctness of
what you need to do happens to depend on decoding Unicode code points.


T

-- 
"A one-question geek test. If you get the joke, you're a geek: Seen on a
California license plate on a VW Beetle: 'FEATURE'..." -- Joshua D. Wachs -
Natural Intelligence, Inc.

Dec 11 2020

vnr <cfcr gmail.com> writes:

On Friday, 11 December 2020 at 20:19:49 UTC, H. S. Teoh wrote:
 On Fri, Dec 11, 2020 at 07:49:12PM +0000, vnr via 
 Digitalmars-d-learn wrote:
 [...]

 If you want a *really* fast lexer, I recommend using GNU Flex 
 (https://github.com/westes/flex/).  Unfortunately, AFAIK it 
 does not support D directly; it generates a lexer in C that you 
 then compile. Fortunately, you can interface with the generated 
 C code quite easily from D.

 [...]

Thank you for this answer :)
I hadn't imagined using Flex, but it's true that the integration 
of C in D is quite impressive.

Dec 12 2020

Bastiaan Veelo <Bastiaan Veelo.net> writes:

On Friday, 11 December 2020 at 19:49:12 UTC, vnr wrote:
 For a project with good performance, I would need to be able to 
 analyse text. To do so, I would write a parser by hand using 
 the recursive descent algorithm, based on a stream of tokens. I 
 started writing a lexer with the d-lex package 
 (https://code.dlang.org/packages/d-lex), it works really well, 
 unfortunately, it's quite slow for the number of lines I'm 
 aiming to analyse (I did a test, for a million lines, it lasted 
 about 3 minutes). As the parser will only have to manipulate 
 tokens, I think that the performance of the lexer will be more 
 important to consider. Therefore, I wonder what resources there 
 are, in D, for writing an efficient lexer.

Have you looked at Pegged [1]? It will give you the lexer and 
parser in one go. I'd be very interested to see how it performs 
on that kind of input.

-- Bastiaan.

[1] https://code.dlang.org/packages/pegged

Dec 12 2020

vnr <cfcr gmail.com> writes:

On Saturday, 12 December 2020 at 16:43:43 UTC, Bastiaan Veelo 
wrote:
 On Friday, 11 December 2020 at 19:49:12 UTC, vnr wrote:
 For a project with good performance, I would need to be able 
 to analyse text. To do so, I would write a parser by hand 
 using the recursive descent algorithm, based on a stream of 
 tokens. I started writing a lexer with the d-lex package 
 (https://code.dlang.org/packages/d-lex), it works really well, 
 unfortunately, it's quite slow for the number of lines I'm 
 aiming to analyse (I did a test, for a million lines, it 
 lasted about 3 minutes). As the parser will only have to 
 manipulate tokens, I think that the performance of the lexer 
 will be more important to consider. Therefore, I wonder what 
 resources there are, in D, for writing an efficient lexer.

 Have you looked at Pegged [1]? It will give you the lexer and 
 parser in one go. I'd be very interested to see how it performs 
 on that kind of input.

 -- Bastiaan.

 [1] https://code.dlang.org/packages/pegged

Yes, I know Pegged, it's a really interesting parser generator 
engine, nevertheless, the grammar of what I would like to analyse 
is not a PEG. But I am also curious to know the performances of 
this tool for very large inputs.

Dec 12 2020

Bastiaan Veelo <Bastiaan Veelo.net> writes:

On Saturday, 12 December 2020 at 18:15:11 UTC, vnr wrote:
 On Saturday, 12 December 2020 at 16:43:43 UTC, Bastiaan Veelo 
 wrote:
 Have you looked at Pegged [1]? It will give you the lexer and 
 parser in one go. I'd be very interested to see how it 
 performs on that kind of input.

 -- Bastiaan.

 [1] https://code.dlang.org/packages/pegged

 Yes, I know Pegged, it's a really interesting parser generator 
 engine, nevertheless, the grammar of what I would like to 
 analyse is not a PEG. But I am also curious to know the 
 performances of this tool for very large inputs.

Are you able to share the grammar? Since you plan to parse using 
recursive descent, I think there is a good chance that the 
language can be defined as a PEG. I am using it to parse Pascal, 
whose grammar was defined long before PEG was a thing.

— Bastiaan.

Dec 12 2020

Sebastiaan Koppe <mail skoppe.eu> writes:

On Saturday, 12 December 2020 at 18:15:11 UTC, vnr wrote:
 Yes, I know Pegged, it's a really interesting parser generator 
 engine, nevertheless, the grammar of what I would like to 
 analyse is not a PEG. But I am also curious to know the 
 performances of this tool for very large inputs.

The performance of pegged isn't great. Neither memory nor cpu 
wise. In many cases that is fine though, but if you are 
performance sensitive, you should look elsewhere.

I did a full JS grammar in pegged, but switched to a handwritten 
one when I realised its poor performance.

Dec 12 2020

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Writing a really fast lexer