
digitalmars.D - ByToken Range

Matthias Walter <xammy xammy.homelinux.net> writes:
Hi all,

I wrote a ByToken tokenizer that models Range, i.e. it can be used in a
foreach loop to read from a std.stdio.File. To use it, one supplies a
delegate that takes the current buffer and a controller class instance.
The delegate is called to extract a token from the unprocessed part of
the buffer and can act in one of the following ways (by calling methods
on the controller; see the sketch after the list):

- It can skip some bytes.
- It can succeed by consuming some bytes and setting the token that the
front() property will return.
- It can request more data.
- It can indicate that the data is invalid, in which case further
processing stops and a user-supplied delegate is invoked that may or
may not handle the failure.
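
To make that concrete, here is a minimal sketch of what such a delegate
and controller could look like. All names here (Controller, skip, accept,
needMore, fail, Extractor) are illustrative assumptions, not the actual
API:

import std.stdio : File;

// Illustrative token type (ByToken takes this as a template parameter).
struct Token { string text; }

// Hypothetical controller interface; the method names are assumptions.
interface Controller
{
    void skip(size_t n);            // discard n bytes without producing a token
    void accept(size_t n, Token t); // consume n bytes; t becomes front()
    void needMore();                // ask the range to read more data
    void fail();                    // data is invalid; stop further processing
}

// The user-supplied delegate: inspect the unprocessed slice and react
// through the controller.
alias Extractor = void delegate(const(char)[] unprocessed, Controller c);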


It is efficient because it reuses the same buffer every time and only
hands the user a slice of the unprocessed data. If more data is
requested, the remaining unprocessed part is copied to the beginning of
the buffer and more data is read in behind it. If no space can be
reclaimed this way (the unprocessed part already fills the buffer), the
buffer is enlarged, i.e. its length is doubled. A rough sketch of this
refill strategy follows below.
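
Here is that refill strategy sketched out; the function and variable
names are mine for illustration, not from the actual implementation:

import std.stdio : File;

// Buffer layout: [0 .. start) already processed, [start .. end) unprocessed.
void refill(ref char[] buffer, ref size_t start, ref size_t end, File file)
{
    if (start > 0)
    {
        // Move the unprocessed tail to the front to reclaim space.
        foreach (i; 0 .. end - start)
            buffer[i] = buffer[start + i];
        end -= start;
        start = 0;
    }
    else if (end == buffer.length)
    {
        // No space to reclaim: the unprocessed data fills the buffer,
        // so double its length instead.
        buffer.length *= 2;
    }
    // Read fresh data in after the unprocessed part.
    auto got = file.rawRead(buffer[end .. $]);
    end += got.length;
}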

The ByToken class takes the token type as a template parameter.

Does this behavior make sense? Any further suggestions? Is there any
interest in having this functionality, i.e. should I create a dsource
project, or does everybody use parser generators for everything?

Matthias
Dec 11 2010
Christopher Nicholson-Sauls <ibisbasenji gmail.com> writes:
On 12/11/10 22:41, Matthias Walter wrote:
 [...original post snipped...]
I write lexers/parsers relatively often -- and I don't use generators...
because I'm masochistic like that! And because there aren't many options
for D. There was Enki for D1 a while back, which might still work pretty
well, and there's GOLD, although I'm not aware of how its D support is
right now. I might be forgetting another.

So I, for one, like the idea of it at the very least. I'd have to see it
in action, though, to say much beyond that.

-- Chris N-S
Dec 11 2010
Matthias Walter <xammy xammy.homelinux.net> writes:
On 12/12/2010 02:04 AM, Christopher Nicholson-Sauls wrote:
 On 12/11/10 22:41, Matthias Walter wrote:
 [...original post snipped...]

 So I, for one, like the idea of it at the very least. I'd have to see it
 in action, though, to say much beyond that.
My current version can be used as follows to yield a simple
word-tokenizer: http://pastebin.com/qjH6y0Mf

As I'm going to use it for one or two real-world file formats I might
change some things, but for now I like it. If you have any suggestions
for improvements, please let me know.

Matthias
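
For readers who don't want to follow the link, here is a guess at how
such a word tokenizer might be wired up, reusing the illustrative
Controller/Token names from the sketches earlier in the thread; the
byToken constructor shape is an assumption, not the pastebin code:

import std.ascii : isWhite;
import std.stdio : File, writeln;

void main()
{
    // Hypothetical extraction delegate for whitespace-separated words.
    void extractWord(const(char)[] data, Controller c)
    {
        size_t i = 0;
        while (i < data.length && isWhite(data[i])) ++i;
        if (i > 0) { c.skip(i); return; }               // drop leading whitespace
        while (i < data.length && !isWhite(data[i])) ++i;
        if (i == data.length) { c.needMore(); return; } // word may continue
        c.accept(i, Token(data[0 .. i].idup));          // complete word found
    }

    // Assumed constructor shape; the real ByToken API may differ.
    auto tokens = byToken!Token(File("input.txt"), &extractWord);
    foreach (tok; tokens)
        writeln(tok.text);
}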
Dec 12 2010