digitalmars.D - ByToken Range
- Matthias Walter <xammy xammy.homelinux.net> Dec 11 2010
- Christopher Nicholson-Sauls <ibisbasenji gmail.com> Dec 11 2010
- Matthias Walter <xammy xammy.homelinux.net> Dec 12 2010
Hi all,

I wrote a ByToken tokenizer that models Range, i.e. it can be used in a foreach loop to read from a std.stdio.File. For it to work, one has to supply it with a delegate that takes the current buffer and a controller class instance. The delegate is called to extract a token from the unprocessed part of the buffer, and can act as follows (by calling methods of the controller class):

- It can skip some bytes.
- It can succeed, by eating some bytes and setting the token to be read by the front() property.
- It can request more data.
- It can indicate that the data is invalid, in which case further processing is stopped and a user-supplied delegate is invoked that may or may not handle this failure.

It is efficient, because it reuses the same buffer every time and just supplies the user with a slice of unprocessed data. If more data is requested, the remaining unprocessed part is copied to the beginning and more data is read. If there is no such unprocessed data, the buffer is enlarged, i.e. its length is doubled.

The ByToken class has the type of a token as a template parameter.

Does this behavior make sense? Any further suggestions? Is there any interest in having this functionality, i.e. should I create a dsource project, or does everybody use parser-generators for everything?

Matthias
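For readers who want something concrete, here is a rough sketch of how the design described above could look. It is not the code from the post: the names Controller, skip, accept, requestMore, fail, Read and Extract are all made up for illustration, the failure delegate is replaced by a plain exception to keep it short, and the buffer is doubled only when unprocessed data already fills it, which is just one reading of the enlargement rule. The input comes from a delegate so the example can run on in-memory data.

```d
import std.stdio : writeln;

/// Hypothetical controller: the extractor reports its decision through it.
final class Controller(Token)
{
    enum Action { skip, accept, needMore, invalid }
    Action action;
    size_t count;   // bytes consumed by skip/accept
    Token token;    // token produced by accept

    void skip(size_t n) { action = Action.skip; count = n; }
    void accept(size_t n, Token t) { action = Action.accept; count = n; token = t; }
    void requestMore() { action = Action.needMore; }
    void fail() { action = Action.invalid; }
}

/// Hypothetical input range of tokens; Token is the template parameter.
final class ByToken(Token)
{
    alias Ctl = Controller!Token;
    alias Extract = void delegate(const(ubyte)[] data, bool atEnd, Ctl ctl);
    alias Read = size_t delegate(ubyte[] dest); // returns bytes read, 0 at end

    private ubyte[] buf;
    private size_t start, end; // buf[start .. end] is the unprocessed part
    private bool atEnd, done;
    private Token current;
    private Read read;
    private Extract extract;
    private Ctl ctl;

    this(Read read, Extract extract, size_t initialSize = 4)
    {
        this.read = read;
        this.extract = extract;
        buf = new ubyte[initialSize];
        ctl = new Ctl;
        popFront(); // load the first token
    }

    @property bool empty() const { return done; }
    @property Token front() { return current; }

    void popFront()
    {
        for (;;)
        {
            if (start == end && atEnd) { done = true; return; }
            ctl.requestMore(); // default when the extractor calls nothing
            extract(buf[start .. end], atEnd, ctl);
            final switch (ctl.action)
            {
                case Ctl.Action.skip: start += ctl.count; break;
                case Ctl.Action.accept: start += ctl.count; current = ctl.token; return;
                case Ctl.Action.needMore:
                    if (atEnd) throw new Exception("token incomplete at end of input");
                    refill();
                    break;
                case Ctl.Action.invalid: throw new Exception("invalid input");
            }
        }
    }

    private void refill()
    {
        immutable n = end - start;
        if (start > 0)
        {
            // Move the unprocessed part to the beginning (forward copy is safe here).
            foreach (i; 0 .. n) buf[i] = buf[start + i];
            start = 0;
            end = n;
        }
        else if (end == buf.length)
            buf.length = buf.length * 2; // unprocessed data fills the buffer: double it
        immutable got = read(buf[end .. $]);
        if (got == 0) atEnd = true;
        end += got;
    }
}

/// Demo: split space-separated words from an in-memory stand-in for a file.
string[] demoTokens()
{
    auto input = cast(const(ubyte)[]) "alpha  beta gamma";
    size_t pos;

    size_t readChunk(ubyte[] dest) // delivers at most 5 bytes per call
    {
        size_t n = input.length - pos;
        if (n > dest.length) n = dest.length;
        if (n > 5) n = 5;
        dest[0 .. n] = input[pos .. pos + n];
        pos += n;
        return n;
    }

    void word(const(ubyte)[] data, bool atEnd, Controller!string ctl)
    {
        if (data.length && data[0] == ' ') { ctl.skip(1); return; }
        size_t i = 0;
        while (i < data.length && data[i] != ' ') ++i;
        if (i == data.length && !atEnd) { ctl.requestMore(); return; }
        ctl.accept(i, cast(string) data[0 .. i].idup);
    }

    string[] result;
    foreach (tok; new ByToken!string(&readChunk, &word))
        result ~= tok;
    return result;
}

void main()
{
    writeln(demoTokens());
}
```

With a real std.stdio.File, the read delegate could presumably be something like `(ubyte[] dest) => file.rawRead(dest).length`; rawRead rejects an empty buffer, which is why refill always makes room before reading. The demo starts with a 4-byte buffer so that the doubling path gets exercised on the first word.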
Dec 11 2010
On 12/11/10 22:41, Matthias Walter wrote:
Is there any interest in having this functionality, i.e. should I create a dsource project, or does everybody use parser-generators for everything?
I write lexers/parsers relatively often -- and I don't use generators... because I'm masochistic like that! And because there aren't many options for D. There was Enki for D1 a while back, which might still work pretty well, and there's GOLD, although I'm not aware of how their D support is right now. I might be forgetting another.

So I, for one, like the idea of it at the very least. I'd have to see it in action, though, to say much beyond that.

-- Chris N-S
Dec 11 2010
On 12/12/2010 02:04 AM, Christopher Nicholson-Sauls wrote:
So I, for one, like the idea of it at the very least. I'd have to see it in action, though, to say much beyond that.
http://pastebin.com/qjH6y0Mf

As I'm going to use it for one or two real-world file formats, I might change some things, but for now I like it. If you have any suggestions for improvements, please let me know.

Matthias
Dec 12 2010








