digitalmars.D - GSoC 2016 - std.xml rewrite

Lodovico Giaretta <lodovico giaretart.net> writes:

Hi!
My name is Lodovico Giaretta and I'm a second-year CS student at 
the University of Trento, Italy.

I saw on the D ideas page that std.xml needs a rewrite, and I 
also read a thread regarding this need 
(https://forum.dlang.org/thread/vsbsxfeciryrdsjhhfak@forum.dlang.org).

I'd like to participate in this task (in fact, I'm already 
looking at some other XML APIs and sketching some ideas...), 
mainly because I find D to be a very good language and I think 
that a more complete standard library would surely help it 
spread.

What should I do to get in touch while waiting for the 
application period?

Thanks in advance for your time (and your patience)!

Mar 08 2016

CraigDillabaugh <craig.dillabaugh gmail.com> writes:

On Tuesday, 8 March 2016 at 14:55:07 UTC, Lodovico Giaretta wrote:
 Hi!
 My name is Lodovico Giaretta and I'm a second year CS student 
 at University of Trento, Italy.

 I saw on the D ideas page that std.xml needs a rewrite, and I 
 also read a thread regarding this need 
 (https://forum.dlang.org/thread/vsbsxfeciryrdsjhhfak@forum.dlang.org).

 I'd like to participate in this task (in fact, I'm already 
 looking at some other XML APIs and sketching some ideas...), 
 mainly because I find D to be a very good language and I think 
 that a more complete standard library would surely help it 
 spread.

 What should I do to get in touch while waiting for the 
 application period?

 Thanks in advance for your time (and your patience)!

If you email me (craig dot dillabaugh at gmail dot com) I can 
put you in touch with potential mentors (but I don't like posting 
other folks' emails on here).

Also, if you have concrete ideas feel free to post those on here 
if you want feedback.

Mar 08 2016

Lodovico Giaretta <lodovico giaretart.net> writes:

On Tuesday, 8 March 2016 at 16:20:00 UTC, CraigDillabaugh wrote:
 Also, if you have concrete ideas feel free to post those on 
 here if you want feedback.

I was thinking about the general structure of the parsing 
library, and I came up with this schema:

Lexer -> Low Level Parser -> High Level API

There should be various lexers feeding the low level parser, 
differing in the kind of input they accept:
- one accepting InputRanges, which can work with almost any data 
source and does not require the entire input to be available at 
the same time; its drawbacks are that it must allocate lots of 
small strings (one for each token) and grow them one char at a 
time, so it's not so fast;
- one accepting slices, which benefits from fast searches and 
slicing, without needing any additional allocation; its drawback 
is that the entire input has to be loaded in RAM;
- a hybrid lexer, which tries to get the pros of both and the 
cons of neither.
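
To make the slice-based option concrete, here is a minimal sketch (only an illustration, not part of any existing library). The name `SliceLexer` and its two-token model (markup between `<` and `>`, text everywhere else) are assumptions made up for this example; it deliberately ignores comments, CDATA and `>` inside attribute values. The point is only that every token is a slice of the original input, so no allocation happens.

```d
import std.string : indexOf;

/// Hypothetical slice-based lexer: every token is a slice of the input.
/// Simplification: a '>' inside a comment, CDATA section or attribute
/// value is treated as the end of the markup token.
struct SliceLexer
{
    private string input;

    this(string input) { this.input = input; }

    @property bool empty() { return input.length == 0; }

    /// Current token, either "<...>" markup or the text up to the next '<'.
    @property string front() { return input[0 .. tokenLength()]; }

    /// Advancing only moves the start of the slice; nothing is copied.
    void popFront() { input = input[tokenLength() .. $]; }

    private size_t tokenLength()
    {
        if (input[0] == '<')
        {
            immutable end = input.indexOf('>');
            // An unterminated tag swallows the rest of the input.
            return end < 0 ? input.length : cast(size_t) end + 1;
        }
        immutable next = input.indexOf('<');
        return next < 0 ? input.length : cast(size_t) next;
    }
}
```

Since `SliceLexer` exposes `empty`, `front` and `popFront`, it is itself an input range, so the low level parser could consume it with a plain `foreach`.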

There should be various APIs fed by the low level parser. Here 
I took inspiration from Java:
- a DOM API;
- a push parser (like SAX), conceptually similar to the current 
std.xml.ElementParser;
- a pull parser (somewhat inspired by StAX), which provides a 
cursor to scroll through the input and also an InputRange 
interface, for easy integration with other D libraries (like 
std.algorithm).
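
The difference between the push and pull styles can be sketched as follows. This is only an illustration: `Event`, `EventKind`, `pushEvents` and `startTagNames` are hypothetical names, and a plain array of events stands in for the output of the low level parser.

```d
import std.algorithm : filter, map;
import std.array : array;

enum EventKind { startTag, endTag, text }

/// Hypothetical parser event; a real one would carry more information.
struct Event
{
    EventKind kind;
    string value;
}

/// Push style (SAX-like): the parser drives and calls the handler
/// once for every event.
void pushEvents(R)(R events, void delegate(Event) handler)
{
    foreach (e; events)
        handler(e);
}

/// Pull style: the consumer drives. Because the events form an input
/// range, they compose directly with std.algorithm.
string[] startTagNames(Event[] events)
{
    return events
        .filter!(e => e.kind == EventKind.startTag)
        .map!(e => e.value)
        .array;
}
```

As the sketch suggests, a push API can be layered on top of a pull range with a simple loop, while going the other way is harder, since a callback cannot easily be suspended between events.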

Validating the input for well-formedness should probably be done 
between the parser and the high level API, so that structural 
issues (like missing close tags) are found and handled before 
affecting, for example, the building of the Document object.
Checking the validity of the document (such as conformance to the 
DTD) should instead be done on top of the high level API, which 
gives easy access, for example, to namespaces and attributes 
(which the low level parser leaves unparsed).

Both kinds of validators should be pluggable and configurable via 
template parameters, making it possible to select which checks to 
perform and how to handle errors (throwing exceptions, calling 
registered callbacks or whatever).
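
A rough sketch of what "configurable via template parameters" could look like, using one well-formedness check (tag balance) and two error policies. All names here (`TagBalanceChecker`, `ThrowOnError`, `CollectErrors`, `handle`) are hypothetical and only meant to illustrate the idea.

```d
/// Error policy 1 (hypothetical): fail fast with an exception.
struct ThrowOnError
{
    static void handle(string msg) { throw new Exception(msg); }
}

/// Error policy 2 (hypothetical): record the problem and keep going.
struct CollectErrors
{
    string[] errors;
    void handle(string msg) { errors ~= msg; }
}

/// Well-formedness check whose error handling is chosen at compile
/// time through the ErrorHandler template parameter.
struct TagBalanceChecker(ErrorHandler)
{
    ErrorHandler handler;
    private string[] open; // stack of currently open element names

    void startTag(string name) { open ~= name; }

    void endTag(string name)
    {
        if (open.length == 0 || open[$ - 1] != name)
            handler.handle("unexpected closing tag: " ~ name);
        else
            open = open[0 .. $ - 1];
    }

    /// Call at end of input: everything still open is an error.
    void finish()
    {
        foreach (name; open)
            handler.handle("missing closing tag: " ~ name);
    }
}
```

With this layout, `TagBalanceChecker!ThrowOnError` stops at the first problem, while `TagBalanceChecker!CollectErrors` lets the caller inspect `handler.errors` afterwards; a callback-based policy would just be a third struct with a `handle` method.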

After all of this is done, the next step should be an XPath 
library (I don't know much about this).

This is just an early sketch, but I'd love to get some feedback.
Thank you for your time.

Mar 08 2016

Marc Schütz <schuetzm gmx.net> writes:

On Tuesday, 8 March 2016 at 18:01:25 UTC, Lodovico Giaretta wrote:
 - one accepting InputRanges, that can work with almost any data 
 source and does not require the entire input to be available at 
 the same time; its drawbacks are that it must allocate lots of 
 small strings (one for each token) and grow them one char at a 
 time, so it's not so fast;

It could reuse its buffer though, and deal in `const(char)[]` 
instead of `string`. The consumer is responsible for copying if 
they need it. But I guess this will only be useful for the pull 
parser, or for one-pass filters/lookups.
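
A minimal sketch of that buffer-reuse idea, in the same spirit as `File.byLine`. The names `ReusingLexer` and `reusingLexer` are hypothetical, and splitting on spaces stands in for real XML tokenization; the source is assumed to be an input range of `char` (for a `string`, that means going through `std.utf.byCodeUnit` to avoid auto-decoding).

```d
/// Hypothetical lexer over any input range of char. All tokens share
/// one internal buffer, so front is only valid until the next popFront.
struct ReusingLexer(R)
{
    private R source;
    private char[] buffer; // reused for every token
    private size_t used;   // length of the current token
    private bool exhausted;

    this(R source)
    {
        this.source = source;
        popFront(); // load the first token
    }

    @property bool empty() { return exhausted; }

    /// const(char)[] rather than string: the contents will be overwritten.
    @property const(char)[] front() { return buffer[0 .. used]; }

    void popFront()
    {
        used = 0;
        while (!source.empty && source.front == ' ')
            source.popFront(); // skip separators
        if (source.empty)
        {
            exhausted = true;
            return;
        }
        while (!source.empty && source.front != ' ')
        {
            if (used == buffer.length)
                buffer.length = buffer.length * 2 + 16; // grow rarely
            buffer[used++] = source.front;
            source.popFront();
        }
    }
}

/// Convenience function so the range type can be inferred.
auto reusingLexer(R)(R source) { return ReusingLexer!R(source); }
```

A consumer that wants to keep a token has to copy it, for example with `lexer.front.idup`; a one-pass filter can just look at `front` and move on without any allocation per token.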

 This is just an early sketch, but I'd love to get some feedback.
 Thank you for your time.

You might also check out Steven Schveighoffer's experimental IO 
module for stream parsing:

https://forum.dlang.org/thread/na14ul$30uo$1@digitalmars.com

Mar 10 2016
