digitalmars.D - GSoC 2016 - std.experimental.xml after a month

reply Lodovico Giaretta <lodovico giaretart.net> writes:
-- Brace yourself: a very long post is coming --

Hi,

One month after the official GSoC start, I want to share with you 
what's in std.experimental.xml and what will hopefully be there. 
If you have any questions, suggestions, or anything else to say, just 
leave a comment here or an issue on GitHub 
(https://github.com/lodo1995/experimental.xml).

In particular, if you think there are problems with the current 
structure of the project, or major flaws in the APIs that would 
be very difficult to fix at a later stage, please let me know. 
(Walter and Andrei, I'd really appreciate your feedback here).

Thank you in advance to all who will take time to read this...

What is working?
- Four lexers are provided to abstract different kinds of input 
from the other layers, providing different speed characteristics;
- The parser splits the document into nodes, doing most of the 
hard work;
- A cursor sits on top of the parser, providing an API to advance 
in the document and get information about the current node; it 
supports string interning, which can drastically lower memory 
consumption (given that most nodes share names and attributes);
- A validating cursor is the same as a cursor, but allows the 
user to plug in custom validators that are executed while advancing 
in the input; in the future the library will provide some 
predefined validators to use with it;
- A very simple SAX API built on top of the cursor API is the 
last thing added and tested;
- A partial reimplementation of std.xml is there; when completed 
it will allow a gradual code transition.
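
To illustrate the string interning point in the list above, here is a minimal sketch in D. It is not taken from the repository: `StringInterner` and `intern` are hypothetical names, and it uses the GC where a real implementation would probably use a custom allocator. It can be tried with `rdmd -main -unittest`.

```d
/// Hypothetical interner: maps equal strings to one shared copy.
struct StringInterner
{
    private string[string] pool; // key and value are the same canonical string

    /// Returns the canonical copy of `s`, storing it on first sight.
    string intern(const(char)[] s)
    {
        if (auto found = s in pool)
            return *found;
        string owned = s.idup; // copy once; later lookups reuse this copy
        pool[owned] = owned;
        return owned;
    }

    /// Number of distinct strings stored so far.
    @property size_t length() const { return pool.length; }
}

unittest
{
    StringInterner interner;
    // Two separate buffers with equal content, as a parser would see
    // when the same tag name appears twice in the input.
    char[] first = "item".dup;
    char[] second = "item".dup;

    auto a = interner.intern(first);
    auto b = interner.intern(second);

    assert(a == b);
    assert(a.ptr is b.ptr);       // both results share one allocation
    assert(interner.length == 1); // only one copy is kept
}
```

Since element and attribute names repeat heavily in most documents, each distinct name is stored once, which is where the memory saving comes from.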

What am I working on right now?
I'm trying to implement the DOM level 3 API. The API per se is 
not that difficult, but the infrastructure I'm building around it 
is hell. In fact, I'm trying to make the DOM nodes reference 
counted and allocated with a custom allocator, to allow their 
usage in @nogc code. This is quite painful (because the DOM has 
lots of circular references, and "normal" reference counting does 
not work with them), but with enough time I will probably manage 
to make it work.
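
To make the circular reference problem concrete, here is a small sketch of one common workaround: children are owned (counted), while the parent back-pointer is weak (not counted). This is only an illustration under assumptions, not the design used in the repository; `Node`, `create`, `retain`, `release`, `adopt` and `liveNodes` are all made-up names, and plain `malloc`/`free` stand in for a custom allocator.

```d
import core.stdc.stdlib : free, malloc;

/// Hypothetical node type; manual counting keeps everything @nogc.
struct Node
{
    size_t refs = 1;   // strong (owning) references
    Node* parent;      // weak back-pointer: never counted
    Node*[2] children; // strong references to at most two children
}

__gshared size_t liveNodes; // bookkeeping for the demo only

Node* create() @nogc nothrow
{
    auto node = cast(Node*) malloc(Node.sizeof);
    assert(node !is null, "out of memory");
    *node = Node.init;
    ++liveNodes;
    return node;
}

Node* retain(Node* node) @nogc nothrow
{
    ++node.refs;
    return node;
}

void release(Node* node) @nogc nothrow
{
    if (node is null || --node.refs > 0)
        return;
    foreach (child; node.children)
    {
        if (child is null)
            continue;
        child.parent = null; // the parent is going away
        release(child);
    }
    free(node);
    --liveNodes;
}

/// Stores `child` in an empty `slot`, taking over the caller's reference.
void adopt(Node* parent, size_t slot, Node* child) @nogc nothrow
{
    assert(parent.children[slot] is null, "slot already in use");
    parent.children[slot] = child;
    child.parent = parent; // not counted, so no cycle is formed
}

unittest
{
    auto root = create();
    auto kid = create();
    adopt(root, 0, kid);      // root now owns kid
    auto extra = retain(kid); // an outside handle to the child

    release(root);            // succeeds although kid pointed back at root
    assert(liveNodes == 1);
    assert(extra.parent is null);

    release(extra);
    assert(liveNodes == 0);
}
```

If `parent` were counted as well, `root` would start with two references as soon as it had a child, and releasing the last outside handle would never bring its count to zero. The price of the weak pointer is that a child kept alive on its own loses its parent, which may or may not be acceptable for a DOM.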

What is planned for the near future?
- Once the DOM classes are usable (even if not 100% complete), 
I will start working on a DOM parser to build them from the 
source;
- DTD check and entity substitution have to be implemented, and 
they will (I hope) fit nicely as pluggable components for the 
validating cursor;
- And of course some APIs to output XML.

What is (incidentally) inside the repository?
- Along with the DOM classes comes a wrapper that allocates 
classes with a custom allocator and reference-counts them 
(that is, a RefCounted!T that works only for classes);
- A wonderful (or maybe not) benchmark driver that benchmarks the 
various components with various kinds of randomly generated files 
and prints some wonderful statistics and graphs;
- Needed by the benchmarking code, a simple API to collect 
statistical information (average, median, deviation) from a range 
of measurements;
- Needed by the cursor API, an Interner that can intern not only 
strings, but any array or class.
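
As a rough idea of what collecting statistics from a range of measurements can look like in D, here is a short sketch. `Summary` and `summarize` are hypothetical names chosen for this example, not the repository's API, and the deviation computed here is assumed to be the population standard deviation.

```d
import std.algorithm : map, sort, sum;
import std.array : array;
import std.math : sqrt;
import std.range : isInputRange;

/// Hypothetical result type for this sketch.
struct Summary
{
    double average;
    double median;
    double deviation; // population standard deviation
}

/// Computes average, median and deviation of any input range of numbers.
Summary summarize(R)(R measures)
if (isInputRange!R)
{
    // Copy into an array: the median needs sorted random access.
    double[] data = measures.map!(m => cast(double) m).array;
    assert(data.length > 0, "need at least one measure");
    sort(data);

    immutable n = data.length;
    immutable mean = data.sum / n;
    immutable median = (n % 2 == 1)
        ? data[n / 2]
        : (data[n / 2 - 1] + data[n / 2]) / 2;
    immutable variance = data.map!(x => (x - mean) * (x - mean)).sum / n;
    return Summary(mean, median, sqrt(variance));
}

unittest
{
    auto s = summarize([2, 4, 4, 4, 5, 5, 7, 9]);
    assert(s.average == 5);
    assert(s.median == 4.5);
    assert(s.deviation == 2);
}
```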

Thank you again for your time and help.

Lodovico Giaretta
Jun 23 2016
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 23/06/16 22:04, Lodovico Giaretta wrote:

 What is working?
 - Four lexers are provided to abstract different kinds of input from the
 other layers, providing different speed characteristics;
 - The parser splits the document into nodes, doing most of the hard work;
 - A cursor sits on top of the parser, providing an API to advance in the
 document and get information about the current node; it supports string
 interning, which can drastically lower memory consumption (given that
 most nodes share names and attributes);
 - A validating cursor is the same as a cursor, but allows the user to
 plug in custom validators that are executed while advancing in the input;
 in the future the library will provide some predefined validators to use
 with it;
 - A very simple SAX API built on top of the cursor API is the last thing
 added and tested;
 - A partial reimplementation of std.xml is there; when completed it will
 allow a gradual code transition.
Any range API, or plans for one?

-- 
/Jacob Carlborg
Jun 25 2016
parent reply Lodovico Giaretta <lodovico giaretart.net> writes:
On Saturday, 25 June 2016 at 15:59:40 UTC, Jacob Carlborg wrote:
 Any range API, or plan for?
Hi,

I'm definitely going to provide a wrapper around the Cursor API that will provide InputRange access to all the children of the current node (this way the tree structure is maintained). I plan to upload it in a couple of days.

I'm still pondering whether it's also worth providing a way to get a Range over the entire document tree, flattened; its nodes would be in one-to-one correspondence with the events generated by a SAX parser (another API which does not preserve the tree structure).
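
A wrapper of this kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: `ToyCursor` is a stand-in that walks a pre-built event list rather than the real Cursor API, and `Event`, `Kind` and `Children` are invented names. Only the `enter`/`next` vocabulary is borrowed from this thread.

```d
import std.algorithm : equal;
import std.range.primitives : isInputRange;

enum Kind { open, close, text }

struct Event
{
    Kind kind;
    string value; // tag name or text content
}

/// Toy stand-in for a cursor: walks a pre-lexed list of events.
struct ToyCursor
{
    const(Event)[] events;
    size_t pos;

    Kind kind() const { return events[pos].kind; }
    string value() const { return events[pos].value; }

    /// Moves to the first child; false if the current node has none.
    bool enter()
    {
        if (kind != Kind.open || events[pos + 1].kind == Kind.close)
            return false;
        ++pos;
        return true;
    }

    /// Moves to the next sibling; returns false (leaving the cursor on
    /// the parent's closing tag) when there is none.
    bool next()
    {
        int depth = 0;
        do
        {
            if (kind == Kind.open) ++depth;
            else if (kind == Kind.close) --depth;
            ++pos;
        }
        while (depth > 0);
        return pos < events.length && kind != Kind.close;
    }
}

/// Input range over the children of the node the cursor is on.
struct Children
{
    private ToyCursor* cursor;
    private bool done;

    this(ToyCursor* c)
    {
        cursor = c;
        done = !cursor.enter();
    }

    @property bool empty() const { return done; }
    @property string front() const { return cursor.value; }
    void popFront() { done = !cursor.next(); }
}

static assert(isInputRange!Children);

unittest
{
    auto doc = [
        Event(Kind.open, "root"),
        Event(Kind.open, "a"),
        Event(Kind.text, "hello"),
        Event(Kind.close, "a"),
        Event(Kind.open, "b"),
        Event(Kind.close, "b"),
        Event(Kind.close, "root"),
    ];
    auto cursor = ToyCursor(doc);
    // Only direct children are visited; the text inside <a> is skipped.
    assert(Children(&cursor).equal(["a", "b"]));
}
```

Because the range shares the cursor through a pointer, it is a pure input range: it can be traversed once, and nested ranges have to be consumed before the outer one is advanced.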
Jun 25 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/25/16 12:26 PM, Lodovico Giaretta wrote:
 On Saturday, 25 June 2016 at 15:59:40 UTC, Jacob Carlborg wrote:
 Any range API, or plan for?
 Hi, I'm definitely going to provide a wrapper around the Cursor API, that will provide InputRange access to all the children of the current node (this way the tree structure is maintained). I plan to upload it in a couple of days. I'm still pondering if it's worth providing also a way to get a Range of the entire document tree, flattened; its nodes would be in one-to-one correspondence with the events generated by a SAX parser (another API which does not preserve the tree structure).
When I had the gumption to try and make an XML parser range, the idea I had was to have the current element's tag and attributes, all parent elements' tags and attributes, and the currently parsed entity inside the element. If the entity you were parsing was an element, you could either popFront it, to get to the next element, or descend into its children, and the element and its attributes would be pushed onto the "element" stack.

The idea is to keep all the context alive, but not have to keep the entire file in memory.

Anyway, that's how I envisioned it. Haven't finished my i/o package yet, so it didn't materialize :)

-Steve
Jun 25 2016
parent reply Lodovico Giaretta <lodovico giaretart.net> writes:
On Saturday, 25 June 2016 at 20:16:09 UTC, Steven Schveighoffer 
wrote:
 When I had the gumption to try and make an XML parser range, 
 the idea I had was to have the current element's tag and 
 attributes, all parent elements' tags and attributes, and the 
 currently parsed entity inside the element. If the entity you 
 were parsing was an element, you could either popFront it, to 
 get to the next element, or descend into its children, and the 
 element and its attributes would be pushed onto the "element" 
 stack.

 The idea is to keep all the context alive, but not have to keep 
 the entire file in memory.

 Anyway, that's how I envisioned it. Haven't finished my i/o 
 package yet, so it didn't materialize :)

 -Steve
Thank you for your feedback. Your idea is similar to my Cursor API, which has next() to reach the next element and enter()/exit() to descend to the first child or ascend to the closing tag of the parent.

But my implementation does not maintain the state of the parents, so you can't get any info about the parent from the children, unless you use exit() (in which case, you can get the parent's name from the closing tag).

But your idea about a stack keeping all the context information is quite valuable, given that some validations need it (e.g. checking that all prefixes have been declared, and retrieving prefix/namespace associations).
Jun 25 2016
parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 06/25/2016 10:33 PM, Lodovico Giaretta wrote:
 
 But my implementation does not maintain the state of the parents, so you
 can't get any info about the parent from the children, unless you use
 exit() (in which case, you can get the parent's name from the closing tag).
 
 But your idea about a stack keeping all the context information is
 quite valuable, given that some validations need it (e.g. checking
 that all prefixes have been declared, and retrieving prefix/namespace
 associations).
In any case it should be optional so that use cases that don't need parent information can avoid the overhead.
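
One way to make the parent stack optional in D is a compile-time flag, so that users who do not need it pay nothing for it. The following is a sketch under assumptions, not code from the library: `Walker`, `ElementInfo`, `Attribute`, `onOpen`, `onClose` and `resolvePrefix` are all hypothetical names.

```d
import std.typecons : Flag, No, Yes;

struct Attribute { string name, value; }
struct ElementInfo { string name; Attribute[] attributes; }

/// Hypothetical event consumer with optional ancestor tracking.
struct Walker(Flag!"trackParents" trackParents)
{
    size_t depth;

    static if (trackParents)
    {
        ElementInfo[] ancestors; // stack of currently open elements

        /// Finds the namespace bound to `prefix`, innermost element first.
        /// Returns null if the prefix was never declared.
        string resolvePrefix(string prefix) const
        {
            immutable key = "xmlns:" ~ prefix;
            foreach_reverse (element; ancestors)
                foreach (attr; element.attributes)
                    if (attr.name == key)
                        return attr.value;
            return null;
        }
    }

    void onOpen(ElementInfo info)
    {
        static if (trackParents)
            ancestors ~= info;
        ++depth;
    }

    void onClose()
    {
        static if (trackParents)
            ancestors = ancestors[0 .. $ - 1];
        --depth;
    }
}

unittest
{
    Walker!(Yes.trackParents) full;
    full.onOpen(ElementInfo("root", [Attribute("xmlns:x", "urn:example")]));
    full.onOpen(ElementInfo("x:child"));
    assert(full.resolvePrefix("x") == "urn:example");
    assert(full.resolvePrefix("y") is null); // undeclared prefix
    full.onClose();
    full.onClose();
    assert(full.depth == 0);

    // Without the flag the stack does not exist at all.
    Walker!(No.trackParents) light;
    static assert(!__traits(hasMember, typeof(light), "ancestors"));
}
```

With `No.trackParents` the `ancestors` field and the lookup function are simply not compiled in, so there is neither a memory nor a runtime cost for that instantiation.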
Jun 27 2016
parent reply Lodovico Giaretta <lodovico giaretart.net> writes:
On Monday, 27 June 2016 at 22:36:36 UTC, Martin Nowak wrote:
 On 06/25/2016 10:33 PM, Lodovico Giaretta wrote:
 But your idea about a stack keeping all the context 
 information is quite valuable, given that some validations 
 need it (e.g. checking that all prefixes have been declared, 
 and retrieving prefix/namespace associations).
 In any case it should be optional so that use cases that don't need parent information can avoid the overhead.
I absolutely agree on this.
Jun 28 2016
parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/28/16 4:04 AM, Lodovico Giaretta wrote:
 On Monday, 27 June 2016 at 22:36:36 UTC, Martin Nowak wrote:
 On 06/25/2016 10:33 PM, Lodovico Giaretta wrote:
 But your idea about a stack keeping all the context information is
 quite valuable, given that some validations need it (e.g. checking
 that all prefixes have been declared, and retrieving prefix/namespace
 associations).
 In any case it should be optional so that use cases that don't need parent information can avoid the overhead.
 I absolutely agree on this.
In the case of ranges, front should be accessible, which means you have to cache it. It should be optional in the sense that if you are not parsing using nested ranges, but straight SAX, then you don't need to save state.

-Steve
Jun 28 2016
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/23/2016 1:04 PM, Lodovico Giaretta wrote:
 One month after the official GSoC start, I want to share with you what's in
 std.experimental.xml and what will hopefully be there
Thank you for your hard work on this important project.

Please ensure that it has a range interface - a range is used as input.
Jun 25 2016
parent Lodovico Giaretta <lodovico giaretart.net> writes:
On Saturday, 25 June 2016 at 20:32:33 UTC, Walter Bright wrote:
 Thank you for your hard work on this important project.

 Please ensure that it has a range interface - a range is used 
 as input.
You mean that the document source may be a range? In that case, I have already implemented a lexer that works with any InputRange, and another one that works with ForwardRanges (I hoped it would be way faster than the first, but it currently isn't).

Both are way slower than the two lexers based on slices, of course.
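
A small sketch can show why a lexer over a generic InputRange tends to be slower than one over slices. The two functions below are invented for illustration (`readUntil` and `sliceUntil` are not names from the library): the generic one has to copy characters into a buffer, while the slice one only adjusts two views into the original text.

```d
import std.algorithm : filter;
import std.array : appender;
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;
import std.traits : isSomeChar;

/// Generic version: works on any input range of characters, but must
/// copy each character because the source cannot be sliced.
string readUntil(R)(ref R input, dchar delimiter)
if (isInputRange!R && isSomeChar!(ElementType!R))
{
    auto buffer = appender!string();
    while (!input.empty && input.front != delimiter)
    {
        buffer.put(input.front);
        input.popFront();
    }
    return buffer.data;
}

/// Slice version: no copying, the result is a view into the input.
string sliceUntil(ref string input, char delimiter)
{
    size_t i = 0;
    while (i < input.length && input[i] != delimiter)
        ++i;
    auto token = input[0 .. i];
    input = input[i .. $];
    return token;
}

unittest
{
    // A lazily filtered range: usable as a range, but it cannot be sliced.
    auto lazyInput = "<a>text</a>".filter!(c => c != '\r');
    lazyInput.popFront(); // skip '<'
    assert(readUntil(lazyInput, '>') == "a");

    string original = "<a>text</a>";
    string rest = original[1 .. $]; // skip '<'
    auto token = sliceUntil(rest, '>');
    assert(token == "a");
    assert(token.ptr is original.ptr + 1); // same memory, nothing copied
    assert(rest == ">text</a>");
}
```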
Jun 25 2016
prev sibling parent reply Nikolay <sibnick gmail.com> writes:
On Thursday, 23 June 2016 at 20:04:26 UTC, Lodovico Giaretta 
wrote:
 -- Brace yourself: a very long post is coming --

 What is planned for the near future?
 - Once the DOM classes are usable (even if not 100% 
 complete) I will start working on a DOM parser to build them 
 from the source;
 - DTD check and entity substitution have to be implemented, and 
 they will (I hope) fit nicely as pluggable components for the 
 validating cursor;
DOM - Any plans for XPath?

DTD check - What about XSD? XSD is more popular now.

Also it would be nice to have something like JAXB (automatically bind and map DLang structs/classes to/from XML). But it may be part of the next iteration or project.

PS

I found https://github.com/lodo1995/experimental.xml/issues/11, so you have actually already answered my questions about XSD & XPath.
Jun 28 2016
parent reply Lodovico Giaretta <lodovico giaretart.net> writes:
On Tuesday, 28 June 2016 at 12:14:40 UTC, Nikolay wrote:
 DOM - Any plans for Xpath?

 DTD check - What about XSD? XSD is more popular now.

 Also it would be nice to have something like JAXB 
 (automatically bind and map DLang struct/classes to/from XML). 
 But it may be part of next iteration or project.

 PS

 I found https://github.com/lodo1995/experimental.xml/issues/11, 
 so you have actually already answered my questions about XSD & 
 XPath.
About XPath and XSD, you already found the answer.

About automatic binding (JAXB like), I wasn't planning it, but the idea is that if we have a good, modular, extendable XML library, then we can build everything on top of it as separate projects. This is because one person cannot maintain all the possible extensions (XPath, XSD, JAXB, XQuery, ...) and because most of these should be provided by external packages, not by the standard library (otherwise the std XML library would end up being half the entire size of Phobos).

Thank you for your feedback.
Jun 29 2016
parent Ola Fosheim Grøstad writes:
On Wednesday, 29 June 2016 at 12:06:23 UTC, Lodovico Giaretta 
wrote:
 the idea is that if we have a good, modular, extendable XML 
 library, then we can build everything on top of it as separate 
 projects. This is because one person cannot maintain all the 
 possible extensions (XPath, XSD, JAXB, XQuery, ...)
Indeed, but maybe you would consider making it possible either to inject a user-defined field into the Element node or to provide a custom Element node as a generic parameter, for bookkeeping. And some hooks for when the DOM is built/mutated.
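
In D, the first option could be sketched roughly as follows. This is purely illustrative and assumes nothing about the library's real DOM classes: `Element`, `userData`, `onAppend` and `XPathCache` are hypothetical names.

```d
/// Hypothetical DOM element carrying an optional user-defined payload.
class Element(Payload = void)
{
    string name;
    Element[] children;

    // The field only exists when a payload type is requested.
    static if (!is(Payload == void))
        Payload userData;

    /// Optional hook invoked after a child has been appended.
    void delegate(Element parent, Element child) onAppend;

    this(string name) { this.name = name; }

    Element appendChild(Element child)
    {
        children ~= child;
        if (onAppend !is null)
            onAppend(this, child);
        return child;
    }
}

/// Example of bookkeeping data an extension might want to attach.
struct XPathCache { size_t childCount; }

unittest
{
    alias Node = Element!XPathCache;
    auto root = new Node("root");
    root.onAppend = (Node parent, Node child) {
        ++parent.userData.childCount;
    };
    root.appendChild(new Node("a"));
    root.appendChild(new Node("b"));
    assert(root.userData.childCount == 2);

    // The default instantiation carries no extra field.
    auto plain = new Element!()("plain");
    static assert(!__traits(hasMember, typeof(plain), "userData"));
}
```

An extension such as an XPath engine could then keep its own data on each node and stay in sync through the hook, without the core library knowing anything about it.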
Jun 29 2016