digitalmars.D - GSoC 2016 - std.experimental.xml after a month

Lodovico Giaretta <lodovico giaretart.net> writes:

-- Brace yourself: a very long post is coming --

Hi,

One month after the official GSoC start, I want to share with you 
what's in std.experimental.xml and what will hopefully be there. 
If you have any questions, suggestions, or anything else to say, 
just leave a comment here or open an issue on GitHub 
(https://github.com/lodo1995/experimental.xml).

In particular, if you think there are problems with the current 
structure of the project, or major flaws in the APIs, that will 
be very difficult to solve at a later stage, please let me know. 
(Walter and Andrei, I'd really appreciate your feedback here).

Thank you in advance to all who will take time to read this...

What is working?
- Four lexers are provided to abstract different kinds of input 
from the other layers, providing different speed characteristics;
- The parser splits the document into nodes, doing most of the 
hard work;
- A cursor sits on top of the parser, providing an API to advance 
in the document and get information about the current node; it 
supports string interning, which can drastically lower memory 
consumption (given that most nodes share names and attributes);
- A validating cursor is the same as a cursor, but allows the 
user to plug in custom validators, which are executed while 
advancing in the input; in the future the library will provide 
some predefined validators to use with it;
- A very simple SAX API built on top of the cursor API is the 
last thing added and tested;
- A partial reimplementation of std.xml is there; when completed 
it will allow a gradual code transition.

What am I working on right now?
I'm trying to implement the DOM level 3 API. The API per se is 
not that difficult, but the infrastructure I'm building around it 
is hell. In fact, I'm trying to make the DOM nodes reference 
counted and allocated with a custom allocator, to allow their 
usage in @nogc code. This is quite painful (because the DOM has 
lots of circular references, and "normal" reference counting does 
not work with them), but with enough time I will probably manage 
to make it work.
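
To illustrate why cycles defeat plain reference counting, here is a minimal sketch. It is not the code in the repository: `Node`, `make`, `link`, `release` and `leakedAfterDrop` are hypothetical names, the nodes are ordinary GC classes rather than allocator-backed ones, and the non-counted parent link is just one common workaround, not necessarily the one the project uses.

```d
/// Hypothetical DOM-like node with a manual reference count.
final class Node
{
    int refs;            // number of counted references to this node
    Node child;          // counted link, parent -> child
    Node parent;         // back link, child -> parent
    bool parentCounted;  // whether the back link holds a count
}

int alive; // nodes created and not yet released

Node make()
{
    ++alive;
    auto n = new Node;
    n.refs = 1; // the handle returned to the caller
    return n;
}

void release(Node n)
{
    if (n is null || --n.refs > 0)
        return;
    --alive;          // a real implementation would free the memory here
    release(n.child); // drop the counted link to the child
    if (n.parentCounted)
        release(n.parent);
}

void link(Node parent, Node child, bool countParent)
{
    parent.child = child;
    ++child.refs;
    child.parent = parent;
    child.parentCounted = countParent;
    if (countParent)
        ++parent.refs;
}

/// Builds parent + child, drops both user handles, reports survivors.
int leakedAfterDrop(bool countParent)
{
    alive = 0;
    auto p = make();
    auto c = make();
    link(p, c, countParent);
    release(c);
    release(p);
    return alive;
}

void main()
{
    assert(leakedAfterDrop(true) == 2);  // cycle keeps both nodes alive
    assert(leakedAfterDrop(false) == 0); // non-counted back link: all freed
}
```

With a counted back link each node keeps the other at a count of one forever; with a non-counted back link, releasing the parent cascades to the child.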

What is planned for the near future?
- When the DOM classes are usable (even if not 100% complete) 
I will start working on a DOM parser to build them from the 
source;
- DTD check and entity substitution have to be implemented, and 
they will (I hope) fit nicely as pluggable components for the 
validating cursor;
- And of course some APIs to output XML.

What is (incidentally) inside the repository?
- Along with the DOM classes comes a wrapper that allows classes 
to be allocated with a custom allocator and reference counted 
(that is, a RefCounted!T that works only for classes);
- A wonderful (or maybe not) benchmark driver that benchmarks the 
various components with various kinds of randomly generated files 
and prints some wonderful statistics and graphs;
- Needed by the benchmarking code, a simple API to collect 
statistics (average, median, deviation) from a range of 
measurements;
- Needed by the cursor API, an Interner that can intern not only 
strings, but any array or class.
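
As a rough idea of what interning buys, here is a minimal sketch for strings only. `StringInterner` and its members are hypothetical names and this is not the Interner from the repository (which, as said above, also handles arrays and classes); it only shows the principle that equal names end up sharing one copy.

```d
/// Hypothetical string-only interner.
struct StringInterner
{
    private string[string] pool; // content -> canonical copy

    /// Returns the canonical copy of `s`, registering it on first sight.
    string intern(string s)
    {
        if (auto existing = s in pool)
            return *existing;
        pool[s] = s;
        return s;
    }

    /// Number of distinct strings stored so far.
    size_t length() const
    {
        return pool.length;
    }
}

void main()
{
    StringInterner interner;

    // Two separately allocated strings with the same content,
    // as two element names coming from different parts of a file.
    string first = "item".idup;
    string second = "item".idup;
    assert(first !is second);

    string a = interner.intern(first);
    string b = interner.intern(second);

    assert(a is b);               // same memory, not just equal content
    assert(interner.length == 1); // only one copy is kept
}
```

After interning, name comparisons can also be done with `is` (pointer and length) instead of comparing characters.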

Thank you again for your time and help.

Lodovico Giaretta

Jun 23 2016

Jacob Carlborg <doob me.com> writes:

On 23/06/16 22:04, Lodovico Giaretta wrote:

 What is working?
 - Four lexers are provided to abstract different kinds of input from the
 other layers, providing different speed characteristics;
 - The parser splits the document into nodes, doing most of the hard work;
 - A cursor sits on top of the parser, providing an API to advance in the
 document and get information about the current node; it supports string
 interning, which can drastically lower memory consumption (given that
 most nodes share names and attributes);
 - A validating cursor is the same as a cursor, but allows the user to
 plug custom validators, that are executed while advancing in the input;
 in the future the library will provide some predefined validators to use
 with it;
 - A very simple SAX API built on top of the cursor API is the last thing
 added and tested;
 - A partial reimplementation of std.xml is there; when completed it will
 allow a gradual code transition.

Any range API, or plan for?

-- 
/Jacob Carlborg

Jun 25 2016

Lodovico Giaretta <lodovico giaretart.net> writes:

On Saturday, 25 June 2016 at 15:59:40 UTC, Jacob Carlborg wrote:
 Any range API, or plan for?

Hi,

I'm definitely going to provide a wrapper around the Cursor API, 
that will provide InputRange access to all the children of the 
current node (this way the tree structure is maintained). I plan 
to upload it in a couple of days.

I'm still pondering whether it's also worth providing a way to 
get a Range of the entire document tree, flattened; its nodes 
would be in one-to-one correspondence with the events generated 
by a SAX parser (another API which does not preserve the tree 
structure).
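
A small sketch of the two shapes, under heavy assumptions: `Node`, `ChildRange`, `childrenOf` and `flatten` are hypothetical names, and the tree is held in memory here, whereas the planned wrapper would pull nodes from the Cursor as it goes.

```d
import std.algorithm : equal, map;
import std.range.primitives : isInputRange;

/// Hypothetical in-memory node, standing in for what a cursor exposes.
struct Node
{
    string name;
    Node[] children;
}

/// Hypothetical InputRange over the direct children of one node.
struct ChildRange
{
    private Node[] rest;

    bool empty() const { return rest.length == 0; }
    Node front() { return rest[0]; }
    void popFront() { rest = rest[1 .. $]; }
}

static assert(isInputRange!ChildRange);

ChildRange childrenOf(Node parent)
{
    return ChildRange(parent.children);
}

/// Flattened view: one entry per SAX-like event, tree structure lost.
void flatten(Node n, ref string[] events)
{
    events ~= "start " ~ n.name;
    foreach (child; childrenOf(n))
        flatten(child, events);
    events ~= "end " ~ n.name;
}

void main()
{
    // <a><b><d/></b><c/></a>
    auto root = Node("a", [Node("b", [Node("d")]), Node("c")]);

    // Per-level range: only the direct children, structure preserved.
    assert(childrenOf(root).map!(n => n.name).equal(["b", "c"]));

    // Flattened sequence: every node, in document order.
    string[] events;
    flatten(root, events);
    assert(events == ["start a", "start b", "start d", "end d",
                      "end b", "start c", "end c", "end a"]);
}
```

The per-level range keeps nesting explicit (the caller decides when to descend), while the flattened form trades that away for a single linear pass.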

Jun 25 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 6/25/16 12:26 PM, Lodovico Giaretta wrote:
 On Saturday, 25 June 2016 at 15:59:40 UTC, Jacob Carlborg wrote:
 Any range API, or plan for?

 Hi,

 I'm definitely going to provide a wrapper around the Cursor API, that
 will provide InputRange access to all the children of the current node
 (this way the tree structure is maintained). I plan to upload it in a
 couple of days.

 I'm still pondering if it's worth providing also a way to get a Range of
 the entire document tree, flattened; its nodes would be in one-to-one
 correspondence with the events generated by a SAX parser (another API
 which does not preserve the tree structure).

When I had the gumption to try and make an XML parser range, the idea I 
had was to have the current element's tag and attributes, all parent 
elements' tags and attributes, and the currently parsed entity inside 
the element. If the entity you were parsing was an element, you could 
either popFront it, to get to the next element, or descend into its 
children, and the element and its attributes would be pushed onto the 
"element" stack.

The idea is to keep all the context alive, but not have to keep the 
entire file in memory.

Anyway, that's how I envisioned it. Haven't finished my i/o package yet, 
so it didn't materialize :)

-Steve

Jun 25 2016

Lodovico Giaretta <lodovico giaretart.net> writes:

On Saturday, 25 June 2016 at 20:16:09 UTC, Steven Schveighoffer 
wrote:
 When I had the gumption to try and make an XML parser range, 
 the idea I had was to have the current element's tag and 
 attributes, all parent elements' tags and attributes, and the 
 currently parsed entity inside the element. If the entity you 
 were parsing was an element, you could either popFront it, to 
 get to the next element, or descend into its children, and the 
 element and its attributes would be pushed onto the "element" 
 stack.

 The idea is to keep all the context alive, but not have to keep 
 the entire file in memory.

 Anyway, that's how I envisioned it. Haven't finished my i/o 
 package yet, so it didn't materialize :)

 -Steve

Thank you for your feedback.

Your idea is similar to my Cursor API, which has next() to reach 
the next element and enter()/exit() to descend to the first child 
or ascend to the closing tag of the parent.

But my implementation does not maintain the state of the parents, 
so you can't get any info about the parent from the children, 
unless you use exit() (in which case, you can get the parent's 
name from the closing tag).

But your idea about a stack keeping all the context information 
is quite valuable, given that some validations need it (e.g. 
checking that all prefixes have been declared, and retrieving 
prefix/namespace associations).

Jun 25 2016

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 06/25/2016 10:33 PM, Lodovico Giaretta wrote:
 
 But my implementation does not maintain the state of the parents, so you
 can't get any info about the parent from the children, unless you use
 exit() (in which case, you can get the parent's name from the closing tag).
 
 But your idea about a stack keeping all the context information is
 quite valuable, given that some validations need it (e.g. checking
 that all prefixes have been declared, and retrieving prefix/namespace
 associations).

In any case it should be optional so that use cases that don't need
parent information can avoid the overhead.
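
One way to picture an opt-in context stack is a compile-time flag on the cursor. The sketch below is an assumption-laden illustration, not the real Cursor: `Token`, `Cursor`, `trackParents` and `parents` are hypothetical names, and the input is a pre-built token array instead of a parser. Only next(), enter() and exit() mirror the operations named in the thread.

```d
/// Hypothetical token: a start tag or an end tag.
struct Token
{
    bool isStart;
    string name;
}

/// Hypothetical cursor; the ancestor stack exists only if requested.
struct Cursor(bool trackParents)
{
    private const(Token)[] tokens;
    private size_t pos;

    static if (trackParents)
        string[] parents; // names of the ancestors of the current node

    string name() const { return tokens[pos].name; }

    /// Index of the token that closes the node at `pos`.
    private size_t endOfCurrent() const
    {
        if (!tokens[pos].isStart)
            return pos;
        int depth = 0;
        foreach (i; pos .. tokens.length)
        {
            depth += tokens[i].isStart ? 1 : -1;
            if (depth == 0)
                return i;
        }
        assert(0, "unbalanced tokens");
    }

    /// Moves to the next sibling, if any.
    bool next()
    {
        immutable after = endOfCurrent() + 1;
        if (after < tokens.length && tokens[after].isStart)
        {
            pos = after;
            return true;
        }
        return false;
    }

    /// Descends to the first child, if any.
    bool enter()
    {
        if (!tokens[pos].isStart || !tokens[pos + 1].isStart)
            return false;
        static if (trackParents)
            parents ~= tokens[pos].name;
        ++pos;
        return true;
    }

    /// Ascends to the closing tag of the parent (needs a prior enter()).
    void exit()
    {
        while (next()) {}         // skip the remaining siblings
        pos = endOfCurrent() + 1; // the parent's end token
        static if (trackParents)
            parents = parents[0 .. $ - 1];
    }
}

void main()
{
    // <a><b><d/></b><c/></a>
    auto doc = [Token(true, "a"), Token(true, "b"), Token(true, "d"),
                Token(false, "d"), Token(false, "b"), Token(true, "c"),
                Token(false, "c"), Token(false, "a")];

    auto cur = Cursor!true(doc);
    assert(cur.name == "a");
    assert(cur.enter() && cur.enter()); // now at <d>
    assert(cur.parents == ["a", "b"]);
    assert(!cur.enter());               // <d/> has no children
    cur.exit();                         // now at </b>
    assert(cur.name == "b" && cur.parents == ["a"]);
    assert(cur.next() && cur.name == "c");
    assert(!cur.next());

    // Without the flag the field does not exist, so it costs nothing.
    static assert(!__traits(hasMember, Cursor!false, "parents"));
}
```

Because the choice is made with `static if`, code that never asks for parent information pays neither the memory nor the push/pop work.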

Jun 27 2016

Lodovico Giaretta <lodovico giaretart.net> writes:

On Monday, 27 June 2016 at 22:36:36 UTC, Martin Nowak wrote:
 On 06/25/2016 10:33 PM, Lodovico Giaretta wrote:
 But your idea about a stack keeping all the context 
 information is quite valuable, given that some validations 
 need it (e.g. checking that all prefixes have been declared, 
 and retrieving prefix/namespace associations).

 In any case it should be optional so that use cases that don't 
 need parent information can avoid the overhead.

I absolutely agree on this.

Jun 28 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 6/28/16 4:04 AM, Lodovico Giaretta wrote:
 On Monday, 27 June 2016 at 22:36:36 UTC, Martin Nowak wrote:
 On 06/25/2016 10:33 PM, Lodovico Giaretta wrote:
 But your idea about a stack keeping all the context information is
 quite valuable, given that some validations need it (e.g. checking
 that all prefixes have been declared, and retrieving prefix/namespace
 associations).

 In any case it should be optional so that use cases that don't need
 parent information can avoid the overhead.

 I absolutely agree on this.

In the case of ranges, front should be accessible, which means you have 
to cache it.

It should be optional in the sense that if you are not parsing using 
nested ranges, but straight SAX, then you don't need to save state.

-Steve

Jun 28 2016

Walter Bright <newshound2 digitalmars.com> writes:

On 6/23/2016 1:04 PM, Lodovico Giaretta wrote:
 One month after the official GSoC start, I want to share with you what's in
 std.experimental.xml and what will hopefully be there

Thank you for your hard work on this important project.

Please ensure that it has a range interface - a range is used as input.

Jun 25 2016

Lodovico Giaretta <lodovico giaretart.net> writes:

On Saturday, 25 June 2016 at 20:32:33 UTC, Walter Bright wrote:
 Thank you for your hard work on this important project.

 Please ensure that it has a range interface - a range is used 
 as input.

You mean that the document source may be a range?

In that case, I have already implemented a lexer that works with 
any InputRange, and another one that works with ForwardRanges (I 
hoped it would be much faster than the first, but it currently 
isn't). Both are much slower than the two lexers based on slices, 
of course.
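
The general pattern of accepting any InputRange while keeping a faster path for slices can be sketched as follows. This is not one of the four lexers: `countOpenAngles` is a hypothetical toy function that only counts `<` characters, chosen to show the dispatch and nothing more.

```d
import std.algorithm : filter;
import std.range.primitives : empty, front, isInputRange, popFront;

/// Hypothetical toy: counts '<' characters in any character input range.
size_t countOpenAngles(R)(R input)
if (isInputRange!R)
{
    size_t n;
    static if (is(R : const(char)[]))
    {
        // Slice path: index the code units directly, no decoding,
        // no per-element range calls.
        foreach (i; 0 .. input.length)
            if (input[i] == '<')
                ++n;
    }
    else
    {
        // Generic path: only the three InputRange primitives are used.
        for (; !input.empty; input.popFront())
            if (input.front == '<')
                ++n;
    }
    return n;
}

void main()
{
    // A string is a slice, so the first branch is compiled in.
    assert(countOpenAngles("<a><b/></a>") == 3);

    // A lazily filtered range is not a slice: the generic branch is used.
    auto lazyInput = "<a> <b/> </a>".filter!(c => c != ' ');
    assert(countOpenAngles(lazyInput) == 3);
}
```

The generic branch has to go through `empty`/`front`/`popFront` for every character, which is one plausible reason range-based lexing is slower than working on a slice.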

Jun 25 2016

Nikolay <sibnick gmail.com> writes:

On Thursday, 23 June 2016 at 20:04:26 UTC, Lodovico Giaretta 
wrote:
 -- Brace yourself: a very long post is coming --

 What is planned for the near future?
 - When the DOM classes will be usable (even if not 100% 
 complete) I will start working on a DOM parser to build them 
 from the source;
 - DTD check and entity substitution have to be implemented, and 
 they will (I hope) fit nicely as pluggable components for the 
 validating cursor;


DOM - Any plans for Xpath?

DTD check - What about XSD? XSD is more popular now.

Also it would be nice to have something like JAXB (automatically 
bind and map DLang struct/classes to/from XML). But it may be 
part of next iteration or project.

PS

I found https://github.com/lodo1995/experimental.xml/issues/11, 
so you have actually already answered my questions about XSD & 
XPath.

Jun 28 2016

Lodovico Giaretta <lodovico giaretart.net> writes:

On Tuesday, 28 June 2016 at 12:14:40 UTC, Nikolay wrote:
 DOM - Any plans for Xpath?

 DTD check - What about XSD? XSD is more popular now.

 Also it would be nice to have something like JAXB 
 (automatically bind and map DLang struct/classes to/from XML). 
 But it may be part of next iteration or project.

 PS

 I found https://github.com/lodo1995/experimental.xml/issues/11, 
 so you have actually already answered my questions about XSD & 
 XPath.

About XPath and XSD, you already found the answer.

About automatic binding (JAXB like), I wasn't planning it, but 
the idea is that if we have a good, modular, extendable XML 
library, then we can build everything on top of it as separate 
projects. This is because one person cannot maintain all the 
possible extensions (XPath, XSD, JAXB, XQuery, ...) and because 
most of these should be provided by external packages, not by the 
standard library (otherwise the std XML library would end up 
being half the entire size of Phobos).

Thank you for your feedback.

Jun 29 2016

Ola Fosheim Grøstad writes:

On Wednesday, 29 June 2016 at 12:06:23 UTC, Lodovico Giaretta 
wrote:
 the idea is that if we have a good, modular, extendable XML 
 library, then we can build everything on top of it as separate 
 projects. This is because one person cannot maintain all the 
 possible extensions (XPath, XSD, JAXB, XQuery, ...)

Indeed, but maybe you would consider making it possible to either 
inject a user-defined field into the Element node or provide a 
custom Element node as a generic parameter, for book-keeping. And 
some hooks for when the DOM is built/mutated.

Jun 29 2016
