digitalmars.D - New XML parser written for D1 and D2.


Michael Rynn <michaelrynn optusnet.com.au> writes:

I have made a validating, or optionally non-validating, XML parser in
D.

It can read and parse files, external DTDs and entities with
different BOMs and encodings.

This xmlp (XmlPieceParser class) passes 100% in both validating and
non-validating modes for the following test sets: oasis, sun, xmltest
and ibm.  I have not dared to try any of the XML 1.1 or other tests.
The warnings it gives, if you choose to intercept them, for not
well-formed or non-valid documents may not necessarily be
illuminating.

My brief try of a modified std.xml against some of these tests led me
to chuck it, as I learned more about what the parser is actually
supposed to do.  This one is all my own mistakes and bad coding habits,
written from near scratch, after giving up on std.xml, and taking what
I could from std.encoding.
I have also made a front end xmlp.delegator module that emulates the
delegate callback model of std.xml.

To use it, you need a class derived from XmlParserInput, of
which there are two, StreamParserInput and StringParserInput,
in xmlp.input.  These wrap an InputRange interface (empty, front,
popFront).  The bool validate flag is false by default.

Give a new XmlPieceParser the input, and an optional base directory
path, and call nextPiece()  repeatedly to get bits of the rather
sparse  XmlTree model  defined in xmlp.xmldom. Or call the static
XmlPieceParser.ReadDocument to get the entire thing at once.
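
As a rough illustration of the pull model described above (keep asking for the next piece until there are none left), here is a toy sketch. It is not xmlp and does not use its API: ToyPuller, Piece, PieceKind and next() are hypothetical names made up for this example, and the tokenizer only understands <name>, </name> and plain text.

```d
import std.range;

enum PieceKind { startTag, endTag, text }

struct Piece {
    PieceKind kind;
    string content; // tag name, or the text itself
}

// Toy pull parser over any input range of characters (hypothetical, NOT xmlp).
// No attributes, entities, comments, validation or error reporting.
struct ToyPuller(R) if (isInputRange!R) {
    R input;

    // Fills `piece` and returns true, or returns false at end of input.
    bool next(out Piece piece) {
        if (input.empty) return false;
        string buf;
        if (input.front == '<') {
            input.popFront();
            piece.kind = PieceKind.startTag;
            if (!input.empty && input.front == '/') {
                piece.kind = PieceKind.endTag;
                input.popFront();
            }
            while (!input.empty && input.front != '>') {
                buf ~= input.front;
                input.popFront();
            }
            if (!input.empty) input.popFront(); // skip '>'
        } else {
            piece.kind = PieceKind.text;
            while (!input.empty && input.front != '<') {
                buf ~= input.front;
                input.popFront();
            }
        }
        piece.content = buf;
        return true;
    }
}

unittest {
    auto puller = ToyPuller!string("<a>hi</a>");
    Piece piece;
    string[] seen;
    while (puller.next(piece))
        seen ~= piece.content;
    assert(seen == ["a", "hi", "a"]);
}
```

The caller drives the loop and decides when to stop, which is what distinguishes this style from a callback (delegate) front end.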

This parser should be adaptable for use with Tango, as there is only
minimal dependence on Phobos.
I don't know how the Tango XML parser would cope with the W3C tests.
Any resemblance of this to the Tango XML parser will be pure
coincidence, as a brief glance at the Tango code some long time ago
left me none the wiser.

I learnt a lot of XML minutiae while getting it to parse the hundreds
of w3c test cases.  I've included the conformance test program and
scripts as one of the examples.

Some of the validation, such as the ELEMENT content particle validator,
still has wet glue and cement, and is not guaranteed to validate each
and every deterministic content model.

I am sure this release will be considered code-bloated at the
moment.  With all those test cases, some conditional coding and
variants became a bit too contrived.  After coding it for a while it
just got too big, although I do think I got better at it towards the
end.  There is some scope for shrinkage. The Windows binary with D2 of
the XmlConformance test suite runner is 

Very possibly there is a non-validating parser inside that is a fair
bit smaller than this, that could one day be created by conditional
compilation or by re-coding it.

The package has a base module name of xmlp.  I am not aiming for
std.xml as yet.  

There are of course lots of other things in the XML world
(schemas, RELAX NG, XSL, XPointer and XPath), and this parser almost
brings us to this century.

But I would like to have it made available so others can test.

Where and to whom can I post the 56 KB source code zip?

---------------------
Michael Rynn

Oct 14 2009

Nick B <nickB gmail.com> writes:

Michael Rynn wrote:
 
 I have made a validating, or optionally non-validating, XML parser in
 D.
 
 It can read and parse files, external DTDs and entities with
 different BOMs and encodings.
 

[snip]
 
 Very possibly there is a non-validating parser inside that is a fair
 bit smaller than this, that could one day be created by conditional
 compilation or by re-coding it.
 
 The package has a base module name of xmlp.  I am not aiming for
 std.xml as yet.  
 
 There are of course lots of other things in the XML world
 (schemas, RELAX NG, XSL, XPointer and XPath), and this parser almost
 brings us to this century.
 
 But I would like to have it made available so others can test.
 
 Where and to whom can I post the 56 KB source code zip?
 
 ---------------------
 Michael Rynn

Michael

I would like to suggest that you contact the Tango community on IRC, at
#D.tango (you can access them via http://webchat.freenode.net/),
and offer this code to them. You will likely need to describe the
additional functionality over the existing Tango XML parser (additional
validation etc.) and how it could be integrated into the existing Tango
code base.

regards
Nick B

Oct 14 2009

"Saaa" <empty needmail.com> writes:

Michael Rynn wrote
 I have made a validating, or optionally non-validating, XML parser in
 D.

nice

 But I would like to have it made available so others can test.

Maybe add it as an enhancement in bugzilla

 Where and to whom can I post the 56 KB source code zip?

NG attachment? Or is that considered bad manners?

Oct 14 2009

Justin Johansson <no spam.com> writes:

Saaa Wrote:

 Michael Rynn wrote
 I have made a validating, or optionally non-validating, XML parser in
 D.

 nice
 
 But I would like to have it made available so others can test.

 Maybe add it as an enhancement in bugzilla
 
 Where and to whom can I post the 56 KB source code zip?

 NG attachment? Or is that considered bad manners?

Why is there an Add File button on the NG web posting form?
Maybe that button should have an etiquette instruction beside it?

Oct 14 2009

"Saaa" <empty needmail.com> writes:

Justin Johansson wrote
 Saaa Wrote:

 Michael Rynn wrote
 I have made a validating, or optionally non-validating, XML parser in
 D.

 nice

 But I would like to have it made available so others can test.

 Maybe add it as an enhancement in bugzilla

 Where and to whom can I post the 56 KB source code zip?

 NG attachment? Or is that considered bad manners?

 Why is there an Add File button on the NG web posting form?

Never actually looked at the web posting form.

 Maybe that button should have an etiquette instruction beside it?

No, I think it is ok to use it if the button is there.
If we weren't supposed to use it some instructions would be in place.
But then I think removing the button would be the better option.

Oct 14 2009

Brad Roberts <braddr bellevue.puremagic.com> writes:

On Thu, 15 Oct 2009, Saaa wrote:

 Justin Johansson wrote
 Saaa Wrote:

 Michael Rynn wrote
 I have made a validating, or optionally non-validating, XML parser in
 D.

 nice

 But I would like to have it made available so others can test.

 Maybe add it as an enhancement in bugzilla

 Where and to whom can I post the 56 KB source code zip?

 NG attachment? Or is that considered bad manners?

 Why is there an Add File button on the NG web posting form?

 Never actually looked at the web posting form.
 
 Maybe that button should have an etiquette instruction beside it?

 No, I think it is ok to use it if the button is there.
 If we weren't supposed to use it some instructions would be in place.
 But then I think removing the button would be the better option.

Given that code is almost always undergoing change, posting to a forum (be 
it ng, email, web based forums, whatever) seems like a less than ideal 
path to go down.  Put it on dsource, or a website, or anywhere that's easy 
to update without having to push lots of data to lots of people.

Just a thought,
Brad

Oct 14 2009

Walter Bright <newshound1 digitalmars.com> writes:

Saaa wrote:
 Michael Rynn wrote
 I have made a validating, or optionally non-validating, XML parser in
 D.

 nice
 
 But I would like to have it made available so others can test.

 Maybe add it as an enhancement in bugzilla
 
 Where and to whom can I post the 56 KB source code zip?

 NG attachment? Or is that considered bad manners?

Yeah, it's bad manners <g>.

I recommend dsource.org.

Oct 14 2009

"Saaa" <empty needmail.com> writes:

Michael Rynn wrote
 Where and to whom can I post the 56 KB source code zip?

Attaching it to an enhancement in bugzilla would be best, I think.

Oct 14 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Saaa wrote:
 Michael Rynn wrote
 Where and to whom can I post the 56 KB source code zip?

 Attaching it to an enhancement in bugzilla would be best, I think. 

Yes please. Making the code work with ranges as input would be great.

Andrei

Oct 14 2009

Justin Johansson <no spam.com> writes:

Andrei Alexandrescu Wrote:

 Saaa wrote:
 Michael Rynn wrote
 Where and to whom can I post the 56 KB source code zip?

 Attaching it to an enhancement in bugzilla would be best, I think. 

 
 Yes please. Making the code work with ranges as input would be great.
 
 Andrei

Hi Andrei,

Still being a D apprentice and not 100% conversant with D terminology yet, I
assume, and not wanting to make an *ass* out of *u* and *me* :-), that by
"ranges" you mean making use of D sub char[] arrays over the input so as to
minimize/obviate the need to allocate lots of small(er) strings to hold
element tagnames, attribute names and values, text node contents and so on.

This assumption being correct, can you confirm or otherwise that the
consequence of such a design would mean that by parsing, say, a 1MB XML
in-memory document, constructing a node tree from the same and having the
nodes directly referencing substrings in the input document via string
"ranges", the entire 1MB would be locked into memory by the GC and not
collectable until the node tree itself is done with?

Now I might be completely off track;  perhaps instead you are thinking of SAX
style parsing and passing arguments to the SAX event handling function via
the said ranges.  In this scenario I guess the SAX client could decide
whether or not to .dup the ranges.

Depending on your clarification, I may have further comment based upon my
practical experience in the XML domain.

Regards

Justin Johansson

Oct 14 2009

Jeremie Pelletier <jeremiep gmail.com> writes:

Justin Johansson wrote:
 Andrei Alexandrescu Wrote:
 
 Saaa wrote:
 Michael Rynn wrote
 Where and to whom can I post the 56 KB source code zip?

 Attaching it to an enhancement in bugzilla would be best, I think. 

 Yes please. Making the code work with ranges as input would be great.

 Andrei

 
 Hi Andrei,
 
 Still being a D apprentice and not 100% conversant with D terminology yet, I
 assume, and not wanting to make an *ass* out of *u* and *me* :-), that by
 "ranges" you mean making use of D sub char[] arrays over the input so as to
 minimize/obviate the need to allocate lots of small(er) strings to hold
 element tagnames, attribute names and values, text node contents and so on.

He meant range structs as found in std.range and their array wrappers in 
std.array.

 This assumption being correct, can you confirm or otherwise that the
 consequence of such a design would mean that by parsing, say, a 1MB XML
 in-memory document, constructing a node tree from the same and having the
 nodes directly referencing substrings in the input document via string
 "ranges", the entire 1MB would be locked into memory by the GC and not
 collectable until the node tree itself is done with?

That is not the goal of ranges; a memory-mapped file would be more
efficient for what you describe.

A range is D's version of streams, so for example a simple reader might 
look like:

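import std.range;

// The range is taken by value, not as `in` (const), because popFront()
// mutates it.
void read(T)(T range) if (isInputRange!T) {
	while (!range.empty) {
		auto elem = range.front;
		// process element
		range.popFront();
	}
}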

The range implementation can be a simple 'string', a 'char[]', or a 
custom network channel that blocks on front() if the data is still loading.
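
A minimal sketch of that last point, assuming a hypothetical ChunkedChars struct that stands in for data arriving in pieces (it does no real I/O and never blocks). The names ChunkedChars and countOpenAngles are made up for this example; the only library facility used is isInputRange from std.range.

```d
import std.range;

// Hypothetical source: hands out characters from a list of chunks,
// as a stand-in for data that becomes available piece by piece.
struct ChunkedChars {
    private string[] chunks; // remaining chunks; chunks[0] is the current one
    private size_t pos;      // read position inside chunks[0]

    this(string[] chunks) {
        this.chunks = chunks;
        skipExhausted();
    }

    @property bool empty() const { return chunks.length == 0; }
    @property char front() const { return chunks[0][pos]; }

    void popFront() {
        ++pos;
        skipExhausted();
    }

    // Drop chunks that are fully consumed (or were empty to begin with).
    private void skipExhausted() {
        while (chunks.length && pos >= chunks[0].length) {
            chunks = chunks[1 .. $];
            pos = 0;
        }
    }
}

static assert(isInputRange!ChunkedChars);

// Generic consumer: counts '<' characters in any input range of characters.
size_t countOpenAngles(R)(R range) if (isInputRange!R) {
    size_t n;
    for (; !range.empty; range.popFront())
        if (range.front == '<')
            ++n;
    return n;
}

unittest {
    // The same function accepts a plain string and the custom range.
    assert(countOpenAngles("<a><b/></a>") == 3);
    assert(countOpenAngles(ChunkedChars(["<a>", "", "<b/", "></a>"])) == 3);
}
```

The consumer only relies on empty, front and popFront, so it does not care where the characters come from.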

 Now I might be completely off track;  perhaps instead you are thinking of SAX
 style parsing and passing arguments to the SAX event handling function via
 the said ranges.  In this scenario I guess the SAX client could decide
 whether or not to .dup the ranges.

I think you confuse ranges with slices. Ranges are simply an interface 
for sequential or random data access. DOM trees and SAX callbacks are 
different methods of parsing the xml, a range is a method of accessing 
the data :)
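
For contrast, a small sketch of what a slice is. The helper firstText is a hypothetical name made up for this example. A slice is a view into the original buffer, so a node that stores one shares memory with the document, and with D's garbage collector a live slice into a GC-allocated buffer keeps that whole buffer reachable; .dup makes an independent copy instead.

```d
// Returns the text between the first '>' and the following '<'
// as a slice of doc. No characters are copied.
inout(char)[] firstText(inout(char)[] doc) {
    size_t start = 0;
    while (start < doc.length && doc[start] != '>') ++start;
    if (start == doc.length) return doc[$ .. $]; // no '>' found: empty slice
    ++start;
    size_t end = start;
    while (end < doc.length && doc[end] != '<') ++end;
    return doc[start .. end];
}

unittest {
    char[] doc = "<name>Ada</name>".dup;
    char[] view = firstText(doc); // slice: shares memory with doc
    char[] copy = view.dup;       // independent copy

    doc[6] = 'I';
    assert(view == "Ida");        // the slice sees the change
    assert(copy == "Ada");        // the copy does not
}
```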

Speaking of SAX, do we have a D implementation yet? If not I could write 
one, it sounds fun.

 Depending on your clarification, I may have further comment based upon my
 practical experience in the XML domain.
 
 Regards
 
 Justin Johansson

Oct 14 2009

Justin Johansson <no spam.com> writes:

Jeremie Pelletier Wrote:

 He meant range structs as found in std.range and their array wrappers in 
 std.array.

Oh, okay.  Just grokked the source and it looks like it is a D2-only thing.
Do you happen to know what the derivation of the word "range" with respect
to streams is?  I haven't come across it used in this context before.

 A range is D's version of streams, so for example a simple reader might 
 look like:
 
 void read(T)(T range) if (isInputRange!T) {
 	while (!range.empty) {
 		auto elem = range.front;
 		// process element
 		range.popFront();
 	}
 }

 
 I think you confuse ranges with slices. Ranges are simply an interface 
 for sequential or random data access. DOM trees and SAX callbacks are 
 different methods of parsing the xml, a range is a method of accessing 
 the data :)

Yes, seems that way; my question was apparently based on D1 knowledge only.

Re SAX, it's easy enough to get James Clark's Expat 'C' parser happening with
D.  That has an event-based API.  Perhaps all the std D library needs to do is
wrap this.
Whilst it's open source, dunno about the specific licensing issues though.

-- JJ

Oct 14 2009

Jeremie Pelletier <jeremiep gmail.com> writes:

Justin Johansson wrote:
 Jeremie Pelletier Wrote:
 
 He meant range structs as found in std.range and their array wrappers in 
 std.array.

 
 Oh, okay.  Just grokked the source and it looks like it is a D2-only thing.
 Do you happen to know what the derivation of the word "range" with respect
 to streams is?  I haven't come across it used in this context before.

I don't know where the word range comes from, sorry. I see them as
streams because they work just the same, except for the different method
names (i.e. front/back and put instead of read and write respectively).

 A range is D's version of streams, so for example a simple reader might 
 look like:

 void read(T)(T range) if (isInputRange!T) {
 	while (!range.empty) {
 		auto elem = range.front;
 		// process element
 		range.popFront();
 	}
 }

  
 I think you confuse ranges with slices. Ranges are simply an interface 
 for sequential or random data access. DOM trees and SAX callbacks are 
 different methods of parsing the xml, a range is a method of accessing 
 the data :)

 
 Yes, seems that way; my question was apparently based on D1 knowledge only.
 
 Re SAX, it's easy enough to get James Clark's Expat 'C' parser happening with
 D.  That has an event-based API.  Perhaps all the std D library needs to do is
 wrap this.
 Whilst it's open source, dunno about the specific licensing issues though.
 
 -- JJ
 

Isn't expat slower than libxml2's SAX? Anyways I'd rather code a SAX 
module in D, if only to better know the internals of this method.

Jeremie

Oct 14 2009

Justin Johansson <no spam.com> writes:

Jeremie Pelletier Wrote:

 Justin Johansson wrote:
 Jeremie Pelletier Wrote:
 
 He meant range structs as found in std.range and their array wrappers in 
 std.array.

 
 Oh, okay.  Just grokked the source and it looks like it is a D2-only thing.
 Do you happen to know what the derivation of the word "range" with respect
 to streams is?  I haven't come across it used in this context before.

 
 I don't know where the word range comes from, sorry. I see them as
 streams because they work just the same, except for the different method
 names (i.e. front/back and put instead of read and write respectively).
 
 A range is D's version of streams, so for example a simple reader might 
 look like:

 void read(T)(T range) if (isInputRange!T) {
 	while (!range.empty) {
 		auto elem = range.front;
 		// process element
 		range.popFront();
 	}
 }

  
 I think you confuse ranges with slices. Ranges are simply an interface 
 for sequential or random data access. DOM trees and SAX callbacks are 
 different methods of parsing the xml, a range is a method of accessing 
 the data :)

 
 Yes, seems that way; my question was apparently based on D1 knowledge only.
 
 Re SAX, it's easy enough to get James Clark's Expat 'C' parser happening with
 D.  That has an event-based API.  Perhaps all the std D library needs to do is
 wrap this.
 Whilst it's open source, dunno about the specific licensing issues though.
 
 -- JJ
 

 
 Isn't expat slower than libxml2's SAX? Anyways I'd rather code a SAX 
 module in D, if only to better know the internals of this method.

I don't know if it is slower or not .. just that it is reliable and deals
with all the subtleties of XML, and coming from James Clark (one cool dude),
I suspect he probably got it right.  Oh, and it's written in agnostic C.

It's easy enough though to write an XML parser and SAX-like interface in an
afternoon, maybe flowing into the evening, and get 98% of the way there.
The other 2% will take you a month of Sundays.

That's the problem with XML: it's deceptively simple.  Apparently Sun
cannot even get it right, or just didn't bother.  Elliotte Rusty Harold has
done a pretty good job with XOM; it's held in very high regard in the
Java/XML community.

As Michael Rynn (who started this thread) can vouch, if you are looking for
a role model, std.xml (in D2) is *not* it.

Cheers
Justin Johansson

Oct 14 2009

Jason House <jason.james.house gmail.com> writes:

Justin Johansson Wrote:

 Jeremie Pelletier Wrote:
 
 He meant range structs as found in std.range and their array wrappers in 
 std.array.

 
 Oh, okay.  Just grokked the source and it looks like it is a D2-only thing.
 Do you happen to know what the derivation of the word "range" with respect
 to streams is?  I haven't come across it used in this context before.

If you're familiar with C++, that's easy.  Ranges are a generalization of a
pair of iterators. If that doesn't make sense, think of an iterator as a read
cursor in a stream/array/data structure. Scanning safely with the cursor
requires a starting point and an end point. Andrei has a video titled
"Iterators Must Go".
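
To make the "pair of iterators" picture concrete, here is a minimal sketch that bundles a begin and an end cursor (plain pointers) into one object exposing the range primitives. PointerRange, pointerRange and sumAll are hypothetical names made up for this example; in practice a D array slice already plays this role.

```d
import std.range;

// Hypothetical illustration: the two cursors a C++ algorithm would take
// as separate arguments, packaged together as one range.
struct PointerRange(T) {
    T* begin; // read cursor
    T* end;   // one past the last element

    @property bool empty() const { return begin == end; }
    @property ref T front() { return *begin; }
    void popFront() { ++begin; }
}

// Builds the cursor pair from an array.
PointerRange!T pointerRange(T)(T[] arr) {
    return PointerRange!T(arr.ptr, arr.ptr + arr.length);
}

static assert(isInputRange!(PointerRange!int));

// Generic consumer: works on any input range of ints.
int sumAll(R)(R r) if (isInputRange!R) {
    int acc;
    for (; !r.empty; r.popFront())
        acc += r.front;
    return acc;
}

unittest {
    int[] data = [1, 2, 3, 4];
    assert(sumAll(pointerRange(data)) == 10);
    assert(sumAll(data) == 10); // a slice is already a range
}
```

Because the end point travels with the cursor, a caller cannot pass a mismatched begin/end pair by accident.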

Oct 14 2009
