digitalmars.D.announce - dxml 0.2.0 released

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
dxml 0.2.0 has now been released.

I really wasn't planning on releasing anything this quickly after announcing
dxml, but when I went to start working on DOM support, it turned out to be
surprisingly quick and easy to implement. So, dxml now has basic DOM
support.

As part of that, it became clear that dxml.parser.stax should be renamed to
dxml.parser, since it's really the only parser (DOM support involves just
providing a way to hold the results of the parser, not any actual parsing,
and that's clear from the API rather than being an implementation detail),
and it makes for a shorter import path. So, I figured that I should do a
release sooner rather than later to reduce how many folks the rename ends up
affecting.

For this release, dxml.parser.stax is now an empty, deprecated, module that
publicly imports dxml.parser, but it will be removed in 0.3.0, whenever that
is released. So, the few folks who grabbed the initial release won't end up
with immediate code breakage if they upgrade.

One nice side effect of how I implemented DOM support is that it's trivial
to get the DOM for a portion of an XML document rather than the entire
thing, since it will produce a DOMEntity from any point in an EntityRange.

Documentation: http://jmdavisprog.com/docs/dxml/0.2.0/
Github: https://github.com/jmdavis/dxml/tree/v0.2.0
Dub: http://code.dlang.org/packages/dxml

- Jonathan M Davis
Feb 11 2018
Aravinda VK <mail aravindavk.in> writes:
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis 
wrote:
 dxml 0.2.0 has now been released.

 I really wasn't planning on releasing anything this quickly 
 after announcing dxml, but when I went to start working on DOM 
 support, it turned out to be surprisingly quick and easy to 
 implement. So, dxml now has basic DOM support.

 As part of that, it became clear that dxml.parser.stax should 
 be renamed to dxml.parser, since it's really the only parser 
 (DOM support involves just providing a way to hold the results 
 of the parser, not any actual parsing, and that's clear from 
 the API rather than being an implementation detail), and it 
 makes for a shorter import path. So, I figured that I should do 
 a release sooner rather than later to reduce how many folks the 
 rename ends up affecting.

 For this release, dxml.parser.stax is now an empty, deprecated, 
 module that publicly imports dxml.parser, but it will be 
 removed in 0.3.0, whenever that is released. So, the few folks 
 who grabbed the initial release won't end up with immediate 
 code breakage if they upgrade.

 One nice side effect of how I implemented DOM support is that 
 it's trivial to get the DOM for a portion of an XML document 
 rather than the entire thing, since it will produce a DOMEntity 
 from any point in an EntityRange.

 Documentation: http://jmdavisprog.com/docs/dxml/0.2.0/
 Github: https://github.com/jmdavis/dxml/tree/v0.2.0
 Dub: http://code.dlang.org/packages/dxml

 - Jonathan M Davis
Awesome. Just tried it now as below and it works. Thanks for this library.

    import std.stdio;
    import dxml.dom;

    struct Record
    {
        string name;
        string email;
    }

    Record[] parseRecords(string xml)
    {
        Record[] records;
        auto d = parseDOM!simpleXML(xml);
        auto root = d.children[0];
        foreach(record; root.children)
        {
            auto rec = Record();
            foreach(ele; record.children)
            {
                if (ele.name == "name")
                    rec.name = ele.children[0].text;
                if (ele.name == "email")
                    rec.email = ele.children[0].text;
            }
            records ~= rec;
        }
        return records;
    }

    void main()
    {
        auto xml = "<root>\n" ~
            " <record>\n" ~
            " <name>N1</name>\n" ~
            " <email>E1</email>\n" ~
            " </record>\n" ~
            " <record>\n" ~
            " <name>N2</name>\n" ~
            " <email>E2</email>\n" ~
            " </record>\n" ~
            " <record>\n" ~
            " <email>E3</email>\n" ~
            " <name>N3</name>\n" ~
            " </record>\n" ~
            "<!--no comment -->\n" ~
            "</root>";
        auto records = parseRecords(xml);
        writeln(records);
    }
Feb 11 2018
Chris <wendlec tcd.ie> writes:
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis 
wrote:
 dxml 0.2.0 has now been released.

 I really wasn't planning on releasing anything this quickly 
 after announcing dxml, but when I went to start working on DOM 
 support, it turned out to be surprisingly quick and easy to 
 implement. So, dxml now has basic DOM support.

 [...]
Will this replace `std.xml` one day?
Feb 12 2018
rikki cattermole <rikki cattermole.co.nz> writes:
On 12/02/2018 12:38 PM, Chris wrote:
 On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
 dxml 0.2.0 has now been released.

 I really wasn't planning on releasing anything this quickly after 
 announcing dxml, but when I went to start working on DOM support, it 
 turned out to be surprisingly quick and easy to implement. So, dxml 
 now has basic DOM support.

 [...]
Will this replace `std.xml` one day?
As long as DTD support is essentially non-existent, my vote will always be no.
Feb 12 2018
Chris <wendlec tcd.ie> writes:
On Monday, 12 February 2018 at 12:49:30 UTC, rikki cattermole 
wrote:
 On 12/02/2018 12:38 PM, Chris wrote:
 On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis 
 wrote:
 dxml 0.2.0 has now been released.

 I really wasn't planning on releasing anything this quickly 
 after announcing dxml, but when I went to start working on 
 DOM support, it turned out to be surprisingly quick and easy 
 to implement. So, dxml now has basic DOM support.

 [...]
Will this replace `std.xml` one day?
As long as DTD support is essentially non-existent, my vote will always be no.
How hard would it be to add DTD support? One could take dxml and extend it in order to include it in Phobos. I haven't used `std.xml` for years now. It is essentially dead and unusable atm.
Feb 12 2018
rikki cattermole <rikki cattermole.co.nz> writes:
On 12/02/2018 1:51 PM, Chris wrote:
 On Monday, 12 February 2018 at 12:49:30 UTC, rikki cattermole wrote:
 On 12/02/2018 12:38 PM, Chris wrote:
 On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
 dxml 0.2.0 has now been released.

 I really wasn't planning on releasing anything this quickly after 
 announcing dxml, but when I went to start working on DOM support, it 
 turned out to be surprisingly quick and easy to implement. So, dxml 
 now has basic DOM support.

 [...]
Will this replace `std.xml` one day?
As long as DTD support is essentially non-existent, my vote will always be no.
How hard would it be to add DTD support? One could take dxml and extend it in order to include it in Phobos. I haven't used `std.xml` for years now. It is essentially dead and unusable atm.
From what I read in the other thread, it would require a complete redesign and a major performance hit.

I don't care what J.M.D. puts in his own library. We just can't advertise having an 'XML' library when we outright ignore a large portion of the specification (and one that is fairly important to real-world adoption, IMO) for no other reason than the personal opinions of the author.

Now, if you want a subset as the 'default' but with full support, including DTD, as an opt-in where the only difference is how you initialize the parser, I'd be happy, and so will our end users in the future.
Feb 12 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, February 12, 2018 12:38:51 Chris via Digitalmars-d-announce 
wrote:
 On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis

 wrote:
 dxml 0.2.0 has now been released.

 I really wasn't planning on releasing anything this quickly
 after announcing dxml, but when I went to start working on DOM
 support, it turned out to be surprisingly quick and easy to
 implement. So, dxml now has basic DOM support.

 [...]
Will this replace `std.xml` one day?
Maybe. That depends on community feedback and ultimately on the Phobos review process. Assuming that there's support for putting it through the Phobos review process, then once I feel that it's complete enough and has had enough use to make it clear that I didn't miss something critical, I'll submit it for review.

What little feedback there has been thus far has been positive, but it would be nice to get it battle-tested a bit, and there is still functionality that I need to add. Given that std.xml needs to be replaced, I think that it would be good if dxml were able to do that, but that depends heavily on what others think of what I've done and what they think Phobos' xml solution should look like. But the way things are going, if dxml doesn't replace std.xml, I don't know that anything ever will. XML parsers are one of those things that everyone seems to want and no one seems to want to work on.

However, if folks as a whole think that Phobos' xml parser needs to support the DTD section to be acceptable, then dxml won't replace std.xml, because dxml is not going to implement DTD support. DTD support fundamentally does not fit in with dxml's design. Someone would basically have to write an entirely new parser to be able to handle it (some of dxml's internals could be reused, but they'd also have to be refactored a fair bit, and a ton of extra stuff would have to be added). Such a parser could theoretically coexist with dxml's parser, since each would provide its own advantages, but I have no plans to implement an XML parser to handle the DTD section. It's simply not worth my time or effort, and this project has already taken way more time and effort than I anticipated.

However, std.xml does not support the DTD section, and glancing over it, it doesn't look like it even handles skipping the DTD section properly (it doesn't handle the fact that '>' can appear within quoted sections within the DTD).
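To make the quoting problem concrete, here is a minimal sketch of skipping a DOCTYPE declaration without stopping at a '>' that sits inside a quoted literal. It is not taken from std.xml or dxml; the function name `skipDoctype` is made up for this illustration, and it deliberately ignores comments and processing instructions inside the DTD.

```d
import std.string : indexOf;

/// Returns the index just past the '>' that ends a DOCTYPE declaration
/// beginning at text[0], or -1 if the declaration is never closed.
/// Simplification (assumption of this sketch): comments and processing
/// instructions inside the DTD are not handled.
ptrdiff_t skipDoctype(string text)
{
    char quote = 0;  // the open quote character, or 0 outside a literal
    int depth = 0;   // greater than 0 while inside the [ ... ] internal subset

    foreach (i, char c; text)
    {
        if (quote != 0)
        {
            if (c == quote)
                quote = 0;            // closing quote
        }
        else if (c == '"' || c == '\'')
            quote = c;                // opening quote
        else if (c == '[')
            ++depth;
        else if (c == ']')
            --depth;
        else if (c == '>' && depth == 0)
            return cast(ptrdiff_t) (i + 1);
    }
    return -1;
}

unittest
{
    auto doc = `<!DOCTYPE r [<!ENTITY gt ">">]><r/>`;

    // Stopping at the first '>' lands inside the quoted entity value.
    assert(doc[doc.indexOf('>') + 1 .. $] == `">]><r/>`);

    // Tracking quotes and brackets finds the real end of the DTD.
    assert(doc[skipDoctype(doc) .. $] == "<r/>");
}
```

The unittest can be run with something like `dmd -unittest -main -run file.d`.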
So, dxml is not worse than std.xml in that regard, and we wouldn't lose any functionality by having dxml replace std.xml. It just wouldn't necessarily do as much as some folks might like.

My guess is that DTD support won't be a deal breaker given that std.xml doesn't support it, that std.xml has needed to be replaced for years now, and that no one else is working on replacing it, but I don't know. Disagreements over what should be done with std.json's replacement have meant that it has never been replaced even though significant work was done towards replacing it, so unfortunately, there's already precedent for a module not being replaced with something better due to disagreements over what the replacement would ideally be. So, I don't know.

- Jonathan M Davis
Feb 12 2018
Chris <wendlec tcd.ie> writes:
On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis 
wrote:
 On Monday, February 12, 2018 12:38:51 Chris via 
 Digitalmars-d-announce wrote:
 On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis
 However, std.xml does not support the DTD section, and glancing 
 over it, it doesn't look like it even handles skipping the DTD 
 section properly (it doesn't handle the fact that '>' can 
 appear within quoted sections within the DTD). So, dxml is not 
 worse than std.xml in that regard, and we wouldn't lose any 
 functionality by having dxml replace std.xml. It just wouldn't 
 necessarily do as much as some folks might like.
I thought the same when I glanced over std.xml. There's no DTD support there either and I don't think it would be a deal breaker for most users.
 My guess is that DTD support won't be a deal breaker given that 
 std.xml doesn't support it, that std.xml has needed to be 
 replaced for years now, and that no one else is working on 
 replacing it, but I don't know. Disagreements over what should 
 be done with std.json's replacement has meant that it has never 
 been replaced even though significant work was done towards 
 replacing it, so unfortunately, there's already precedence for 
 a module not being replaced with something better due to 
 disagreements over what the replacement would ideally be. So, I 
 don't know.

 - Jonathan M Davis
Wasn't there a replacement module that never got past the initial review steps? Some GSoC thing or so. But I wonder if that module would be up to the latest D standards.

While one may argue that DTD support is important, I would rather have something fast and simple like dxml that covers, say, 90% of the cases than nothing. It doesn't make sense to me that we should accept the current situation, only because of some bikeshedding that concerns 10% of the use cases. After all, it's only a module, not a fundamental decision that concerns the direction D will take in the future. I think stuff like that can seriously turn off potential users.

A lot of useful things begin with one person deciding to give it a go. vibe.d, dub, DScanner and DlangUI, for example. If the creators had started bikeshedding before writing the first line of code, there would still be a flamewar about the best way to go about it - and nothing would have happened.
Feb 12 2018
rikki cattermole <rikki cattermole.co.nz> writes:
On 12/02/2018 2:45 PM, Chris wrote:
 On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis wrote:
 On Monday, February 12, 2018 12:38:51 Chris via Digitalmars-d-announce 
 wrote:
 On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis
 However, std.xml does not support the DTD section, and glancing over 
 it, it doesn't look like it even handles skipping the DTD section 
 properly (it doesn't handle the fact that '>' can appear within quoted 
 sections within the DTD). So, dxml is not worse than std.xml in that 
 regard, and we wouldn't lose any functionality by having dxml replace 
 std.xml. It just wouldn't necessarily do as much as some folks might 
 like.
I thought the same when I glanced over std.xml. There's no DTD support there either and I don't think it would be a deal breaker for most users.
 My guess is that DTD support won't be a deal breaker given that 
 std.xml doesn't support it, that std.xml has needed to be replaced for 
 years now, and that no one else is working on replacing it, but I 
 don't know. Disagreements over what should be done with std.json's 
 replacement has meant that it has never been replaced even though 
 significant work was done towards replacing it, so unfortunately, 
 there's already precedence for a module not being replaced with 
 something better due to disagreements over what the replacement would 
 ideally be. So, I don't know.

 - Jonathan M Davis
Wasn't there a replacement module that never got past the initial review steps? Some GSoC thing or so. But I wonder if that module would be up to the latest D standards.
https://github.com/dlang-community/experimental.xml

Code isn't great, and not complete yet. Author has just disappeared sadly.
 While one may argue that DTD support is important, I would rather have 
 something fast and simple like dxml that covers, say, 90% of the cases 
 than nothing. It doesn't make sense to me that we should accept the 
 current situation, only because of some bikeshedding that concerns 10% 
 of the use cases. After all, it's only a module not a fundamental 
 decision that concerns the direction D will take in the future. I think 
 stuff like that can seriously turn off potential users. A lot of useful 
 things begin with one person deciding to give it a go. vibe.d, dub, 
 DScanner and DlangUI, for example. If the creators had started 
 bikeshedding before writing the first line of code, there would still be 
 a flamewar about the best way to go about it - and nothing would have 
 happened.
Everything you have mentioned is not in Phobos.

Just because something is 'good enough' does not make it 'good enough' for Phobos. In the words of Andrei "Good enough is not good enough", we need to aim higher to show what we actually can do.

Personally I find J.M.D.'s arguments quite reasonable for a third-party library, since yes it does cover 90% of the use cases.
Feb 12 2018
Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 12 February 2018 at 14:54:48 UTC, rikki cattermole 
wrote:
 Just because something is 'good enough' does not make it 'good 
 enough' for Phobos. In the words of Andrei "Good enough is not 
 good enough", we need to aim higher to show what we actually 
 can do.
About 5 years ago (I think, I actually have the link on my other computer but it is 2,000 miles away right now), Andrei said something along the lines of "without the review process, we get junk like std.json".

Ironically, that same review process may be why we still have such "junk". (actually personally, I don't hate std.json).

If std.xml is really so bad and has been for so long, surely we ought to take an opportunity to change that, even if the change isn't perfect.
Feb 12 2018
rikki cattermole <rikki cattermole.co.nz> writes:
On 12/02/2018 3:08 PM, Adam D. Ruppe wrote:
 On Monday, 12 February 2018 at 14:54:48 UTC, rikki cattermole wrote:
 Just because something is 'good enough' does not make it 'good enough' 
 for Phobos. In the words of Andrei "Good enough is not good enough", 
 we need to aim higher to show what we actually can do.
About 5 years ago (I think, I actually have the link on my other computer but it is 2,000 miles away right now), Andrei said something along the lines of "without the review process, we get junk like std.json". Ironically, that same review process may be why we still have such "junk". (actually personally, I don't hate std.json). If std.xml is really so bad and has been for so long, surely we ought to take an opportunity to change that, even if the change isn't perfect.
It depends. The implementation does not need to be perfect or full-fledged to go into experimental. But if at the start of the review process it is already well known that the public API would require a complete change to accommodate the intended goal, it is unacceptable.

Take std.experimental.allocator as an example. It currently is going through a massive API change, but when it first got PR'd, did we know that we should be RC'ing allocators? No of course not, otherwise we'd have done it.

At this point in time I cannot say in good faith that dxml serves to represent the XML specification for the D community in full.

This is unfortunately not about bike shedding. It is one thing to bike shed features, but when scope does not match the intended goal, we have got to be careful about what goes into Phobos.

All J.M.D. has to do to change this is make the API match the spec (as close as possible, without writing another parser) and separate out the implementation into a different and very clear module (probably a sub package) which states clearly that it is a subset, with the full grammar that it supports listed. That way everybody is clear and we can later on get a full implementation as part of taking it out of experimental :)
Feb 12 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, February 12, 2018 15:26:24 rikki cattermole via Digitalmars-d-
announce wrote:
 All J.M.D. has to do to change this, is make the API match the spec (as
 close as possible, without writing another parser) and separate out the
 implementation into a different and very clear module (probably a sub
 package) which states clearly that it is a subset with the full grammar
 listed that it supports.
That literally cannot be done. dxml returns slices (or takeExactly's) of the original input. For it to do otherwise would harm performance and usability, but in order to implement full DTD support, it's impossible to return slices of the original input in the general case, because you have to be able to mutate the data whenever entity references get involved. If the API were entirely string-based, then whether the implementation returned slices or newly allocated strings could be an implementation detail, but as soon as you're dealing with arbitrary ranges of characters, that doesn't work.

At that point, you're forced to either return strings for everything (which means allocating for any ranges that aren't strings) or to return a lazy range of characters and thus can't return the original type. And that means that if you pass it a string, you're stuck with a lazy range out the other end instead of a string, and to get a string again, you have to allocate, whereas with what I have now, the parser does almost no allocations, and as long as the input type supports slicing, you get exactly the same type out the other end, which is a huge usability improvement IMHO.

So, you can't have DTD support with the kind of API that dxml has, and changing the API to something that could work with DTD support would harm the parser for all of the cases where DTD support is unnecessary. Even if I were going to implement full DTD support, I would do it with another parser, not change the parser that dxml already has. And if dxml ends up in Phobos with the parser that it has, that doesn't prevent another parser from being added for the DTD case later if someone actually decides to put in the time and effort to do it. Either way, for any XML document that doesn't need DTD support, the way that dxml does things is more efficient and user-friendly than one that had DTD support would be, much as that obviously doesn't cut it for those documents that do need DTD support.
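The slicing point can be shown with plain strings and nothing from dxml. This is only a sketch: the entity `co`, its replacement text, and the helper `expandEntities` are invented for the example, and real entity expansion involves more than a textual replace.

```d
import std.array : replace;

/// Expands references to the given entities. The table stands in for
/// declarations that a DTD would supply (hypothetical for this sketch).
string expandEntities(string text, string[string] entities)
{
    foreach (name, value; entities)
        text = text.replace("&" ~ name ~ ";", value);
    return text;
}

unittest
{
    string input = "<p>Made by &co;</p>";

    // Without expansion, the element's text can be returned as a slice:
    // no copy is made, and it points into the original input.
    string raw = input[3 .. 15];
    assert(raw == "Made by &co;");
    assert(raw.ptr == input.ptr + 3);

    // Hypothetical declaration: <!ENTITY co "Example Corp">
    // The expanded text appears nowhere in the input, so producing it
    // as a string requires a fresh allocation.
    string expanded = expandEntities(raw, ["co": "Example Corp"]);
    assert(expanded == "Made by Example Corp");
    assert(expanded.ptr != raw.ptr);
}
```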
In any case, I'm going to finish implementing dxml without any kind of DTD support and then see how things go as far as the Phobos review process goes. If dxml gets rejected, because the majority of folks think that we're better off with std.xml (or no xml parser at all in Phobos) than one that doesn't have DTD support, then oh well. That sucks, but anyone who wants dxml can then use it as a 3rd party library. I think that the D community would be worse off because of that, but it's not ultimately my decision to make, and either way, I have the parser that I need.

- Jonathan M Davis
Feb 12 2018
rikki cattermole <rikki cattermole.co.nz> writes:
On 12/02/2018 3:50 PM, Jonathan M Davis wrote:
 In any case, I'm going to finish implementing dxml without any kind of DTD
 support and then see how things go as far as the Phobos review process goes.
 If dxml gets rejected, because the majority of folks think that we're better
 off with std.xml (or no xml parser at all in Phobos) than one that doesn't
 have DTD support, then oh well. That sucks, but anyone who wants dxml can
 then use it as a 3rd party library. I think that the D community would be
 worse off because of that, but it's not ultimately my decision to make, and
 either way, I have the parser that I need.
We are definitely not better off with just std.xml currently. The problem comes from the word currently. By going into Phobos, even if experimental, it's going to be around for a while in some form or another. So we need to invest a decent amount of time into not creating more problems for new users expecting the world and not getting it.

If somebody (say a student?) were to write up a proper API and use dxml as a basis for a simpler parser, now that could be a worthwhile project and definitely could go into Phobos. I may even consider doing it at some point in the future.
Feb 12 2018
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Feb 12, 2018 at 02:54:48PM +0000, rikki cattermole via
Digitalmars-d-announce wrote:
[...]
 Everything you have mentioned is not in Phobos. Just because something
 is 'good enough' does not make it 'good enough' for Phobos. In the
 words of Andrei "Good enough is not good enough", we need to aim
 higher to show what we actually can do.
And thus Phobos continues to let the perfect be the enemy of the good, and 10 years later std.xml will still be around, and we will still be arguing over how to replace it.
 Personally I find J.M.D. arguments quite reasonable for a third-party
 library, since yes it does cover 90% of the use cases.
As I have just said in another post, dxml itself does not need to be changed to implement DTD support. It's perfectly possible to write a wrapper on top of it that *does* implement DTD support. In fact, I dare say it might be possible to lazily switch from a thin wrapper over dxml to full DTD mode, so that end users don't even need to care about the difference if they don't care to.

As far as API is concerned, it could be as simple as something like:

    auto parseXml(R, DtdSupport = dtdSupport.true)(R input)
        if (...)
    {
        static if (DtdSupport)
            return dtdWrapper(dxmlParse(input));
        else
            return dxmlParse(input);
    }

Then just note in the documentation that turning off DTD support would provide extra features X, Y, and Z (speed, slices, whatever). Then let the user choose.

Seriously, I would have thought something like this would be obvious to programmers of the calibre found on these forums. I'm a little astonished that this would even be such a point of contention in the first place, since the solution is so simple.

T

-- 
Many open minds should be closed for repairs. -- K5 user
Feb 12 2018
rikki cattermole <rikki cattermole.co.nz> writes:
On 12/02/2018 10:02 PM, H. S. Teoh wrote:
 On Mon, Feb 12, 2018 at 02:54:48PM +0000, rikki cattermole via
Digitalmars-d-announce wrote:
 [...]
 Everything you have mentioned is not in Phobos. Just because something
 is 'good enough' does not make it 'good enough' for Phobos. In the
 words of Andrei "Good enough is not good enough", we need to aim
 higher to show what we actually can do.
And thus Phobos continues to let the perfect be the enemy of the good, and 10 years later std.xml will still be around, and we will still be arguing over how to replace it.
 Personally I find J.M.D. arguments quite reasonable for a third-party
 library, since yes it does cover 90% of the use cases.
As I have just said in another post, dxml itself does not need to be changed to implement DTD support. It's perfectly possible to write a wrapper on top of it that *does* implement DTD support. In fact, I dare say it might be possible to lazily switch from a thin wrapper over dxml to full DTD mode, so that end users don't even need to care about the difference if they don't care to.

As far as API is concerned, it could be as simple as something like:

    auto parseXml(R, DtdSupport = dtdSupport.true)(R input)
        if (...)
    {
        static if (DtdSupport)
            return dtdWrapper(dxmlParse(input));
        else
            return dxmlParse(input);
    }

Then just note in the documentation that turning off DTD support would provide extra features X, Y, and Z (speed, slices, whatever). Then let the user choose.

Seriously, I would have thought something like this would be obvious to programmers of the calibre found on these forums. I'm a little astonished that this would even be such a point of contention in the first place, since the solution is so simple.

T
In other places it was said that it wasn't possible to build it on top of it. But yes, I would be expecting an entry point like you described, and it is something that I mentioned :)

std.experimental.xml:
- interfaces.d:
    interface Element {...}
- entry.d:
    auto parseXML(...)(...) {...}
- impl_subset:
    - dom.d etc.
- impl_full:
    - entry.d etc.
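A compilable version of the kind of entry point discussed above might look like the following. It is only a sketch: `SubsetResult`, `FullResult`, `subsetParse` and `dtdWrapper` are placeholder names that exist neither in dxml nor in Phobos, and the "parsers" do no parsing at all. The point is only the compile-time switch, written with `std.typecons.Flag` because `dtdSupport.true` is not valid D.

```d
import std.typecons : Flag, No, Yes;

// Placeholder result types (assumptions of this sketch, not real APIs).
struct SubsetResult { string source; }
struct FullResult { SubsetResult inner; }

// Stand-in for the fast subset parser (the role dxml would play).
SubsetResult subsetParse(string input) { return SubsetResult(input); }

// Stand-in for a layer that adds DTD handling on top of the subset parser.
FullResult dtdWrapper(SubsetResult r) { return FullResult(r); }

/// Single entry point; the flag picks the implementation at compile time,
/// so each instantiation has exactly one return type.
auto parseXML(Flag!"dtdSupport" dtd = Yes.dtdSupport)(string input)
{
    static if (dtd)
        return dtdWrapper(subsetParse(input));
    else
        return subsetParse(input);
}

unittest
{
    auto full = parseXML("<r/>");                  // default: DTD layer on
    auto fast = parseXML!(No.dtdSupport)("<r/>");  // opt out for the subset

    static assert(is(typeof(full) == FullResult));
    static assert(is(typeof(fast) == SubsetResult));
    assert(fast.source == "<r/>");
}
```

Which setting should be the default is exactly what is being argued in this thread; flipping it is a one-token change in the template parameter.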
Feb 12 2018
"Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:
On 02/12/2018 05:02 PM, H. S. Teoh wrote:
 On Mon, Feb 12, 2018 at 02:54:48PM +0000, rikki cattermole via
Digitalmars-d-announce wrote:
 [...]
 Everything you have mentioned is not in Phobos. Just because something
 is 'good enough' does not make it 'good enough' for Phobos. In the
 words of Andrei "Good enough is not good enough", we need to aim
 higher to show what we actually can do.
And thus Phobos continues to let the perfect be the enemy of the good, and 10 years later std.xml will still be around, and we will still be arguing over how to replace it.
+Several billion. Like the improved assert messages we would've had since many years ago and was implemented, done and ready to go, but it was instead thrown away because...(and here's the real kicker, considering current D climate)...because it was a fully in-library solution instead of a new compiler feature. Go figure ::eyeroll::
 Seriously, I would have thought something like this would be obvious to
 programmers of the calibre found on these forums.  I'm a little
 astonished that this would even be such a point of contention in the
 first place, since the solution is so simple.
I would've expected so too, if it weren't that one of the top favorite activities 'round these parts is nitpicking reasonable ideas to death for stupid reasons. And, generally letting the perfect be the enemy of the good.
Feb 12 2018
Russel Winder <russel winder.org.uk> writes:
On Mon, 2018-02-12 at 14:54 +0000, rikki cattermole via Digitalmars-d-
announce wrote:
 […]

 Personally I find J.M.D. arguments quite reasonable for a third-party
 library, since yes it does cover 90% of the use cases.
The problem is that std.xml needs removing to make it clear there is no good XML package in Phobos. The people will go looking in the Dub repository.

-- 
Russel.
Dr Russel Winder      t: +44 20 7585 2200
41 Buckmaster Road    m: +44 7770 465 077
London SW11 1EN, UK   w: www.russel.org.uk
Feb 13 2018
Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis 
wrote:
 XML parsers are one of those things that everyone seems to want 
 and no one seems to want to work on.
I wrote one 8 years ago... though mine is more focused on HTML parsing, and the XML aspect is just a side effect!
Feb 12 2018
bachmeier <no spam.net> writes:
On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis 
wrote:

 However, if folks as a whole think that Phobos' xml parser 
 needs to support the DTD section to be acceptable, then dxml 
 won't replace std.xml, because dxml is not going to implement 
 DTD support. DTD support fundamentally does not fit in with 
 dxml's design.
Can't you simply give it a name other than std.xml that indicates it doesn't do everything related to xml? It doesn't make sense to not put it into Phobos because of the name, and that should be an easy problem to solve.
Feb 12 2018
bachmeier <no spam.net> writes:
On Monday, 12 February 2018 at 15:43:59 UTC, bachmeier wrote:
 On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis 
 wrote:

 However, if folks as a whole think that Phobos' xml parser 
 needs to support the DTD section to be acceptable, then dxml 
 won't replace std.xml, because dxml is not going to implement 
 DTD support. DTD support fundamentally does not fit in with 
 dxml's design.
Can't you simply give it a name other than std.xml that indicates it doesn't do everything related to xml? It doesn't make sense to not put it into Phobos because of the name, and that should be an easy problem to solve.
Hit send too fast. std.xml.base would be reasonable.
Feb 12 2018
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, February 12, 2018 15:45:50 bachmeier via Digitalmars-d-announce 
wrote:
 On Monday, 12 February 2018 at 15:43:59 UTC, bachmeier wrote:
 On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis

 wrote:
 However, if folks as a whole think that Phobos' xml parser
 needs to support the DTD section to be acceptable, then dxml
 won't replace std.xml, because dxml is not going to implement
 DTD support. DTD support fundamentally does not fit in with
 dxml's design.
Can't you simply give it a name other than std.xml that indicates it doesn't do everything related to xml? It doesn't make sense to not put it into Phobos because of the name, and that should be an easy problem to solve.
Hit send too fast. std.xml.base would be reasonable.
I have no interest in bikeshedding the name right now or even really arguing about Phobos inclusion (I've already said more in this thread about that than I probably should have). That can be left up to the review process, which already tends to be nasty enough that it wouldn't surprise me at all if dxml doesn't get accepted. The only reason that I have any plans to try for Phobos inclusion with dxml is because std.xml needs to be replaced. If Phobos didn't have an XML parser already, I don't expect that I'd bother, since I don't think that it's all that important that a standard library have an XML parser. I just think that it's important that it not have a bad one. In general, I think that XML is the sort of thing that's perfectly fine as a 3rd party solution. - Jonathan M Davis
Feb 12 2018
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Feb 12, 2018 at 07:04:38AM -0700, Jonathan M Davis via
Digitalmars-d-announce wrote:
[...]
 However, if folks as a whole think that Phobos' xml parser needs to
 support the DTD section to be acceptable, then dxml won't replace
 std.xml, because dxml is not going to implement DTD support. DTD
 support fundamentally does not fit in with dxml's design.
Actually, thinking about this, I'm wondering if a combination of preprocessing and/or postprocessing might make it possible to implement DTD support without needing to rewrite the guts of dxml. AIUI, dxml does parse the DTD section correctly, i.e., as an XML directive, but only doesn't look into its internal details. So one way to implement DTD support might be:

- Write an auxiliary parser that's basically a wrapper around dxml, forwarding XML events to the caller, except:
  - If a DTD event is encountered, eagerly parse it, store DTD declarations internally for future reference.
  - If there's a DTD that has been seen, perform on-the-fly validation as XML events are forwarded.
  - In PCDATA sections, if there are entity references to the DTD, expand them, possibly inserting more XML events into the stream based on what's defined in the DTD. (This may need to reuse some dxml internals to parse XML snippets that might be contained in an entity definition, for example.)

[...]
 However, std.xml does not support the DTD section, and glancing over
 it, it doesn't look like it even handles skipping the DTD section
 properly (it doesn't handle the fact that '>' can appear within quoted
 sections within the DTD). So, dxml is not worse than std.xml in that
 regard, and we wouldn't lose any functionality by having dxml replace
 std.xml. It just wouldn't necessarily do as much as some folks might
 like.
[...] If std.xml currently does not support DTDs, then I say dxml is definitely a Phobos candidate. At the very least, it does not make the current situation worse. Rejecting dxml because it doesn't support DTDs is basically letting the perfect be the enemy of the good, which is something this community has been plagued with for far too long. What's worse: a std.dxml that doesn't support DTDs, or a std.xml with fundamental problems that continue to plague us for the next decade while nobody else steps up to implement a suitable replacement? T -- Ph.D. = Permanent head Damage
Feb 12 2018
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 12/02/2018 3:59 PM, H. S. Teoh wrote:
 If std.xml currently does not support DTDs, then I say dxml is
 definitely a Phobos candidate.  At the very least, it does not make the
 current situation worse.  Rejecting dxml because it doesn't support DTDs
 is basically letting the perfect be the enemy of the good, which is
 something this community has been plagued with for far too long.  What's
 worse: a std.dxml that doesn't support DTDs, or a std.xml with
 fundamental problems that continue to plague us for the next decade
 while nobody else steps up to implement a suitable replacement?
dxml 7.5k LOC std.xml 3k LOC dxml would make the situation a lot worse.
Feb 12 2018
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Monday, 12 February 2018 at 16:15:54 UTC, rikki cattermole 
wrote:

 dxml 7.5k LOC
 std.xml 3k LOC

 dxml would make the situation a lot worse.
How could it possibly make the situation any worse than it is now? Atm, nobody will ever use std.xml, because it is sub-standard and has no future. As others have already mentioned: a DTD parser can still be added at a later point. It's like not moving into a newly built house because the winter garden is not yet finished (and you live in Florida :)
Feb 12 2018
parent reply Jacob Carlborg <doob me.com> writes:
On 2018-02-12 17:49, Chris wrote:

 How could it possibly make the situation any worse than it is now? Atm,
 nobody will ever use std.xml, because it is sub-standard and has no future.
I'm using std.xml in a new project right now. It's a really small private project that just needs to extract some data from an XML document. I started it a couple of days before dxml was announced. -- /Jacob Carlborg
Feb 12 2018
parent reply Chris <wendlec tcd.ie> writes:
On Monday, 12 February 2018 at 19:47:09 UTC, Jacob Carlborg wrote:
 On 2018-02-12 17:49, Chris wrote:

 How could it possibly make the situation any worse than it is 
 now? Atm,
 nobody will ever use std.xml, because it is sub-standard and 
 has no future.
I'm using std.xml in a new project right now. It's a really small private project that just needs to extract some data from an XML document. I started it a couple of days before dxml was announced.
A few lines of code that could be replaced easily once something better is available? But who will start an important commercial project with std.xml when it says in red letters: "Warning: This module is considered out-dated and not up to Phobos' current standards. It will remain until we have a suitable replacement, but be aware that it will not remain long term." I for my part wouldn't and I'm glad there's dxml now.
Feb 12 2018
parent Jacob Carlborg <doob me.com> writes:
On 2018-02-12 21:19, Chris wrote:

 A few lines of code that could be replaced easily once something better 
 is available?
Fairly easy because it's so small. I'm actually using the SAX interface from std.xml and it quite nicely fits my needs. -- /Jacob Carlborg
Feb 12 2018
prev sibling parent reply "Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:
On 02/12/2018 11:15 AM, rikki cattermole wrote:
 
 dxml 7.5k LOC
 std.xml 3k LOC
 
 dxml would make the situation a lot worse.
4.5k LOC == "a lot worse"? Uuuuhhh...WAT?
Feb 12 2018
next sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, February 12, 2018 21:53:21 Nick Sabalausky  via Digitalmars-d-
announce wrote:
 On 02/12/2018 11:15 AM, rikki cattermole wrote:
 dxml 7.5k LOC
 std.xml 3k LOC

 dxml would make the situation a lot worse.
4.5k LOC == "a lot worse"? Uuuuhhh...WAT?
There is sometimes a tendency for folks to think that something having a lot of lines of code is bad, and there can be some truth to that. If something can be done in a simpler way, it tends to be shorter and easier to maintain, but shorter isn't always better, and simpler isn't always better - especially if that complexity is needed to get the job done. So, LOC tells you something, but what it really tells you is up for debate. And actually, well-written D code is going to have a much higher line count in general because of stuff like documentation and unit tests being in the source file. In this case, while std.xml does seem to have a fair bit of documentation, it has very little in the way of unit tests, whereas dxml has fairly thorough unit tests - maybe not quite as extreme as std.datetime, but I do tend to be thorough with unit tests. Andrei used to complain periodically about how large std.datetime was, thinking that it was way too much code, and then someone actually went to the effort of stripping out all of the comments and unit tests and whatnot to count the actual lines of code in the implementation, and it was a _way_ smaller number than the lines in the file (IIRC, it might have even been something like only 10% of the file, if that). That's what happens when you write documentation and unit tests that are thorough. - Jonathan M Davis
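The kind of count described above (implementation lines versus comments and unit tests) can be roughly approximated with a short script. The following is a minimal Python sketch, not the tool anyone actually used on std.datetime or dxml; `classify_lines` and `UNITTEST_START` are made-up names, and the heuristics are deliberately crude: braces inside string literals are miscounted, nested `/+ +/` comments are not tracked, and comment lines inside a unittest block are counted as comments.

```python
import re

# Hypothetical helper: a rough, line-based classifier for D source.
# It is an approximation, not a D lexer.
UNITTEST_START = re.compile(r"^(?:[@\w]+\s+)*unittest\b")

def classify_lines(source):
    """Count code, comment, unittest and blank lines in D source text."""
    counts = {"code": 0, "comment": 0, "unittest": 0, "blank": 0}
    in_block = None       # closing token of an open block comment, if any
    in_unittest = False   # currently inside a unittest declaration/body
    depth = 0             # brace depth relative to the unittest block
    opened = False        # whether the unittest's opening brace was seen
    for raw in source.splitlines():
        line = raw.strip()
        if in_block:
            counts["comment"] += 1
            if in_block in line:
                in_block = None
            continue
        if not line:
            counts["blank"] += 1
            continue
        if line.startswith("//"):
            counts["comment"] += 1
            continue
        if line.startswith("/*") or line.startswith("/+"):
            closer = "*/" if line[1] == "*" else "+/"
            counts["comment"] += 1
            if closer not in line[2:]:
                in_block = closer
            continue
        if not in_unittest and UNITTEST_START.match(line):
            in_unittest, depth, opened = True, 0, False
        if in_unittest:
            counts["unittest"] += 1
            depth += line.count("{") - line.count("}")
            opened = opened or "{" in line
            if opened and depth <= 0:
                in_unittest = False
            continue
        counts["code"] += 1
    return counts

sample = '''\
/// Adds two numbers.
int add(int a, int b)
{
    return a + b;
}

@safe unittest
{
    assert(add(1, 2) == 3);
    if (true) { assert(add(0, 0) == 0); }
}

/+ nested-style
   block comment +/
int twice(int x) { return 2 * x; }
'''

print(classify_lines(sample))
# {'code': 5, 'comment': 3, 'unittest': 5, 'blank': 2}
```

Even in this tiny sample, only a third of the lines are implementation, which is the effect being described for well-documented, well-tested modules.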
Feb 12 2018
parent "Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:
On 02/12/2018 10:49 PM, Jonathan M Davis wrote:
 
 Andrei used to complain periodically about how large std.datetime was,
 thinking that it was way too much code, and then someone actually went to
 the effort of stripping out all of the comments and unit tests and whatnot
 to count the actual lines of code in the implementation, and it was a _way_
 smaller number than the lines in the file (IIRC, it might have even been
 something like only 10% of the file, if that). That's what happens when you
 write documentation and unit tests that are thorough.
 
Yea, totally. Another example: mysql-native used to be one (!!) source file. It was maybe a bit on the large side for a single module, but it was still workable. In the last several years, that library has grown many times its old size. But now, I'd say that easily the majority of lines are either comments or tests. The *actual* implementation and API isn't really all that much more LOC than it used to be. The original one-module version, by contrast, was less documented and had...I don't think it even had a single test (IIRC, the now-old-and-probably-bitrotted "app.d" wasn't even there.)
Feb 12 2018
prev sibling parent Kagamin <spam here.lot> writes:
On Tuesday, 13 February 2018 at 02:53:21 UTC, Nick Sabalausky 
(Abscissa) wrote:
 On 02/12/2018 11:15 AM, rikki cattermole wrote:
 
 dxml 7.5k LOC
 std.xml 3k LOC
 
 dxml would make the situation a lot worse.
4.5k LOC == "a lot worse"? Uuuuhhh...WAT?
And it's like 2k LOC of code and 5.5k LOC of tests and docs.
Feb 13 2018
prev sibling next sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, February 12, 2018 07:59:24 H. S. Teoh via Digitalmars-d-announce 
wrote:
 On Mon, Feb 12, 2018 at 07:04:38AM -0700, Jonathan M Davis via
 Digitalmars-d-announce wrote: [...]

 However, if folks as a whole think that Phobos' xml parser needs to
 support the DTD section to be acceptable, then dxml won't replace
 std.xml, because dxml is not going to implement DTD support. DTD
 support fundamentally does not fit in with dxml's design.
Actually, thinking about this, I'm wondering if a combination of preprocessing and/or postprocessing might make it possible to implement DTD support without needing to rewrite the guts of dxml. AIUI, dxml does parse the DTD section correctly, i.e., as an XML directive, but only doesn't look into its internal details. So one way to implement DTD support might be:

- Write an auxiliary parser that's basically a wrapper around dxml, forwarding XML events to the caller, except:
  - If a DTD event is encountered, eagerly parse it, store DTD declarations internally for future reference.
  - If there's a DTD that has been seen, perform on-the-fly validation as XML events are forwarded.
  - In PCDATA sections, if there are entity references to the DTD, expand them, possibly inserting more XML events into the stream based on what's defined in the DTD. (This may need to reuse some dxml internals to parse XML snippets that might be contained in an entity definition, for example.)
The core problem is that entity references get replaced with more XML that needs to be parsed. So, they can't simply be passed on for post-processing. As I understand it, they have to be replaced while the parsing is going on. And that means that you can't do something like return slices of the original input that don't bother with the entity references and then have a separate parser take that and process it further to deal with the entity references. The first parser has to deal with them, and that means not returning slices of the original input unless you're dealing purely with strings and are willing to allocate new strings in the cases where the data needs to be mutated because of an entity reference. If we were going to stick to strings and only strings, it would be quite possible to define the API in a way that it may or may not do DTD processing, but that doesn't work with arbitrary ranges of characters, not unless you give up on returning slices of the original input, and that means harming the performance and usability for the common case in order to support DTDs. Also, anything that has the concept of "events" would be drastically different from what dxml does. dxml is completely range-based. It has no callbacks or anything of the sort, and having anything like that would complicate it considerably. There are lots of interesting things that could be done to try and deal with the DTD section, but they fundamentally don't work with returning slices of the original input unless you're only using strings. In any case, I refuse to change dxml so that it has DTD support, and I refuse to change it so that it doesn't return slices of the original input. If I were to do so, it would make the parser worse for any use case I care about and require a lot of time and effort on my part that I'm not willing to spend. So, if that makes it so that dxml is never included in Phobos, then so be it. 
Folks are free to decide to support dxml for inclusion when the time comes and free to vote it as unacceptable. Personally, I think that dxml's approach is ideal for XML that doesn't use entity references, and I'd much rather use that kind of parser regardless of whether it's in the standard library or not. I think that the D community would be far better off with std.xml being replaced by dxml, but whatever happens happens. I'd be just as fine with a decision to remove std.xml and not include dxml. I'm less fine with std.xml being left in Phobos and dxml being rejected, because std.xml has been recognized as bad, and it sure doesn't look like anyone else is going to write a replacement any time soon.

I also think that dxml's approach is better for the common case than anything that supported DTDs would be, so I think that having dxml's solution in Phobos would be better for the community even if Phobos also had a solution that supported DTDs, but at this point, it looks like the options are going to be

1. std.xml stays and continues to suck.
2. std.xml gets ripped out and dxml replaces it.
3. std.xml gets ripped out and we have no xml solution in Phobos.

But as it stands, it doesn't seem likely that an XML solution that supports DTDs will end up in Phobos any time soon, if ever, because AFAIK, only three people have put in any real effort towards replacing std.xml since 2010 or whenever it was that we decided it needed to be replaced. The first two people both disappeared into oblivion without ever finishing, and here I am with a working StAX parser (now with DOM support) and an XML writer in the works - and given how involved I am with D, I think that it's pretty unlikely that I'm disappearing anywhere short of getting hit by a bus or whatnot. 
So, at least I've actually put in the time and effort towards a solution and made it available, and it will almost certainly be an essentially complete solution by the time that dconf rolls around if not well before. So, I do expect that the question of Phobos inclusion will ultimately be a question of whether std.xml _ever_ gets replaced, but regardless, at least there is a solution, and it will continue to be available as a 3rd party library even if it never makes it into Phobos. - Jonathan M Davis
Feb 12 2018
next sibling parent reply Kagamin <spam here.lot> writes:
On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis 
wrote:
 The core problem is that entity references get replaced with 
 more XML that needs to be parsed. So, they can't simply be 
 passed on for post-processing. As I understand it, they have to 
 be replaced while the parsing is going on. And that means that 
 you can't do something like return slices of the original input 
 that don't bother with the entity references and then have a 
 separate parser take that and process it further to deal with 
 the entity references. The first parser has to deal with them, 
 and that means not returning slices of the original input 
 unless you're dealing purely with strings and are willing to 
 allocate new strings in the cases where the data needs to be 
 mutated because of an entity reference.
Standard entities like &amp; have the same problem, so the same solution should work too.
Feb 13 2018
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, February 13, 2018 15:22:32 Kagamin via Digitalmars-d-announce 
wrote:
 On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis

 wrote:
 The core problem is that entity references get replaced with
 more XML that needs to be parsed. So, they can't simply be
 passed on for post-processing. As I understand it, they have to
 be replaced while the parsing is going on. And that means that
 you can't do something like return slices of the original input
 that don't bother with the entity references and then have a
 separate parser take that and process it further to deal with
 the entity references. The first parser has to deal with them,
 and that means not returning slices of the original input
 unless you're dealing purely with strings and are willing to
 allocate new strings in the cases where the data needs to be
 mutated because of an entity reference.
Standard entities like &amp; have the same problem, so the same solution should work too.
That depends on what exactly an entity reference can contain. If it can do something like put a start tag in there, and then it has to be terminated by the document putting an end tag in there or another entity reference containing an end tag, then it can't be handled after the fact like &amp; can be, since &amp; is just replaced by text. If an entity reference can't contain a start tag without a matching end tag, then sure. But I find the XML spec to be surprisingly hard to understand with regards to entity references. It's not clear to me where it's even legal to put them or not, let alone what you're allowed to put in them exactly. And I can't even really trust the XML grammar as long as entity references are involved, because the grammar in the spec is the grammar _after_ entity references have all been replaced, which I was quite dismayed to figure out. If it's 100% sure that entity references can be treated as just text and that you can't end up with stuff like start tags or end tags being inserted and messing with the parsing such that they all have to be replaced for the XML to be correctly parsed, then I have no problem passing entity references along, and a higher level parser could try to do something with them, but it's not clear to me at all that an XML document with entity references is correct enough to be parsed while not replacing the entity references with whatever XML markup they contain. I had originally passed them along with the idea that a higher level parser could do something with them, but I decided that I couldn't do that if you could do something like drop a start tag in there and change the meaning of the stuff that needs to be parsed that isn't directly in the entity reference. - Jonathan M Davis
Feb 13 2018
parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 13 February 2018 at 20:10:59 UTC, Jonathan M Davis 
wrote:
 On Tuesday, February 13, 2018 15:22:32 Kagamin via 
 Digitalmars-d-announce wrote:
 On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis

 wrote:
 The core problem is that entity references get replaced with 
 more XML that needs to be parsed. So, they can't simply be 
 passed on for post-processing. As I understand it, they have 
 to be replaced while the parsing is going on. And that means 
 that you can't do something like return slices of the 
 original input that don't bother with the entity references 
 and then have a separate parser take that and process it 
 further to deal with the entity references. The first parser 
 has to deal with them, and that means not returning slices 
 of the original input unless you're dealing purely with 
 strings and are willing to allocate new strings in the cases 
 where the data needs to be mutated because of an entity 
 reference.
Standard entities like &amp; have the same problem, so the same solution should work too.
That depends on what exactly an entity reference can contain. If it can do something like put a start tag in there, and then it has to be terminated by the document putting an end tag in there or another entity reference containing an end tag, then it can't be handled after the fact like &amp; can be, since &amp; is just replaced by text. If an entity reference can't contain a start tag without a matching end tag, then sure. But I find the XML spec to be surprisingly hard to understand with regards to entity references. It's not clear to me where it's even legal to put them or not, let alone what you're allowed to put in them exactly. And I can't even really trust the XML grammar as long as entity references are involved, because the grammar in the spec is the grammar _after_ entity references have all been replaced, which I was quite dismayed to figure out. If it's 100% sure that entity references can be treated as just text and that you can't end up with stuff like start tags or end tags being inserted and messing with the parsing such that they all have to be replaced for the XML to be correctly parsed, then I have no problem passing entity references along, and a higher level parser could try to do something with them, but it's not clear to me at all that an XML document with entity references is correct enough to be parsed while not replacing the entity references with whatever XML markup they contain. I had originally passed them along with the idea that a higher level parser could do something with them, but I decided that I couldn't do that if you could do something like drop a start tag in there and change the meaning of the stuff that needs to be parsed that isn't directly in the entity reference.
There's also the issue that entity references open a whole can of worms concerning security. It's quite possible to have an exponentially growing entity replacement that can take down any parser.

<!DOCTYPE root [
<!ELEMENT root ANY>
<!ENTITY LOL "LOL">
<!ENTITY LOL1 "&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;">
<!ENTITY LOL2 "&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;">
<!ENTITY LOL3 "&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;">
<!ENTITY LOL4 "&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;">
<!ENTITY LOL5 "&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;">
<!ENTITY LOL6 "&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;">
<!ENTITY LOL7 "&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;">
<!ENTITY LOL8 "&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;">
<!ENTITY LOL9 "&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;">
]>
<root>&LOL9;</root>

Hope you have enough memory (this expands to 1 000 000 000 LOLs, i.e. 3 000 000 000 characters).
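The size of the fully expanded &LOL9; can be worked out without ever building the string, since each level is just ten copies of the previous one. Here is a minimal Python sketch of that arithmetic; `expanded_length` is a made-up helper, not part of dxml or any XML library, and it assumes the declarations are not recursive (which the XML spec forbids anyway), so it does no cycle detection.

```python
import re

# Matches a general entity reference such as &LOL3; (simplified name rule).
ENTITY_REF = re.compile(r"&([A-Za-z_][A-Za-z0-9_.-]*);")

def expanded_length(name, decls, _cache=None):
    """Length in characters of entity `name` after full expansion.

    `decls` maps entity names to their declared replacement text.
    Only lengths are computed; no expanded string is ever allocated.
    """
    if _cache is None:
        _cache = {}
    if name in _cache:
        return _cache[name]
    value = decls[name]
    total = 0
    pos = 0
    for m in ENTITY_REF.finditer(value):
        total += m.start() - pos                         # literal text
        total += expanded_length(m.group(1), decls, _cache)
        pos = m.end()
    total += len(value) - pos                            # trailing literal
    _cache[name] = total
    return total

# Rebuild the declarations from the DOCTYPE above.
decls = {"LOL": "LOL"}
for i in range(1, 10):
    prev = "LOL" if i == 1 else f"LOL{i - 1}"
    decls[f"LOL{i}"] = f"&{prev};" * 10

print(expanded_length("LOL9", decls))  # 3000000000
```

So a document of well under a kilobyte asks the parser for about 3 GB of text (10^9 copies of "LOL"), which is why parsers that do expand entities usually cap the expansion size or depth.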
Feb 13 2018
next sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, February 13, 2018 21:18:12 Patrick Schluter via Digitalmars-d-
announce wrote:
 There's also the issue that entity references open a whole can of
 worms concerning security. It quite possible to have an
 exponential growing entity replacement that can take down any
 parser.
Well, if dxml just passes the entity references along unparsed beyond validating that the entity reference itself contains valid characters (e.g. it's not something like &.; or & by itself), then dxml would still not be replacing the entity references with anything. Any security or performance problems associated with entity references would be left up to whatever parser parsed the DTD section and then used dxml to parse the rest of the XML and replaced the entity references in dxml's parsing results with whatever they were. The big problem is how the entity references affect the parsing. If start tags can be dropped in and affect the parsing (and it's still not clear to me from the spec whether that's legal - there is a section talking about being nested properly which might indicate that that's not legal, but it's not very specific or clear), and if it's legal to do something like use an entity reference for a tag name - e.g. <&foo;>, then that's a serious problem. And problems like that are the main reason why I completely dropped any attempt to do anything with the DTD section. If entity references are only legal in the text between start and end tags and between the quotes of attribute values, and whatever they're replaced with cannot actually affect anything else in the XML document (i.e. it can't just be a start or end tag or anything like that - it has to be fully parseable on its own and not affect the parsing of the document itself), then passing them along should be fine. Basically, if I can change dxml so that in the places where it currently allows one of the standard entity references to be, it then also allows other entity references but passes them along without replacing them instead of throwing an XMLParsingException, and that works without having documents be screwed up due to missing start tags or something, then passing them along should be fine. But if entity references allow arbitrary enough chunks of XML, that doesn't work. 
It also doesn't work if entity references are allowed in places other than the text between start and end tags or within attribute values. And it's not clear to me at all what is legal in an entity reference or where exactly they're legal. The spec talks about the grammar being the grammar _after_ all of the references have been replaced, which makes the grammar rather untrustworthy, and I find the spec very hard to understand in general. Regardless, there's no risk of dxml's parser ever being changed to actually replace entity references. That doesn't work with returning slices of the original input, and it really doesn't work with a parser that's just supposed to take a range of characters and parse it. To fully handle all of the DTD stuff means actually reading files from disk or from the internet - which of course is where the security problems come in, but it also means that you're not just dealing with a parser anymore. In principle, dxml's parser should be pure (though some implementation details make it so that it isn't right now), whereas an XML parser that fully handles the DTD section could never be pure. - Jonathan M Davis
Feb 13 2018
parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 13 February 2018 at 22:00:59 UTC, Jonathan M Davis 
wrote:
 On Tuesday, February 13, 2018 21:18:12 Patrick Schluter via 
 Digitalmars-d- announce wrote:
 [...]
Well, if dxml just passes the entity references along unparsed beyond validating that the entity reference itself contains valid characters (e.g. it's not something like &.; or & by itself), then dxml would still not be replacing the entity references with anything. Any security or performance problems associated with entity references would be left up to whatever parser parsed the DTD section and then used dxml to parse the rest of the XML and replaced the entity references in dxml's parsing results with whatever they were. The big problem is how the entity references affect the parsing. If start tags can be dropped in and affect the parsing (and it's still not clear to me from the spec whether that's legal - there is a section talking about being nested properly which might indicate that that's not legal, but it's not very specific or clear), and if it's legal to do something like use an entity reference for a tag name - e.g. <&foo;>, then that's a serious problem. And problems like that are the main reason why I completely dropped any attempt to do anything with the DTD section.
Yikes! In any case, even if I had to implement a parser I would tend to not implement this "feature" as it sounds quite unreasonable. Only if a real need (i.e. one in the real world, not one that could be contrived out of the specs) arises would I then potentially implement the real deal.
Feb 14 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Wednesday, February 14, 2018 10:03:45 Patrick Schluter via Digitalmars-d-
announce wrote:
 On Tuesday, 13 February 2018 at 22:00:59 UTC, Jonathan M Davis

 wrote:
 On Tuesday, February 13, 2018 21:18:12 Patrick Schluter via

 Digitalmars-d- announce wrote:
 [...]
Well, if dxml just passes the entity references along unparsed beyond validating that the entity reference itself contains valid characters (e.g. it's not something like &.; or & by itself), then dxml would still not be replacing the entity references with anything. Any security or performance problems associated with entity references would be left up to whatever parser parsed the DTD section and then used dxml to parse the rest of the XML and replaced the entity references in dxml's parsing results with whatever they were. The big problem is how the entity references affect the parsing. If start tags can be dropped in and affect the parsing (and it's still not clear to me from the spec whether that's legal - there is a section talking about being nested properly which might indicate that that's not legal, but it's not very specific or clear), and if it's legal to do something like use an entity reference for a tag name - e.g. <&foo;>, then that's a serious problem. And problems like that are the main reason why I completely dropped any attempt to do anything with the DTD section.
Yikes! In any case, even if I had to implement a parser I would tend to not implement this "feature" as it sounds quite unreasonable. Only if a real need (i.e. one in the real world, not one that could be contrived out of the specs) arises would I then potentially implement the real deal.
Well, since folks other than me are going to use this parser, and it's even potentially going to end up in D's standard library, it needs to at least be good enough to not let through invalid XML or incorrectly interpret any XML. It can potentially not support portions of the spec as long as it does so in a clear and clean manner, but it's going to have to correctly handle anything that it does handle. For better or worse, I'm the sort of person who prefers to completely implement a spec when I'm implementing one, but in this case, it wasn't really reasonable. Fortunately however, from the perspective of implementing something that's useful for me personally, the DTD section is completely unnecessary. From that perspective, processing instructions and CDATA sections are also unnecessary, since I'd never do anything with them, but I don't think that it would be reasonable to skip those, so they're implemented. And it's not like they're hard to implement support for, unlike the DTD section. - Jonathan M Davis
Feb 14 2018
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Feb 13, 2018 at 09:18:12PM +0000, Patrick Schluter via
Digitalmars-d-announce wrote:
 On Tuesday, 13 February 2018 at 20:10:59 UTC, Jonathan M Davis wrote:
[...]
 If it's 100% sure that entity references can be treated as just text
 and that you can't end up with stuff like start tags or end tags
 being inserted and messing with the parsing such that they all have
 to be replaced for the XML to be correctly parsed, then I have no
 problem passing entity references along, and a higher level parser
 could try to do something with them, but it's not clear to me at all
 that an XML document with entity references is correct enough to be
 parsed while not replacing the entity references with whatever XML
 markup they contain. I had originally passed them along with the
 idea that a higher level parser could do something with them, but I
 decided that I couldn't do that if you could do something like drop
 a start tag in there and change the meaning of the stuff that needs
 to be parsed that isn't directly in the entity reference.
This made me go to the W3C spec (https://www.w3.org/TR/xml/) to figure out what exactly is/isn't defined. I discovered to my chagrin that XML entities are a huge rabbit hole with extremely pathological behaviour that makes it almost impossible to implement in any way that's even remotely efficient. Here's a page with examples of how nasty it can get:

    http://www.floriankaeferboeck.at/XML/Comparison.html

Here's an example given in the W3C spec itself:

    <?xml version='1.0'?>
    <!DOCTYPE test [
    <!ELEMENT test (#PCDATA) >
    <!ENTITY % xx '&#37;zz;'>
    <!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
    %xx;
    ]>
    <test>This sample shows a &tricky; method.</test>

A correct XML parser is supposed to produce the following text as the body of the <test>...</test> tag (the grammatical error is intentional):

    This sample shows a error-prone method.

Fortunately, there's a glimmer of hope on the horizon: in section 4.3.2 of the spec (https://www.w3.org/TR/xml/#wf-entities), it is explicitly stated:

    A consequence of well-formedness in general entities is that the
    logical and physical structures in an XML document are properly
    nested; no start-tag, end-tag, empty-element tag, element, comment,
    processing instruction, character reference, or entity reference
    can begin in one entity and end in another.

Meaning, if I understand it correctly, that you can't have a start tag in &entity1; and its corresponding end tag in &entity2;, and then have your document contain "&entity1; &entity2;". This is because the body of the entity can only contain text or entire tags (the production "content" in the spec); an entity that contains an open tag without an end tag (or vice versa) does not match this rule and is thus illegal.

So this means that we *can* use dxml as a backend to drive a DTD-supporting XML parser implementation. The wrapper / higher-level parser would scan the slices returned by dxml for entity references, and substitute them accordingly, which may involve handing the body of the entity to another instance of dxml to parse any tags that may be nested in there.

The nastiness involving partially-formed entity references (as seen in the above examples) apparently only applies inside the DOCTYPE declaration, so AIUI this can be handled by the higher-level parser as part of replacing inline entities with their replacement text. (The higher-level parser has a pretty tall order to fill, though, because entities can refer to remote resources via URI, meaning that an innocuous-looking 5-line XML file can potentially expand to terabytes of XML tags downloaded from who knows how many external resources recursively. Not to mention a bunch of security issues like those described below.)
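
To make the wrapper idea a bit more concrete, here is a minimal sketch of just the substitution step in D. It assumes the lower-level parser hands back text with entity references left untouched; `expandEntities` and the lookup table are hypothetical names for illustration, not part of dxml. The sketch deliberately ignores replacement text that itself contains markup or further references, which a real DTD layer would have to re-parse.

```d
import std.array : appender;
import std.string : indexOf;

/// Hypothetical helper: replaces `&name;` references in `text` using `table`.
/// References that are not in the table are copied through unchanged.
string expandEntities(string text, string[string] table)
{
    auto result = appender!string();
    while (true)
    {
        immutable amp = text.indexOf('&');
        if (amp == -1)
            break;
        immutable semi = text[amp .. $].indexOf(';');
        if (semi == -1)
            break; // unterminated reference: leave the rest as-is

        result.put(text[0 .. amp]);
        string name = text[amp + 1 .. amp + semi];
        if (auto replacement = name in table)
            result.put(*replacement);
        else
            result.put(text[amp .. amp + semi + 1]);
        text = text[amp + semi + 1 .. $];
    }
    result.put(text);
    return result.data;
}

unittest
{
    auto table = ["tricky": "error-prone"];
    assert(expandEntities("This sample shows a &tricky; method.", table)
        == "This sample shows a error-prone method.");
    // Unknown references pass through untouched.
    assert(expandEntities("a &lt; b", table) == "a &lt; b");
}
```

Note that the result is a newly allocated string rather than a slice of the input, which is exactly the cost that the plain parser avoids.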
 There's also the issue that entity references open a whole can of
 worms concerning security. It's quite possible to have an exponentially
 growing entity replacement that can take down any parser.
 
 <!DOCTYPE root [
  <!ELEMENT root ANY>
  <!ENTITY LOL "LOL">
  <!ENTITY LOL1 "&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;">
  <!ENTITY LOL2
 "&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;">
  <!ENTITY LOL3
 "&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;">
  <!ENTITY LOL4
 "&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;">
  <!ENTITY LOL5
 "&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;">
  <!ENTITY LOL6
 "&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;">
  <!ENTITY LOL7
 "&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;">
  <!ENTITY LOL8
 "&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;">
  <!ENTITY LOL9
 "&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;">
 ]>
 <root>&LOL9;</root>
 
 Hope you have enough memory (this expands to 1 000 000 000 LOLs, i.e. 3 000 000 000 characters)
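
The growth in the example above is easy to quantify without expanding anything: each level repeats the previous one ten times, so LOL9 is 10^9 copies of "LOL", or about 3 * 10^9 characters. The following D sketch does that arithmetic and shows the kind of size check a DTD layer could apply up front. The function names and the budget parameter are made up for illustration, and overflow for very deep nesting is not handled.

```d
/// Length in characters of LOL<level> from the example above, computed
/// without building the string. Assumes the base entity is `baseLength`
/// characters long and each level repeats the previous one `fanOut` times.
ulong expandedLength(uint level, ulong baseLength = 3, ulong fanOut = 10)
{
    ulong length = baseLength;
    foreach (_; 0 .. level)
        length *= fanOut;
    return length;
}

/// Hypothetical guard: refuse to expand anything larger than `maxChars`.
bool withinBudget(uint level, ulong maxChars)
{
    return expandedLength(level) <= maxChars;
}

unittest
{
    assert(expandedLength(0) == 3);               // "LOL"
    assert(expandedLength(9) == 3_000_000_000UL); // 10^9 copies of "LOL"
    assert(!withinBudget(9, 1_000_000));
    assert(withinBudget(2, 1_000_000));
}
```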
[...] Yeah, after reading through relevant portions of the spec, I have to say that full DTD support is a HUGE can of worms. I tip my hat in advance to the brave soul (or poor fool :-P) who would attempt to implement the spec in full. :-D

There are ways to deal with exponential entity growth, e.g., if the expansion was carried out lazily. But it's still a DOS vulnerability if the software then spins practically forever trying to traverse the huge range of stuff being churned out.

Not to mention that having embedded external references is itself a security issue, particularly since the partial entity formation thing can be used to obfuscate the real URI of a referenced entity, so you could potentially trick a remote XML parser to download stuff from questionable sources. It could be used as a covert surveillance method, for example, or a malware delivery vector, if combined with an exploitable bug in the parser code. Or it could be used to read sensitive files (e.g., if an entity references file:///etc/passwd or some such system file). Ick.

Ironically, the general advice I found online w.r.t. XML vulnerabilities is "don't allow DTDs", "don't expand entities", "don't resolve externals", etc. There also aren't many XML parsers out there that fully support all the features called for in the spec. IOW, this basically amounts to "just use dxml and forget about everything else". :-D

Now of course, there *are* valid use cases for DTDs... but a naïve implementation of the spec is only going to end in tears. My current inclination is, just merge dxml into Phobos, then whoever dares implement DTD support can do so on top of dxml, and shoulder their own responsibility for vulnerabilities or whatever.

(I mean, seriously, just for the sake of being able to say "my XML is validated" we have to implement network access, local filesystem access, a security framework, and what amounts to a sandbox to control pathological behaviour like exponentially recursive entities? And all of this, just to handle rare corner cases? That's completely ridiculous. It's an obvious design smell to me. The only thing missing from this poisonous mix is Turing completeness, which would have made XML hackers' heaven. Oh wait, on further googling, I see that XSLT *is* Turing complete. Great, just great. Now I know why I've always had this gut feeling that *something* is off about the whole XML mania.)

T

--
English is useful because it is a mess. Since English is a mess, it maps well onto the problem space, which is also a mess, which we call reality. Similarly, Perl was designed to be a mess, though in the nicest of all possible ways. -- Larry Wall
Feb 13 2018
parent Chris <wendlec tcd.ie> writes:
On Tuesday, 13 February 2018 at 22:13:36 UTC, H. S. Teoh wrote:

 Ironically, the general advice I found online w.r.t XML 
 vulnerabilities is "don't allow DTDs", "don't expand entities", 
 "don't resolve externals", etc..  There also aren't many XML 
 parsers out there that fully support all the features called 
 for in the spec.  IOW, this basically amounts to "just use dxml 
 and forget about everything else". :-D

 Now of course, there *are* valid use cases for DTDs... but a 
 naïve implementation of the spec is only going to end in tears.
  My current inclination is, just merge dxml into Phobos, then 
 whoever dares implement DTD support can do so on top of dxml, 
 and shoulder their own responsibility for vulnerabilities or 
 whatever.  (I mean, seriously, just for the sake of being able 
 to say "my XML is validated" we have to implement network 
 access, local filesystem access, a security framework, and what 
 amounts to a sandbox to control pathological behaviour like 
 exponentially recursive entities?  And all of this, just to 
 handle rare corner cases?  That's completely ridiculous.  It's 
 an obvious design smell to me.  The only thing missing from 
 this poisonous mix is Turing completeness, which would have 
 made XML hackers' heaven.  Oh wait, on further googling, I see 
 that XSLT *is* Turing complete.  Great, just great.   Now I 
 know why I've always had this gut feeling that *something* is 
 off about the whole XML mania.)


 T
Thanks for the analysis. I'd say you're right. It makes no sense to keep dxml from becoming std.xml's successor only because it doesn't support DTDs. Also, as I said before, if we had DTD support in std.xml, people would complain about the lack of efficiency, and the discussions about interpreting the specs correctly and implementing them 100%, along with the complaints about the lack of security, would just never end.
Feb 14 2018
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Feb 13, 2018 at 03:00:59PM -0700, Jonathan M Davis via
Digitalmars-d-announce wrote:
[...]
 The big problem is how the entity references affect the parsing. If
 start tags can be dropped in and affect the parsing (and it's still
 not clear to me from the spec whether that's legal - there is a
 section talking about being nested properly which might indicate that
 that's not legal, but it's not very specific or clear), and if it's
 legal to do something like use an entity reference for a tag name -
 e.g. <&foo;>, then that's a serious problem. And problems like that
 are the main reason why I completely dropped any attempt to do
 anything with the DTD section.
AFAICT, section 4.3.2 in the spec (probably the one you're referring to) seems to be saying that you can't do that: A consequence of well-formedness in general entities is that the logical and physical structures in an XML document are properly nested; no start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another.
 If entity references are only legal in the text between start and end
 tags and between the quotes of attribute values, and whatever they're
 replaced with cannot actually affect anything else in the XML document
 (i.e. it can't just be a start or end tag or anything like that - it
 has to be fully parseable on its own and not affect the parsing of
 the document itself), then passing them along should be fine.
That's the approach I'm thinking of. [...]
 Regardless, there's no risk of dxml's parser ever being changed to
 actually replace entity references. That doesn't work with returning
 slices of the original input, and it really doesn't work with a parser
 that's just supposed to take a range of characters and parse it. To
 fully handle all of the DTD stuff means actually reading files from
 disk or from the internet - which of course is where the security
 problems come in, but it also means that you're not just dealing with
 a parser anymore. In principle, dxml's parser should be pure (though
 some implementation details make it so that it isn't right now), whereas an
 XML parser that fully handles the DTD section could never be pure.
[...] Given the insane complexities of DTD that I'm only slowly beginning to grasp from actually reading the spec, I'm quickly adopting the opinion that dxml should remain as-is, and any DTD implementation should be layered on top. The only potential changes that might be needed are:

- provide a way to parse XML snippets that don't have a <?xml ...> declaration, so that a DTD implementation could, for example, hand an entity body over to dxml to extract any tags that may be nested in there (and if my reading of section 4.3.2 is correct, all such tags must always be closed inside the entity body, so there should be no errors produced).

- provide some way of hooking into non-default entities so that DTD-defined entities can be expanded by the DTD implementation. This could be as simple as leaving such entities untouched in the returned range, or inventing a special EntityType representing such entities (with a slice of the input containing the entity name) so that the DTD implementation can insert the replacement text.

Everything else should be handled by the DTD layer, e.g., parsing the DOCTYPE section (which is itself pretty pathological, given the actual examples in the W3C spec to this effect), expanding entities, looking up external entities, limiting recursive entity expansion, implementing a security model, etc.

T

--
Why do conspiracy theories always come from the same people??
Feb 13 2018
parent reply Kagamin <spam here.lot> writes:
On Tuesday, 13 February 2018 at 22:29:27 UTC, H. S. Teoh wrote:
 - provide some way of hooking into non-default entities so that
   DTD-defined entities can be expanded by the DTD 
 implementation.
The parser now returns raw text, so entity replacement can be done by a DTD processor without any modification of the API. So it's good for experimenting, if there's an incentive to maintain it, but it's purely a PR problem: there's nothing wrong with having XML support in the dub registry and std.xml in Phobos; if Phobos is OK with it, it can stay as is.

It looks like EntityRange requires a forward range; is that OK for a parser?
Feb 14 2018
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Wednesday, February 14, 2018 10:14:44 Kagamin via Digitalmars-d-announce 
wrote:
 It looks like EntityRange requires forward range, is it ok for a
 parser?
It's very difficult in general to write a parser that isn't at least a forward range, because without that, you're stuck at only one character of look ahead unless you play a lot of games with putting data from the input range in a buffer so that you can keep it around to look at it again after you've looked farther ahead. Honestly, pure input ranges are borderline useless for a _lot_ of cases. It's generally only the cases where you only care about operating on each element individually irrespective of what's going on with other elements in the range that pure input ranges are really useable, and parsing definitely doesn't fall into that camp. - Jonathan M Davis
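
As a small illustration of the lookahead point, here is a sketch in D of peeking more than one character ahead. It is not taken from dxml; `peekStartsWith` is a made-up helper. The idea is that the check only advances a copy obtained from `save`, which is what a forward range guarantees and a pure input range does not.

```d
import std.range.primitives;

/// Returns true if `input` starts with `prefix`, without consuming `input`.
/// Only a saved copy is advanced, so a forward range is required.
bool peekStartsWith(R)(R input, string prefix)
if (isForwardRange!R)
{
    auto lookahead = input.save;
    foreach (dchar expected; prefix)
    {
        if (lookahead.empty || lookahead.front != expected)
            return false;
        lookahead.popFront();
    }
    return true;
}

unittest
{
    auto text = "<!-- comment -->";
    assert(peekStartsWith(text, "<!--"));
    assert(!peekStartsWith(text, "<?xml"));
    assert(text.front == '<'); // nothing was consumed
}
```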
Feb 14 2018
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 14/02/2018 10:32 AM, Jonathan M Davis wrote:
 On Wednesday, February 14, 2018 10:14:44 Kagamin via Digitalmars-d-announce
 wrote:
 It looks like EntityRange requires forward range, is it ok for a
 parser?
It's very difficult in general to write a parser that isn't at least a forward range, because without that, you're stuck at only one character of look ahead unless you play a lot of games with putting data from the input range in a buffer so that you can keep it around to look at it again after you've looked farther ahead. Honestly, pure input ranges are borderline useless for a _lot_ of cases. It's generally only the cases where you only care about operating on each element individually irrespective of what's going on with other elements in the range that pure input ranges are really useable, and parsing definitely doesn't fall into that camp. - Jonathan M Davis
See lines:
- Input!IR temp = input;
- input = temp;

    bool commentLine() {
        Input!IR temp = input;

        if (!temp.empty && temp.front.c == '/') {
            temp.popFront;
            if (!temp.empty && temp.front.c == '/')
                temp.popFront;
            else
                return false;
        } else
            return false;

        if (!temp.empty) {
            size_t endOffset = temp.front.location.fileOffset;

            while(temp.front.location.lineOffset != 0) {
                endOffset = temp.front.location.fileOffset;
                temp.popFront;

                if (temp.empty) {
                    endOffset++;
                    break;
                }
            }

            current.type = Token.Type.Comment_Line;
            current.location = input.front.location;
            current.location.length = endOffset - input.front.location.fileOffset;

            input = temp;
            return true;
        } else
            return false;
    }
Feb 14 2018
parent reply Adrian Matoga <dlang.spam matoga.info> writes:
On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole 
wrote:
 See lines:
 - Input!IR temp = input;
 - input = temp;

            bool commentLine() {
 		Input!IR temp = input;

 (...)
 		if (!temp.empty) {
 (...)		
 			input = temp;
 			return true;
 		} else
 			return false;
 	}
`temp = input.save` is exactly what you want here, which means forward range is required. Your example won't work for range objects with reference semantics.
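
A minimal sketch of the difference, using a made-up class-based range (`CharStream` is purely illustrative and unrelated to the `Input!IR` type above): plain assignment copies only the reference, so advancing the "copy" also advances the original, while `save` produces an independent position.

```d
import std.range.primitives;

/// Minimal forward range with reference semantics, for illustration only.
final class CharStream
{
    private string data;
    this(string data) { this.data = data; }
    @property bool empty() const { return data.length == 0; }
    @property char front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
    @property CharStream save() { return new CharStream(data); }
}

static assert(isForwardRange!CharStream);

unittest
{
    auto input = new CharStream("abc");

    // Assignment shares state: popping `aliased` also consumes `input`.
    auto aliased = input;
    aliased.popFront();
    assert(input.front == 'b');

    // `save` gives an independent copy: `input` stays where it was.
    auto saved = input.save;
    saved.popFront();
    assert(input.front == 'b');
    assert(saved.front == 'c');
}
```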
Feb 14 2018
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 14/02/2018 2:02 PM, Adrian Matoga wrote:
 On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote:
 See lines:
 - Input!IR temp = input;
 - input = temp;

            bool commentLine() {
         Input!IR temp = input;

 (...)
         if (!temp.empty) {
 (...)
             input = temp;
             return true;
         } else
             return false;
     }
`temp = input.save` is exactly what you want here, which means forward range is required. Your example won't work for range objects with reference semantics.
Ah I must be thinking of ranges that support indexing.
Feb 14 2018
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Wednesday, February 14, 2018 14:09:21 rikki cattermole via Digitalmars-d-
announce wrote:
 On 14/02/2018 2:02 PM, Adrian Matoga wrote:
 On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote:
 See lines:
 - Input!IR temp = input;
 - input = temp;

            bool commentLine() {
         Input!IR temp = input;

 (...)
         if (!temp.empty) {
 (...)
             input = temp;
             return true;
         } else
             return false;
     }
`temp = input.save` is exactly what you want here, which means forward range is required. Your example won't work for range objects with reference semantics.
Ah I must be thinking of ranges that support indexing.
Random access ranges are also forward ranges and would require a call to save here. - Jonathan M Davis
Feb 14 2018
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 14/02/2018 5:13 PM, Jonathan M Davis wrote:
 On Wednesday, February 14, 2018 14:09:21 rikki cattermole via Digitalmars-d-
 announce wrote:
 On 14/02/2018 2:02 PM, Adrian Matoga wrote:
 On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote:
 See lines:
 - Input!IR temp = input;
 - input = temp;

             bool commentLine() {
          Input!IR temp = input;

 (...)
          if (!temp.empty) {
 (...)
              input = temp;
              return true;
          } else
              return false;
      }
`temp = input.save` is exactly what you want here, which means forward range is required. Your example won't work for range objects with reference semantics.
Ah I must be thinking of ranges that support indexing.
Random access ranges are also forward ranges and would require a call to save here. - Jonathan M Davis
Luckily in my code I can forget that ;)
Feb 14 2018
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Thursday, February 15, 2018 01:55:28 rikki cattermole via Digitalmars-d-
announce wrote:
 On 14/02/2018 5:13 PM, Jonathan M Davis wrote:
 On Wednesday, February 14, 2018 14:09:21 rikki cattermole via
 Digitalmars-d-
 announce wrote:
 On 14/02/2018 2:02 PM, Adrian Matoga wrote:
 On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole 
wrote:
 See lines:
 - Input!IR temp = input;
 - input = temp;

             bool commentLine() {

          Input!IR temp = input;

 (...)

          if (!temp.empty) {

 (...)

              input = temp;
              return true;

          } else

              return false;

      }
`temp = input.save` is exactly what you want here, which means forward range is required. Your example won't work for range objects with reference semantics.
Ah I must be thinking of ranges that support indexing.
Random access ranges are also forward ranges and would require a call to save here. - Jonathan M Davis
Luckily in my code I can forget that ;)
LOL. That's actually part of what makes writing range-based libraries so much harder to get right than simply using ranges in your program. When a piece of code is used with only a few types of ranges (or even only one type of range, as is often the case), then it's generally not very hard to write code that works just fine, but as soon as you have to worry about arbitrary ranges, you get all kinds of nonsense that you have to worry about in order to make sure that the code works correctly for any range that's passed to it. save is the classic example of something that a lot of range-based code gets wrong, because for most ranges, it really doesn't matter, but for those ranges where it does, a single missed call to save results in code that doesn't work properly. To get it right, you basically have to call save every time you pass a range to a range-based function that is not supposed to consume the range, and folks rarely get that right. Certainly, pretty much any range-based code that doesn't have unit tests which include reference-type ranges is going to be wrong for reference-type ranges. Even Phobos has had quite a few issues with that historically. - Jonathan M Davis
Feb 14 2018
parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 15 February 2018 at 02:40:03 UTC, Jonathan M Davis 
wrote:
 LOL. That's actually part of what makes writing range-based 
 libraries so much harder to get right than simply using ranges 
 in your program. [snip]
That sounds like an interesting topic for a blog post.
Feb 15 2018
prev sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, February 13, 2018 14:13:36 H. S. Teoh via Digitalmars-d-announce 
wrote:
 Great, just
 great.   Now I know why I've always had this gut feeling that
 *something* is off about the whole XML mania.)
Well, there are plenty of folks who talk like XML is a pile of steaming muck that should never be used (and then usually talk about how great JSON is). I think that basic XML is actually pretty okay - basically the subset that dxml supports, though if I were designing XML I'd take it a bit further.

Personally, I'd make XML documents completely recursive - meaning that the top level is the same as any deeper level, so you could have as many element tags at the top level as you want and as much text as you want, whereas XML requires a root element and only allows stuff like processing instructions, comments, and the DOCTYPE stuff outside of the root element.

I'd get rid of the <?xml...?> and <!DOCTYPE...> declarations as well as processing instructions, and I'd probably get rid of the CDATA section in favor of escaping characters with backslashes like you typically do in strings (or in JSON), and related to that, I'd get rid of the predefined entity references, making stuff like & legal. I also might get rid of empty element tags because they're annoying to deal with when parsing, but they do reduce the verbosity of the document such that they might be worth keeping. It's also tempting to get rid of the tag name on end tags, which would actually make parsing much easier, but having them helps the legibility of XML documents, and it's a bit like semicolons in D in the sense that they can help ensure that error messages refer to the right thing rather than something later in the document, so I don't know. I'd also allow all Unicode characters instead of disallowing a number of them, since it won't really matter for most documents, and then the parser doesn't need to care about them when validating.

So, basically, you end up with start tags, end tags, and comments, with start tags optionally having attributes. Backslashes would then be used for escaping stuff, and you end up with something pretty dead simple.

However, as you're finding out when reading through the XML spec, the folks who created XML didn't think like that at all, and were clearly coming from a _very_ different point of view as to what an XML document was for and should contain. But as you might imagine, given my take on what XML should have been, finding out in detail what XML actually _is_ was pretty horrifying. I started dxml with the intention of fully implementing all aspects of the spec but ultimately decided that it simply wasn't worth it. - Jonathan M Davis
Feb 13 2018
prev sibling parent reply nkm1 <t4nk074 openmailbox.org> writes:
On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis 
wrote:
 Folks are free to decide to support dxml for inclusion when the 
 time comes and free to vote it as unacceptable. Personally, I 
 think that dxml's approach is ideal for XML that doesn't use 
 entity references, and I'd much rather use that kind of parser 
 regardless of whether it's in the standard library or not. I 
 think that the D community would be far better off with std.xml 
 being replaced by dxml, but whatever happens happens.
Bump! I'm using dxml now, and it's a very good library. So I thought "it should be in Phobos instead of std.xml" and searched the newsgroup. Sorry for necroposting. Anyway, what I wanted to say is just take an example from Perl and call it std.xml.simple. Then people would know what to expect from it and would use it (because everyone likes simple). That would also leave a way to include std.xml.full (or some such) at some indefinite point in the future. Which is, in practice, probably never - and that's fine, because who needs DTD? screw it... Anyway, thanks for the library, Jonathan.
Aug 30 2018
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 30, 2018 at 07:26:28PM +0000, nkm1 via Digitalmars-d-announce wrote:
 On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote:
 Folks are free to decide to support dxml for inclusion when the time
 comes and free to vote it as unacceptable. Personally, I think that
 dxml's approach is ideal for XML that doesn't use entity references,
 and I'd much rather use that kind of parser regardless of whether
 it's in the standard library or not. I think that the D community
 would be far better off with std.xml being replaced by dxml, but
 whatever happens happens.
+1. I vote for adding dxml to Phobos. [...]
 I'm using dxml now, and it's a very good library. So I thought "it
 should be in Phobos instead of std.xml" and searched the newsgroup.
 Sorry for necroposting. Anyway, what I wanted to say is just take an
 example from Perl and call it std.xml.simple. Then people would know
 what to expect from it and would use it (because everyone likes
 simple). That would also leave a way to include std.xml.full (or some
 such) at some indefinite point in the future. Which is, in practice,
 probably never - and that's fine, because who needs DTD? screw it...
[...] That's a good idea, actually. That will stop people who expect full DTD support from complaining that it's not supported by the standard library. I vote for adding dxml to Phobos as std.xml.simple. We can either leave std.xml as-is, or deprecate it and work on std.xml.full (or std.xml.complex, or whatever). The current state of std.xml gives a poor impression to anyone coming to D the first time and wanting to work with XML, and having std.xml.simple would be a big plus. T -- This is not a sentence.
Sep 13 2018
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Feb 12, 2018 at 09:50:16AM -0700, Jonathan M Davis via
Digitalmars-d-announce wrote:
[...]
 The core problem is that entity references get replaced with more XML
 that needs to be parsed. So, they can't simply be passed on for
 post-processing.  As I understand it, they have to be replaced while
 the parsing is going on.  And that means that you can't do something
 like return slices of the original input that don't bother with the
 entity references and then have a separate parser take that and
 process it further to deal with the entity references. The first
 parser has to deal with them, and that means not returning slices of
 the original input unless you're dealing purely with strings and are
 willing to allocate new strings in the cases where the data needs to
 be mutated because of an entity reference.
[...] I think you missed my point. What I'm trying to say is, given the current functionality of dxml, one *can* build an XML interface that implements DTD support. Of course, some concessions obviously have to be made, such as needing to allocate memory (I don't see how else one could keep a dictionary of DTD rules / entity declarations otherwise, for example), or not being able to return only slices of the input anymore. For example, entity support pretty much means plain slices are no longer an option, because you have to perform substitution of entity definitions, so you'll have to either wrap it in some kind of lazy range that chains the entity definition to the surrounding text, or you'll have to use strings or something else. Which means you'll need to have memory allocation / slower parsing / whatever, but that's the price of DTD support.

But again, the point is, basic XML parsing (without DTD support) doesn't *need* to pay this price. What's currently in dxml doesn't need to change. DTD support can be implemented in a submodule / separate module that wraps around dxml and builds DTD support on top of it.

Put another way, we can implement DTD support *on top of* dxml this way:

- Parse the XML using dxml as an initial step (this can be done lazily, or semi-lazily, as needed).

- As an intermediate step, parse the DTD section, construct whatever internal state is needed to handle DTD rules, a dictionary of entity references, etc.

- Filter the output of dxml to insert whatever extra behaviour is needed to implement DTD support before handing it to the calling code, e.g., expand entity references, or implement validation and throw an exception if validation fails, etc.

*We don't need to change dxml's current API at all.* At the most, I anticipate that the only potential change needed is to expose an interface to parse XML fragments (i.e., not a complete XML document that contains an outer <xml> tag, but just some PCDATA that may contain entities or tags) so that the DTD support wrapper can use it to expand entities and insert any tags that may appear inside the entity definition. The DTD wrapper doesn't guarantee (and doesn't need to!) to return slices of the input like dxml does. I don't see that as a problem, since I can't see how anyone would be able to implement full DTD support with only slices, even independently from the way dxml is implemented right now. We can even design the DTD support wrapper to start with being just a thin wrapper around dxml, and lazily switch to full DTD mode only if a DTD section is encountered. Then user code that doesn't care to use dxml's raw API won't even need to care about the difference. T -- Curiosity kills the cat. Moral: don't be the cat.
Feb 12 2018
parent Chris <wendlec tcd.ie> writes:
On Monday, 12 February 2018 at 21:51:56 UTC, H. S. Teoh wrote:
[...]
 We can even design the DTD support wrapper to start with being 
 just a thin wrapper around dxml, and lazily switch to full DTD 
 mode only if a DTD section is encountered.  Then user code that 
 doesn't care to use dxml's raw API won't even need to care 
 about the difference.


 T
In this vein, if a new version of std.xml didn't offer pure and fast parsing like dxml, but included DTD by default, people would complain that that was the real deal breaker (too slow, man!). Remember `autodecode`? Right. DTD inclusion should only be available on demand. Imagine you want to implement a library project where ebooks (say classics) are catalogued and presented in an ebook reader on the web (or in an app on your smart phone). It is likely that the whole DTD thing would probably be done at the cataloguing stage, but once the books are in the library most users will probably just want to go through them page by page or search for quotes etc. - and for that you'd need a fast tool like dxml with no overhead.
Feb 13 2018
prev sibling next sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, February 12, 2018 13:51:56 H. S. Teoh via Digitalmars-d-announce 
wrote:
 For example, entity
 support pretty much means plain slices are no longer an option, because
 you have to perform substitution of entity definitions, so you'll have
 to either wrap it in some kind of lazy range that chains the entity
 definition to the surrounding text, or you'll have to use strings or
 something else.  Which means you'll need to have memory allocation /
 slower parsing / whatever, but that's the price of DTD support.
Which was my point. The API as-is doesn't work with DTD support for those very reasons.
 But again, the point is, basic XML parsing (without DTD support) doesn't
 *need* to pay this price. What's currently in dxml doesn't need to
 change. DTD support can be implemented in a submodule / separate module
 that wraps around dxml and builds DTD support on top of it.

 Put another way, we can implement DTD support *on top of* dxml this way:
 - Parse the XML using dxml as an initial step (this can be done lazily,
   or semi-lazily, as needed).
 - As an intermediate step, parse the DTD section, construct whatever
   internal state is needed to handle DTD rules, a dictionary of entity
   references, etc..
 - Filter the output of dxml to insert whatever extra behaviour is needed
   to implement DTD support before handing it to the calling code, e.g.,
   expand entity references, or implement validation and throw an
   exception if validation fails, etc..

 *We don't need to change dxml's current API at all.*
I don't think that this works, because the entity references insert new XML and thus affect the parsing. And as such, you can't simply pass through the entity references to be processed by another parser. They need to be handled by the core parser, otherwise it's going to give incorrect results, not just results that need further parsing. I'm sure that dxml's internals could be refactored so that they could be shared with another parser that did that, but unless I'm misunderstanding how entity references work, you can't use what's there now as-is and build another parser on top of it. The entity reference replacement needs to happen in the core parser.
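To make the disagreement concrete, here is a toy sketch in Python (not dxml, and not a real XML parser) of a tokenizer for a tiny XML subset, plus a filter that expands entity references on top of it. Everything in it is hypothetical: the `tokenize` and `expand` names, the token tuples, and the `note` entity. It only covers references in element content whose replacement text is balanced markup; references in attribute values or inside the DTD itself are not modelled, and there is no guard against self-referencing entities.

```python
import re

# One token per match: a tag, an entity reference, or a run of text.
TOKEN = re.compile(
    r"<(/?)([A-Za-z_][\w.-]*)\s*(/?)>"   # <a>, </a>, <a/> (no attributes)
    r"|&([A-Za-z_][\w.-]*);"             # &name;
    r"|([^<&]+)"                         # character data
)


def tokenize(xml):
    """Yield (kind, value) pairs; entity references are passed through as-is."""
    pos = 0
    while pos < len(xml):
        m = TOKEN.match(xml, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        closing, name, empty, ref, text = m.groups()
        if ref is not None:
            yield ("ref", ref)
        elif text is not None:
            yield ("text", text)
        elif closing:
            yield ("end", name)
        else:
            yield ("start", name)
            if empty:
                yield ("end", name)
        pos = m.end()


def expand(tokens, entities):
    """Replace known references with the tokens of their replacement text."""
    for kind, value in tokens:
        if kind == "ref" and value in entities:
            yield from expand(tokenize(entities[value]), entities)
        else:
            yield (kind, value)


doc = "<r>see &note;!</r>"
entities = {"note": "<b>this</b>"}  # hypothetical DTD-declared entity
raw = list(tokenize(doc))
full = list(expand(tokenize(doc), entities))
```

Here `raw` contains a single `("ref", "note")` token where `full` contains a start tag, text and an end tag for `b`, so the two layers report different element structure for the same input. Whether a layered filter like `expand` is enough for full DTD support is exactly the open question in this exchange.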
 The DTD wrapper doesn't guarantee (and doesn't need to!) to return
 slices of the input like dxml does. I don't see that as a problem, since
 I can't see how anyone would be able to implement full DTD support with
 only slices, even independently from the way dxml is implemented right
 now.
Yeah, if I were writing a parser that handled the DTD section, I wouldn't make it deal with slices of the input like dxml does unless I decided to make it always return strings, in which case, you could get slices of the original input for strings but no other range types - it's either that or using a lazy range, which would be worse if you passed strings but better for other range types. And that's the main reason that I gave up on having dxml handle the DTD section. I consider that approach unacceptable. One of the key goals for dxml was that it would be providing slices of the input and not lazy ranges or allocating new strings. In any case, unless I misunderstand how entity references work, that would have to be its own parser and not simply a wrapper around dxml because of how the entity references affect the parsing. If I'm wrong, then great, someone else can come along later and add some sort of DTD parser on top of dxml, and if I'm right, well, then anyone who wants to do anything like that is going to need to write a new parser, but that can then coexist alongside dxml's parser just fine. Either way, I like dxml's approach and don't want to compromise what it's doing in an attempt to fully deal with DTDs. - Jonathan M Davis
Feb 12 2018
prev sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, February 13, 2018 14:29:27 H. S. Teoh via Digitalmars-d-announce 
wrote:
 Given the insane complexities of DTD that I'm only slowly beginning to
 grasp from actually reading the spec, I'm quickly adopting the opinion
 that dxml should remain as-is, and any DTD implementation should be
 layered on top.  The only potential changes that might be needed are:

 - provide a way to parse XML snippets that don't have a <?xml ...>
   declaration, so that a DTD implementation could, for example, hand an
   entity body over to dxml to extract any tags that may be nested in
   there (and if my reading of section 4.3.2 is correct, all such tags
   must always be closed inside the entity body, so there should be no
   errors produced).
XML 1.0 does not require the <?xml...?> section - which is the main reason why dxml implements XML 1.0 and not 1.1. When working on one of my projects with std_experimental_xml, I had to keep adding the <?xml...?> declaration to the start of XML snippets in all of my tests which had to deal with sections of an XML document, and it was _really_ annoying. dxml does require that what it's given be a valid XML 1.0 document, which means that you have to have exactly one root element in what it's passed, which does limit which kinds of XML snippets you can pass it, but it will work for a lot of XML snippets as-is.
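The "exactly one root element" rule is a property of XML 1.0 documents in general, so it can be shown with any conforming parser. The sketch below uses Python's standard `xml.etree.ElementTree` rather than dxml; the `parses` and `parse_fragment` helpers and the `fragment` wrapper name are made up for the example.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if xml_text is accepted as a complete XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def parse_fragment(fragment, wrapper="fragment"):
    """Wrap a multi-root snippet in a synthetic root so it forms one document."""
    return ET.fromstring(f"<{wrapper}>{fragment}</{wrapper}>")


single_root_ok = parses("<a>hi</a>")   # no <?xml ...?> declaration needed
two_roots_ok = parses("<a/><b/>")      # rejected: more than one root element
children = [child.tag for child in parse_fragment("<a/><b/>")]
```

A snippet with one root and no declaration is accepted, while a snippet with two top-level elements only parses once it is wrapped in a synthetic root. That wrapping is a common workaround for a parser that insists on a complete document.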
 - provide some way of hooking into non-default entities so that
   DTD-defined entities can be expanded by the DTD implementation.  This
   could be as simple as leaving such entities untouched in the returned
   range, or invent a special EntityType representing such entities (with
   a slice of the input containing the entity name) so that the DTD
   implementation can insert the replacement text.
After having actually implemented full parsing for the entire DTD section before figuring out that references could be inserted in it just about anywhere and that the grammar in the spec is only the grammar _after_ all of the replacements were made (when I figured that out was when I gave up on DTD support), I would strongly argue in favor of simply passing along entity references as-is and leaving any and all such processing to a DTD-enabled parser. Originally, the Config had options like SkipDTD and SkipProlog, and I even provided a way to get at the information in the <?xml...?> declaration if you wanted it, but all of that just wasn't worth the extra complexity. - Jonathan M Davis
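As a rough illustration of "passing along entity references as-is", the Python sketch below decodes only what needs no DTD (the five predefined XML entities and numeric character references) and leaves every other reference untouched for a higher layer to deal with. `decode_known` is a hypothetical helper written for this example and makes no claim about how dxml implements anything.

```python
import re

# The five entities every XML processor knows without a DTD.
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "apos": "'", "quot": '"'}

# &#x41; (hex), &#65; (decimal), or &name;
REF = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z_][\w.-]*);")


def decode_known(text):
    """Decode predefined and numeric references; leave unknown ones as-is."""
    def repl(match):
        body = match.group(1)
        if body.startswith("#x"):
            return chr(int(body[2:], 16))
        if body.startswith("#"):
            return chr(int(body[1:]))
        # Unknown names fall through unchanged, e.g. DTD-declared entities.
        return PREDEFINED.get(body, match.group(0))
    return REF.sub(repl, text)


decoded = decode_known("a &lt; b &amp; &copyright; &#65;&#x42;")
```

The replacement is done in a single pass, so `&amp;lt;` becomes `&lt;` and is not decoded a second time. Out-of-range character references raise `ValueError` from `chr` and are not handled here.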
Feb 13 2018
prev sibling next sibling parent reply Johannes Loher <johannes.loher fg4f.de> writes:
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis 
wrote:
 dxml 0.2.0 has now been released.
 [...]
Thank you very much for your efforts, I really appreciate it, as I have been looking for a decent XML library for quite some time. Whether or not this is a candidate for inclusion into Phobos is certainly up for debate, but as you already mentioned several times, this thread is hardly the right place for that. So instead I'd like to emphasize how much I appreciate you working on this and I am sure I am not the only one. The absence of a usable, high-quality XML library is/was a big problem for D in my opinion and it is great to see that this is finally being worked on :)
Feb 12 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, February 12, 2018 21:26:45 Johannes Loher via Digitalmars-d-
announce wrote:
 On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis

 wrote:
 dxml 0.2.0 has now been released.
 [...]
Thank you very much for your efforts, I really appreciate it, as I have been looking for a decent XML library for quite some time. Whether or not this is a candidate for inclusion into Phobos is certainly up for debate, but as you already mentioned several times, this thread is hardly the right place for that. So instead I'd like to emphasize how much I appreciate you working on this and I am sure I am not the only one. The absence of a usable, high-quality XML library is/was a big problem for D in my opinion and it is great to see that this is finally being worked on :)
Thanks. When you do use it, please give feedback - particularly if you find any problems or pain points. I definitely think that the API is solid overall, but that doesn't mean that I got it completely right, and even with all of the tests that I have, I could have missed something and ended up with a bug in the parser. I'm reasonably confident in the code quality, but that doesn't mean that I didn't miss anything. - Jonathan M Davis
Feb 12 2018
prev sibling parent Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis 
wrote:
 dxml 0.2.0 has now been released.
 Documentation: http://jmdavisprog.com/docs/dxml/0.2.0/
 Github: https://github.com/jmdavis/dxml/tree/v0.2.0
 Dub: http://code.dlang.org/packages/dxml

 - Jonathan M Davis
This is absolutely awesome. It is a little low level (compared to SAX) so there is more to deal with, but having this provide a flat range makes the ordering of elements so much clearer. If I need to handle nesting then I can build that out, but if I don't I can just fly by the seat of my pants and grab the elements I want. This will definitely be my go-to for XML parsing.
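The "grab the elements I want from a flat stream" style can be sketched with Python's standard pull parser, `xml.etree.ElementTree.XMLPullParser`, as a stand-in for a flat range of parse events. This is an analogy and not dxml code; the `titles` function and the sample catalogue are invented for the example, and the depth counter shows that nesting can be tracked by hand only when it is needed.

```python
import xml.etree.ElementTree as ET


def titles(xml_text):
    """Collect (depth, text) for every <title>, reading a flat event stream."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed(xml_text)
    parser.close()
    depth = 0
    found = []
    for event, elem in parser.read_events():
        if event == "start":
            depth += 1
        else:  # "end": the element's text is complete at this point
            if elem.tag == "title":
                found.append((depth, elem.text))
            depth -= 1
    return found


catalogue = (
    "<library>"
    "<book><title>Emma</title></book>"
    "<book><title>Ulysses</title></book>"
    "</library>"
)
result = titles(catalogue)
```

Code that does not care about nesting can drop the `depth` bookkeeping entirely and just filter on the tag name.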
Feb 23 2018