digitalmars.D.announce - dxml 0.1.0 released

reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
I have multiple projects that need an XML parser, and std_experimental_xml
is clearly going nowhere, with the guy who wrote it having disappeared into
the ether, so I decided to break down and write one. I've kind of wanted to
for years, but I didn't want to spend the time on it. However, sometime last
year I finally decided that I had to, and it's been what I've been working
on in my free time for a while now. And it's finally reached the point when
it makes sense to release it - hence this post.

Currently, dxml contains only a range-based StAX / pull parser and related
helper functions, but the plan is to add a DOM parser as well as two writers
- one which is the writer equivalent of a StAX parser, and one which is
DOM-based. However, in theory, the StAX parser is complete and quite useable
as-is - though I expect that I'll be adding more helper functions to make it
easier to use, and if you find that you're doing a particular operation with
it frequently and that that operation is overly verbose, please point it out
so that maybe a helper function can be added to improve that use case - e.g.
I'm thinking of adding a function similar to std.getopt.getopt for handling
attributes, because I personally find that dealing with those is more
verbose than I'd like. Obviously, some stuff is just going to do better with
a DOM parser, but thus far, I've found that a StAX parser has suited my
needs quite well. I have no plans to add a SAX parser, since as far as I can
tell, SAX parsers are just plain worse than StAX parsers, and the StAX
approach is quite well-suited to ranges.
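
For readers who have not used a pull parser, here is a minimal sketch of what the range-based style looks like. It is an illustration rather than code from the post: it assumes dxml has been added as a dub dependency, and it uses `parseXML`, `EntityType`, and the `name`, `text`, and `attributes` properties as described in the dxml documentation. The sample document and the `elementNames` helper are made up for the example.

```d
// Sketch only: assumes dxml is available as a dub dependency.
import dxml.parser : EntityType, parseXML;
import std.stdio : writefln;

/// Hypothetical helper: collects the names of all start and empty
/// element tags, in document order.
string[] elementNames(string xml)
{
    string[] names;
    foreach (entity; parseXML(xml))
    {
        if (entity.type == EntityType.elementStart ||
            entity.type == EntityType.elementEmpty)
        {
            names ~= entity.name;
        }
    }
    return names;
}

void main()
{
    auto xml = "<root>\n" ~
               "    <foo id=\"42\">some text</foo>\n" ~
               "    <bar/>\n" ~
               "</root>";

    // The parser is a range of entities; the caller pulls one at a time.
    foreach (entity; parseXML(xml))
    {
        switch (entity.type)
        {
            case EntityType.elementStart:
            case EntityType.elementEmpty:
                writefln("element: %s", entity.name);
                // Attributes are themselves a range of name/value pairs.
                foreach (attr; entity.attributes)
                    writefln("  attribute: %s = %s", attr.name, attr.value);
                break;
            case EntityType.text:
                writefln("text: %s", entity.text);
                break;
            default:
                break; // end tags, comments, etc. are ignored here
        }
    }

    assert(elementNames(xml) == ["root", "foo", "bar"]);
}
```

Because the result is an ordinary range, it can be walked with `foreach` or advanced manually with `front` and `popFront`, which is what makes the StAX model fit D's range idiom.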

Of note, dxml does not support the DTD section beyond what is required to
parse past it, since supporting it would make it impossible for the parser
to return slices of the original input beyond the case where strings are
used (and it would be forced to allocate strings in some cases, whereas dxml
does _very_ minimal heap allocation right now), and parsing the DTD section
significantly increases the complexity of the parser in order to support
something that I honestly don't think should ever have been part of the XML
standard and is unnecessary for many, many XML documents. So, if you're
dealing with XML documents that contain entity references that are declared
in the DTD section and then used outside of the DTD section, then dxml will
not support them, but it will work just fine if a DTD section is there so
long as it doesn't declare any entity references that are then referenced in
the document proper.
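
The slicing argument can be illustrated with plain D and Phobos, independent of dxml. The sketch below is not dxml's implementation: `textOf` is a hypothetical helper and `&who;` is a made-up entity. It only shows that a slice of the input shares the input's memory, whereas expanding an entity reference has to produce a new string.

```d
import std.array : replace;
import std.string : indexOf;

/// Hypothetical helper: returns the text between the first '>' and the
/// next '<' as a slice of the input (no copy is made).
string textOf(string xml)
{
    size_t start = xml.indexOf('>') + 1;
    size_t end = xml.indexOf('<', start);
    return xml[start .. end];
}

void main()
{
    string xml = "<greeting>hello &who;</greeting>";

    // Without entity expansion, the text is just a window into `xml`.
    string raw = textOf(xml);
    assert(raw == "hello &who;");
    assert(raw.ptr == xml.ptr + "<greeting>".length);

    // Expanding a DTD-declared entity changes the text, so the result
    // can no longer be a slice of the original input.
    string expanded = raw.replace("&who;", "world");
    assert(expanded == "hello world");
    assert(expanded.ptr != raw.ptr);
}
```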

Hopefully, the documentation is clear enough, but obviously, I'm not the
best judge of that. So, have at it.

Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
Github: https://github.com/jmdavis/dxml
Dub: http://code.dlang.org/packages/dxml

- Jonathan M Davis
Feb 09 2018
next sibling parent reply Stefan <dl ng.rocks> writes:
great work, Jonathan. Thank you.
We were missing xml for a long time and did so many hacks just to 
get xml somehow parsed.
Feb 10 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, February 10, 2018 10:27:42 Stefan via Digitalmars-d-announce 
wrote:
 great work, Jonathan. Thank you.
 We were missing xml for a long time and did so many hacks just to
 get xml somehow parsed.
LOL. Actually, one of the helper functions in std.datetime.timezone that has to deal with XML does it via hacks, because the XML in question was fairly simple, and I didn't want to deal with std.xml. If dxml does end up going through the Phobos review process and eventually ends up in Phobos, I'll have to change that code so that it uses dxml instead of the hacks.

- Jonathan M Davis
Feb 10 2018
prev sibling next sibling parent reply Seb <seb wilzba.ch> writes:
On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis 
wrote:
 I have multiple projects that need an XML parser, and 
 std_experimental_xml is clearly going nowhere, with the guy who 
 wrote it having disappeared into the ether, so I decided to 
 break down and write one. I've kind of wanted to for years, but 
 I didn't want to spend the time on it. However, sometime last 
 year I finally decided that I had to, and it's been what I've 
 been working on in my free time for a while now. And it's 
 finally reached the point when it makes sense to release it - 
 hence this post.

 [...]
FWIW we recently forked the experimental.xml repo to dlang-community:

https://github.com/dlang-community/experimental.xml

So PRs etc can be merged easily. But yeah it's not moving anywhere atm :/
Feb 10 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, February 10, 2018 12:04:48 Seb via Digitalmars-d-announce 
wrote:
 On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis

 wrote:
 I have multiple projects that need an XML parser, and
 std_experimental_xml is clearly going nowhere, with the guy who
 wrote it having disappeared into the ether, so I decided to
 break down and write one. I've kind of wanted to for years, but
 I didn't want to spend the time on it. However, sometime last
 year I finally decided that I had to, and it's been what I've
 been working on in my free time for a while now. And it's
 finally reached the point when it makes sense to release it -
 hence this post.

 [...]
FWIW we recently forked the experimental.xml repo to dlang-community: https://github.com/dlang-community/experimental.xml So PRs etc can be merged easily. But yeah it's not moving anywhere atm :/
Yeah, I got some e-mails about that the other day, since I had some open issues and PRs on it, and IIRC github was telling me that you'd migrated some of that over, but unless someone decides that they want to take up the torch on it, it seems pretty dead. I assume that the guy who did it simply got too busy with school once GSoC ended and then never got back to it even when he did have time. If he were serious about finishing it and being an active part of the D community, he would have at least looked at some of the PRs on the project, but he's been completely silent for quite a while now. So, I guess he moved on.

I was able to use it on one of my projects by making some local changes and by working around some bugs, but it clearly needs work that it's not getting. I had some rather specific ideas about what I wanted to do with an XML parser though and didn't want to spend the time trying to decipher what he'd done and morph it into something more like what I wanted, so I just started from scratch.

- Jonathan M Davis
Feb 10 2018
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2018-02-09 22:15, Jonathan M Davis wrote:

 Currently, dxml contains only a range-based StAX / pull parser and related
 helper functions, but the plan is to add a DOM parser as well as two writers
 - one which is the writer equivalent of a StAX parser, and one which is
 DOM-based. However, in theory, the StAX parser is complete and quite useable
 as-is - though I expect that I'll be adding more helper functions to make it
 easier to use, and if you find that you're doing a particular operation with
 it frequently and that that operation is overly verbose, please point it out
 so that maybe a helper function can be added to improve that use case - e.g.
This is great news! Have you run any benchmarks to see how it performs?

-- 
/Jacob Carlborg
Feb 10 2018
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, February 10, 2018 16:14:41 Jacob Carlborg via
Digitalmars-d-announce wrote:
 On 2018-02-09 22:15, Jonathan M Davis wrote:
 Currently, dxml contains only a range-based StAX / pull parser and
 related helper functions, but the plan is to add a DOM parser as well
 as two writers - one which is the writer equivalent of a StAX parser,
 and one which is DOM-based. However, in theory, the StAX parser is
 complete and quite useable as-is - though I expect that I'll be adding
 more helper functions to make it easier to use, and if you find that
 you're doing a particular operation with it frequently and that that
 operation is overly verbose, please point it out so that maybe a helper
 function can be added to improve that use case - e.g.
This is great news! Have you run any benchmarks to see how it performs?
Kind of. I did some benchmarking to see if some code changes would improve performance, but I haven't tried benchmarking it against any other XML libraries. That would take a fair bit of time and effort, and IMHO, that would be better spent finishing the library first.

Also, ldc's latest release is only up to dmd 2.077.1, and dxml needs an improvement that got added to byCodeUnit in 2.078.0, so any benchmarking that wants to do something like compare dxml with a C/C++ parsing library while taking the optimizer out of the equation isn't going to work yet unless I fork byCodeUnit for dxml until we get another release of ldc.

One result of the benchmarking that I did do allowed me to simplify the code quite a bit though. I'd originally had it be configurable whether the parser kept track of the line number and column of the document, just the line number, or neither on the theory that I really wanted access to the position in the document in error messages but that it would affect performance, so it should be configurable. However, benchmarking showed that it had negligible impact on performance to the point that different PositionTypes won out depending on the file and the particular run of the program, indicating that that extra complexity was buying me nothing. There were a fair number of static ifs to deal with that configuration option, so as soon as I was able to measure that they didn't matter particularly, I removed that option from the Config and all of its associated static ifs in the parser and was able to reduce the complexity of the code a fair bit. Testing that bit was actually the main reason that I did any benchmarking before releasing anything, since I wanted to avoid changing the API later if I could.

I am going to need to spend more time benchmarking code changes at some point here though to see if I can make the parser faster, and eventually, I will probably benchmark it against other parsing libraries.

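
To make the removed configuration option more concrete, the following is a rough sketch of the kind of compile-time switch being described. It is not dxml's former code: only the name `PositionType` appears in the post, while the member names and the `Tracker` struct are assumptions made for illustration.

```d
// Hypothetical reconstruction of a compile-time position-tracking option.
enum PositionType { lineAndCol, line, none }

struct Tracker(PositionType pt)
{
    // Fields exist only when the chosen configuration needs them.
    static if (pt != PositionType.none)
        int line = 1;
    static if (pt == PositionType.lineAndCol)
        int col = 1;

    /// Updates the position after consuming one code unit.
    void advance(char c)
    {
        static if (pt == PositionType.lineAndCol)
        {
            if (c == '\n') { ++line; col = 1; }
            else ++col;
        }
        else static if (pt == PositionType.line)
        {
            if (c == '\n') ++line;
        }
        // PositionType.none compiles to an empty function body.
    }
}

void main()
{
    Tracker!(PositionType.lineAndCol) full;
    Tracker!(PositionType.none) off;

    foreach (char c; "ab\ncd")
    {
        full.advance(c);
        off.advance(c);
    }

    assert(full.line == 2 && full.col == 3);
    // The "none" variant carries no position state at all.
    static assert(!__traits(hasMember, typeof(off), "line"));
}
```

Every place that touches the position needs a branch like the ones in `advance`, which is the extra complexity that was dropped once measurements showed no consistent difference between the variants.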
I fully expect that it will compare favorably given that it does almost no heap allocations and slices everything, but there's every possibility that I did something algorithmically internally that hurts performance more than it should - e.g. while it tries to parse everything only once, there are a few places where it ends up taking a second pass over a piece of text, and refactoring that is on my todo list (though most of the other potential improvements I did benchmark were a wash, so I may find that it doesn't matter much).

I'll probably be in more of a hurry to benchmark dxml against other parsing libraries if my dconf talk proposal on it gets accepted, since that's the sort of thing that should probably be in such a talk. I haven't even taken the time yet to figure out which libraries it should be benchmarked against.

- Jonathan M Davis
Feb 10 2018
next sibling parent reply Joakim <dlang joakim.fea.st> writes:
On Saturday, 10 February 2018 at 18:57:53 UTC, Jonathan M Davis 
wrote:
 On Saturday, February 10, 2018 16:14:41 Jacob Carlborg via 
 Digitalmars-d-announce wrote:
 On 2018-02-09 22:15, Jonathan M Davis wrote:
 [...]
This is great news! Have you run any benchmarks to see how it performs?
Kind of. I did some benchmarking to see if some code changes would improve performance, but I haven't tried benchmarking it against any other XML libraries. That would take a fair bit of time and effort, and IMHO, that would be better spent finishing the library first. Also, ldc's latest release is only up to dmd 2.077.1, and dxml needs an improvement that got added to byCodeUnit in 2.078.0, so any benchmarking that wants to do something like compare dxml with a C/C++ parsing library while taking the optimizer out of the equation isn't going to work yet unless I fork byCodeUnit for dxml until we get another release of ldc.
ldc master uses the latest 2.078.2 frontend and stdlib, you could always build it yourself:

https://github.com/ldc-developers/ldc/blob/master/CMakeLists.txt#L54
https://wiki.dlang.org/Building_LDC_from_source
Feb 10 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, February 10, 2018 21:10:28 Joakim via Digitalmars-d-announce 
wrote:
 On Saturday, 10 February 2018 at 18:57:53 UTC, Jonathan M Davis

 wrote:
 On Saturday, February 10, 2018 16:14:41 Jacob Carlborg via

 Digitalmars-d-announce wrote:
 On 2018-02-09 22:15, Jonathan M Davis wrote:
 [...]
This is great news! Have you run any benchmarks to see how it performs?
Kind of. I did some benchmarking to see if some code changes would improve performance, but I haven't tried benchmarking it against any other XML libraries. That would take a fair bit of time and effort, and IMHO, that would be better spent finishing the library first. Also, ldc's latest release is only up to dmd 2.077.1, and dxml needs an improvement that got added to byCodeUnit in 2.078.0, so any benchmarking that wants to do something like compare dxml with a C/C++ parsing library while taking the optimizer out of the equation isn't going to work yet unless I fork byCodeUnit for dxml until we get another release of ldc.
ldc master uses the latest 2.078.2 frontend and stdlib, you could always build it yourself: https://github.com/ldc-developers/ldc/blob/master/CMakeLists.txt#L54 https://wiki.dlang.org/Building_LDC_from_source
That's good to know. Thanks. If I get to the point where I want to do more benchmarking before ldc does another release, I'll build it myself, though depending on when I reach that point and when ldc plans to do another release, it may or may not end up being necessary. - Jonathan M Davis
Feb 10 2018
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2018-02-10 19:57, Jonathan M Davis wrote:

 Kind of. I did some benchmarking to see if some code changes would improve
 performance, but I haven't tried benchmarking it against any other XML
 libraries.
Ok, I see.
 That would take a fair bit of time and effort, and IMHO, that
 would be better spent finishing the library first.
Fair enough.

-- 
/Jacob Carlborg
Feb 11 2018
prev sibling next sibling parent bauss <jj_1337 live.dk> writes:
On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis 
wrote:
 I have multiple projects that need an XML parser, and 
 std_experimental_xml is clearly going nowhere, with the guy who 
 wrote it having disappeared into the ether, so I decided to 
 break down and write one. I've kind of wanted to for years, but 
 I didn't want to spend the time on it. However, sometime last 
 year I finally decided that I had to, and it's been what I've 
 been working on in my free time for a while now. And it's 
 finally reached the point when it makes sense to release it - 
 hence this post.

 Currently, dxml contains only a range-based StAX / pull parser 
 and related helper functions, but the plan is to add a DOM 
 parser as well as two writers - one which is the writer 
 equivalent of a StAX parser, and one which is DOM-based. 
 However, in theory, the StAX parser is complete and quite 
 useable as-is - though I expect that I'll be adding more helper 
 functions to make it easier to use, and if you find that you're 
 doing a particular operation with it frequently and that that 
 operation is overly verbose, please point it out so that maybe 
 a helper function can be added to improve that use case - e.g. 
 I'm thinking of adding a function similar to std.getopt.getopt 
 for handling attributes, because I personally find that dealing 
 with those is more verbose than I'd like. Obviously, some stuff 
 is just going to do better with a DOM parser, but thus far, 
 I've found that a StAX parser has suited my needs quite well. I 
 have no plans to add a SAX parser, since as far as I can tell, 
 SAX parsers are just plain worse than StAX parsers, and the 
 StAX approach is quite well-suited to ranges.

 Of note, dxml does not support the DTD section beyond what is 
 required to parse past it, since supporting it would make it 
 impossible for the parser to return slices of the original 
 input beyond the case where strings are used (and it would be 
 forced to allocate strings in some cases, whereas dxml does 
 _very_ minimal heap allocation right now), and parsing the DTD 
 section significantly increases the complexity of the parser in 
 order to support something that I honestly don't think should 
 ever have been part of the XML standard and is unnecessary for 
 many, many XML documents. So, if you're dealing with XML 
 documents that contain entity references that are declared in 
 the DTD section and then used outside of the DTD section, then 
 dxml will not support them, but it will work just fine if a DTD 
 section is there so long as it doesn't declare any entity 
 references that are then referenced in the document proper.

 Hopefully, the documentation is clear enough, but obviously, 
 I'm not the best judge of that. So, have at it.

 Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
 Github: https://github.com/jmdavis/dxml
 Dub: http://code.dlang.org/packages/dxml

 - Jonathan M Davis
This is going to be really useful for people like me who work with web services using SOAP. Thanks for the great work.
Feb 10 2018
prev sibling next sibling parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis 
wrote:

 Hopefully, the documentation is clear enough, but obviously, 
 I'm not the best judge of that. So, have at it.

 Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
 Github: https://github.com/jmdavis/dxml
 Dub: http://code.dlang.org/packages/dxml

 - Jonathan M Davis
This looks so nice. I can understand the concerns of the DTD, and it doesn't look like you needed to do anything special for namespaces with this parser.
Feb 10 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, February 10, 2018 19:53:48 Jesse Phillips via
Digitalmars-d-announce wrote:
 On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis

 wrote:
 Hopefully, the documentation is clear enough, but obviously,
 I'm not the best judge of that. So, have at it.

 Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
 Github: https://github.com/jmdavis/dxml
 Dub: http://code.dlang.org/packages/dxml

 - Jonathan M Davis
This looks so nice. I can understand the concerns of the DTD, and it doesn't look like you needed to do anything special for namespaces with this parser.
I confess that I haven't looked into namespaces in detail, but from what I understand about them, I don't see any reason to do anything beyond treating them as part of the name. If the application wants to do something special with them, then it's free to do so.

Key goals of this parser were to make it fast and simple to use for the typical use case. As much as possible, I'd like to keep the complicated stuff out of it. Personally, I see XML only as data just like JSON is only data, and I think that the complications in the XML spec come from trying to treat it as more than that.

I had originally intended to provide at least minimal DTD support but leave most of it to some kind of helper functionality (e.g. have a helper function which took the DTD data and then validated the rest of the XML using it). However, as I got farther along, it became clear that that wasn't going to work without giving up on being able to just slice the input, and I wasn't willing to give up on that, especially when I don't see handling the DTD as valuable for anything but dealing with overly complicated XML that is outside of the programmer's control or to simply be able to say that I completely implemented the XML spec.

Slicing is part of why parsers written in D should tend to be inherently fast in comparison to those written in languages like C++, and I want to take advantage of that. In principle, something like an XML parser should be able to be a showcase for why D is great. Tango's was, but Phobos' hasn't been, and I'd like for dxml to be able to be that regardless of whether it eventually replaces std.xml or not.

- Jonathan M Davis
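
As a small illustration of leaving namespaces to the application, the sketch below splits a qualified name into prefix and local part using only Phobos. `QName` and `splitName` are hypothetical names invented for this example and are not part of dxml; the point is merely that the split can be done on the name the parser hands back, and that the results are still slices of it.

```d
import std.algorithm.searching : findSplit;

/// Hypothetical pair of slices: namespace prefix and local name.
struct QName
{
    string prefix;
    string local;
}

/// Splits "prefix:local" at the first ':'; names without a ':' get an
/// empty prefix. Both parts are slices of `name`, so nothing is copied.
QName splitName(string name)
{
    auto parts = name.findSplit(":");
    // parts[1] is the matched separator; it is empty if ':' was not found.
    return parts[1].length ? QName(parts[0], parts[2]) : QName("", name);
}

void main()
{
    assert(splitName("soap:Envelope") == QName("soap", "Envelope"));
    assert(splitName("root") == QName("", "root"));
}
```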
Feb 10 2018
prev sibling parent Cym13 <cpicard openmailbox.org> writes:
On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis 
wrote:
 [...]
 Of note, dxml does not support the DTD section beyond what is 
 required to parse past it
 [...]
 - Jonathan M Davis
Fun fact, since the most common security vulnerability associated with XML (XXE [1]) is based on exploiting the fact that most libraries parse in-line DTDs by default, this makes dxml immune to such attacks. Given how often this vulnerability is found in the wild it sounds like a very good thing to me :D

[1]: https://www.owasp.org/index.php/XML_External_Entity_(XXE)_Processing
Feb 11 2018