digitalmars.D.announce - dxml 0.1.0 released


Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

I have multiple projects that need an XML parser, and std_experimental_xml
is clearly going nowhere, with the guy who wrote it having disappeared into
the ether, so I decided to break down and write one. I've kind of wanted to
for years, but I didn't want to spend the time on it. However, sometime last
year I finally decided that I had to, and it's been what I've been working
on in my free time for a while now. And it's finally reached the point when
it makes sense to release it - hence this post.

Currently, dxml contains only a range-based StAX / pull parser and related
helper functions, but the plan is to add a DOM parser as well as two writers
- one which is the writer equivalent of a StAX parser, and one which is
DOM-based. However, in theory, the StAX parser is complete and quite useable
as-is - though I expect that I'll be adding more helper functions to make it
easier to use, and if you find that you're doing a particular operation with
it frequently and that that operation is overly verbose, please point it out
so that maybe a helper function can be added to improve that use case - e.g.
I'm thinking of adding a function similar to std.getopt.getopt for handling
attributes, because I personally find that dealing with those is more
verbose than I'd like. Obviously, some stuff is just going to do better with
a DOM parser, but thus far, I've found that a StAX parser has suited my
needs quite well. I have no plans to add a SAX parser, since as far as I can
tell, SAX parsers are just plain worse than StAX parsers, and the StAX
approach is quite well-suited to ranges.
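
To illustrate the getopt style referred to above, here is a minimal sketch of
how std.getopt.getopt binds named options to variables. It only shows the
existing Phobos function; the Settings struct and parseArgs are hypothetical
names for this illustration, and no attribute helper with this shape is
claimed to exist in dxml.

```d
import std.getopt : getopt;

// Hypothetical settings struct, used only for this illustration.
struct Settings
{
    string name;
    int count;
}

// Binds each named option to a field through a pointer, getopt style.
Settings parseArgs(string[] args)
{
    Settings s;
    getopt(args, "name", &s.name, "count", &s.count);
    return s;
}

void main()
{
    auto s = parseArgs(["prog", "--name=dxml", "--count=3"]);
    assert(s.name == "dxml");
    assert(s.count == 3);
}
```

An attribute helper in the same spirit would presumably pair attribute names
with pointers to the variables that should receive their values.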

Of note, dxml does not support the DTD section beyond what is required to
parse past it, since supporting it would make it impossible for the parser
to return slices of the original input beyond the case where strings are
used (and it would be forced to allocate strings in some cases, whereas dxml
does _very_ minimal heap allocation right now), and parsing the DTD section
significantly increases the complexity of the parser in order to support
something that I honestly don't think should ever have been part of the XML
standard and is unnecessary for many, many XML documents. So, if you're
dealing with XML documents that contain entity references that are declared
in the DTD section and then used outside of the DTD section, then dxml will
not support them, but it will work just fine if a DTD section is there so
long as it doesn't declare any entity references that are then referenced in
the document proper.
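
The trade-off between slicing and expanding entity references can be sketched
in plain D. This is a toy illustration under stated assumptions, not dxml's
code: elementText is a hypothetical helper that only handles a single
well-formed element, and the replacement of a reference is done with Phobos'
std.array.replace.

```d
import std.array : replace;
import std.string : indexOf;

// Hypothetical toy helper: returns the text between the first '>' and the
// following "</" as a slice. Assumes well-formed input; not a real parser.
string elementText(string xml)
{
    immutable start = xml.indexOf('>') + 1;
    immutable end = xml.indexOf("</");
    return xml[start .. end];
}

void main()
{
    string xml = "<a>fish &amp; chips</a>";
    string text = elementText(xml);
    assert(text == "fish &amp; chips");
    // The slice points into the original buffer: no copy was made.
    assert(text.ptr == xml.ptr + 3);

    // Expanding a reference changes the text, so the result can no longer
    // be a slice of the input; a new string has to be allocated.
    string expanded = text.replace("&amp;", "&");
    assert(expanded == "fish & chips");
    assert(expanded.ptr !is text.ptr);
}
```

A parser that had to substitute entities declared in the DTD would be in the
second situation for every piece of text that used one.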

Hopefully, the documentation is clear enough, but obviously, I'm not the
best judge of that. So, have at it.

Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
Github: https://github.com/jmdavis/dxml
Dub: http://code.dlang.org/packages/dxml

- Jonathan M Davis

Feb 09 2018

Stefan <dl ng.rocks> writes:

great work, Jonathan. Thank you.
We were missing xml for a long time and did so many hacks just to 
get xml somehow parsed.

Feb 10 2018

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Saturday, February 10, 2018 10:27:42 Stefan via Digitalmars-d-announce 
wrote:
 great work, Jonathan. Thank you.
 We were missing xml for a long time and did so many hacks just to
 get xml somehow parsed.

LOL. Actually, one of the helper functions in std.datetime.timezone that has
to deal with xml does it via hacks, because the XML in question was fairly
simple, and I didn't want to deal with std.xml.

If dxml does end up going through the Phobos review process and eventually
ends up in Phobos, I'll have to change that code so that it uses dxml
instead of the hacks.

- Jonathan M Davis

Feb 10 2018

Seb <seb wilzba.ch> writes:

On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis 
wrote:
 I have multiple projects that need an XML parser, and 
 std_experimental_xml is clearly going nowhere, with the guy who 
 wrote it having disappeared into the ether, so I decided to 
 break down and write one. I've kind of wanted to for years, but 
 I didn't want to spend the time on it. However, sometime last 
 year I finally decided that I had to, and it's been what I've 
 been working on in my free time for a while now. And it's 
 finally reached the point when it makes sense to release it - 
 hence this post.

 [...]

FWIW we recently forked the experimental.xml repo to 
dlang-community:

https://github.com/dlang-community/experimental.xml

So PRs etc can be merged easily.
But yeah it's not moving anywhere atm :/

Feb 10 2018

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Saturday, February 10, 2018 12:04:48 Seb via Digitalmars-d-announce 
wrote:
 On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis wrote:
 I have multiple projects that need an XML parser, and
 std_experimental_xml is clearly going nowhere, with the guy who
 wrote it having disappeared into the ether, so I decided to
 break down and write one. I've kind of wanted to for years, but
 I didn't want to spend the time on it. However, sometime last
 year I finally decided that I had to, and it's been what I've
 been working on in my free time for a while now. And it's
 finally reached the point when it makes sense to release it -
 hence this post.

 [...]

 FWIW we recently forked the experimental.xml repo to
 dlang-community:

 https://github.com/dlang-community/experimental.xml

 So PRs etc can be merged easily.
 But yeah it's not moving anywhere atm :/

Yeah, I got some e-mails about that the other day, since I had some open
issues and PRs on it, and IIRC github was telling me that you'd migrated
some of that over, but unless someone decides that they want to take up the
torch on it, it seems pretty dead. I assume that the guy who did it simply
got too busy with school once GSoC ended and then never got back to it even
when he did have time. If he were serious about finishing it and being an
active part of the D community, he would have at least looked at some of the
PRs on the project, but he's been completely silent for quite a while now.
So, I guess he moved on. I was able to use it on one of my projects by
making some local changes and by working around some bugs, but it clearly
needs work that it's not getting.

I had some rather specific ideas about what I wanted to do with an XML
parser though and didn't want to spend the time trying to decipher what he'd
done and morph it into something more like what I wanted, so I just started
from scratch.

- Jonathan M Davis

Feb 10 2018

Jacob Carlborg <doob me.com> writes:

On 2018-02-09 22:15, Jonathan M Davis wrote:

 Currently, dxml contains only a range-based StAX / pull parser and related
 helper functions, but the plan is to add a DOM parser as well as two writers
 - one which is the writer equivalent of a StAX parser, and one which is
 DOM-based. However, in theory, the StAX parser is complete and quite useable
 as-is - though I expect that I'll be adding more helper functions to make it
 easier to use, and if you find that you're doing a particular operation with
 it frequently and that that operation is overly verbose, please point it out
 so that maybe a helper function can be added to improve that use case - e.g.

This is great news! Have you run any benchmarks to see how it performs?

-- 
/Jacob Carlborg

Feb 10 2018

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Saturday, February 10, 2018 16:14:41 Jacob Carlborg via Digitalmars-d-announce wrote:
 On 2018-02-09 22:15, Jonathan M Davis wrote:
 Currently, dxml contains only a range-based StAX / pull parser and
 related helper functions, but the plan is to add a DOM parser as well
 as two writers - one which is the writer equivalent of a StAX parser,
 and one which is DOM-based. However, in theory, the StAX parser is
 complete and quite useable as-is - though I expect that I'll be adding
 more helper functions to make it easier to use, and if you find that
 you're doing a particular operation with it frequently and that that
 operation is overly verbose, please point it out so that maybe a helper
 function can be added to improve that use case - e.g.

 This is great news! Have you run any benchmarks to see how it performs?

Kind of. I did some benchmarking to see if some code changes would improve
performance, but I haven't tried benchmarking it against any other XML
libraries. That would take a fair bit of time and effort, and IMHO, that
would be better spent finishing the library first. Also, ldc's latest
release is only up to dmd 2.077.1, and dxml needs an improvement that got
added to byCodeUnit in 2.078.0, so any benchmarking that wants to do
something like compare dxml with a C/C++ parsing library while taking the
optimizer out of the equation isn't going to work yet unless I fork
byCodeUnit for dxml until we get another release of ldc.

One result of the benchmarking that I did do allowed me to simplify the code
quite a bit though. I'd originally had it be configurable whether the parser
kept track of the line number and column of the document, just the line
number, or neither on the theory that I really wanted access to the position
in the document in error messages but that it would affect performance, so
it should be configurable. However, benchmarking showed that it had
negligible impact on performance to the point that different PositionTypes
won out depending on the file and the particular run of the program,
indicating that that extra complexity was buying me nothing. There were a
fair number of static ifs to deal with that configuration option, so as soon
as I was able to measure that they didn't matter particularly, I removed
that option from the Config and all of its associated static ifs in the
parser and was able to reduce the complexity of the code a fair bit. Testing
that bit was actually the main reason that I did any benchmarking before
releasing anything, since I wanted to avoid changing the API later if I
could.
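
For readers who have not seen this kind of compile-time configuration, a
minimal sketch of the pattern described above follows. The post names
PositionType but does not list its members, so the enum members and the
PositionTracker struct here are assumptions made for illustration, not the
code that was removed from dxml.

```d
// Hypothetical enum; the member names are assumed for this sketch.
enum PositionType { lineAndCol, line, none }

struct PositionTracker(PositionType pt)
{
    // Fields only exist when the configuration asks for them.
    static if (pt != PositionType.none)
        size_t line = 1;
    static if (pt == PositionType.lineAndCol)
        size_t col = 1;

    // Advances the tracked position past one code unit.
    void consume(char c)
    {
        static if (pt != PositionType.none)
        {
            if (c == '\n')
            {
                ++line;
                static if (pt == PositionType.lineAndCol)
                    col = 1;
            }
            else
            {
                static if (pt == PositionType.lineAndCol)
                    ++col;
            }
        }
    }
}

void main()
{
    PositionTracker!(PositionType.lineAndCol) full;
    PositionTracker!(PositionType.none) off;
    foreach (char c; "ab\ncd")
    {
        full.consume(c);
        off.consume(c);
    }
    assert(full.line == 2 && full.col == 3);
    // With tracking disabled, the field is not even declared.
    static assert(!__traits(hasMember, typeof(off), "line"));
}
```

Every static if is another branch to read and test, which is the complexity
that stops being worthwhile once measurements show no speed difference.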

I am going to need to spend more time benchmarking code changes at some
point here though to see if I can make the parser faster, and eventually, I
will probably benchmark it against other parsing libraries. I fully expect
that it will compare favorably given that it does almost no heap allocations
and slices everything, but there's every possibility that I did something
algorithmically internally that hurts performance more than it should - e.g.
while it tries to parse everything only once, there are a few places where
it ends up taking a second pass over a piece of text, and refactoring that
is on my todo list (though most of the other potential improvements I did
benchmark were a wash, so I may find that it doesn't matter much).

I'll probably be in more of a hurry to benchmark dxml against other parsing
libraries if my dconf talk proposal on it gets accepted, since that's the
sort of thing that should probably be in such a talk.

I haven't even taken the time yet to figure out which libraries it should be
benchmarked against.

- Jonathan M Davis

Feb 10 2018

Joakim <dlang joakim.fea.st> writes:

On Saturday, 10 February 2018 at 18:57:53 UTC, Jonathan M Davis 
wrote:
 On Saturday, February 10, 2018 16:14:41 Jacob Carlborg via Digitalmars-d-announce wrote:
 On 2018-02-09 22:15, Jonathan M Davis wrote:
 [...]

 This is great news! Have you run any benchmarks to see how it 
 performs?

 Kind of. I did some benchmarking to see if some code changes 
 would improve performance, but I haven't tried benchmarking it 
 against any other XML libraries. That would take a fair bit of 
 time and effort, and IMHO, that would be better spent finishing 
 the library first. Also, ldc's latest release is only up to dmd 
 2.077.1, and dxml needs an improvement that got added to 
 byCodeUnit in 2.078.0, so any benchmarking that wants to do 
 something like compare dxml with a C/C++ parsing library while 
 taking the optimizer out of the equation isn't going to work 
 yet unless I fork byCodeUnit for dxml until we get another 
 release of ldc.

ldc master uses the latest 2.078.2 frontend and stdlib, you could 
always build it yourself:

https://github.com/ldc-developers/ldc/blob/master/CMakeLists.txt#L54
https://wiki.dlang.org/Building_LDC_from_source

Feb 10 2018

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Saturday, February 10, 2018 21:10:28 Joakim via Digitalmars-d-announce 
wrote:
 On Saturday, 10 February 2018 at 18:57:53 UTC, Jonathan M Davis wrote:
 On Saturday, February 10, 2018 16:14:41 Jacob Carlborg via Digitalmars-d-announce wrote:
 On 2018-02-09 22:15, Jonathan M Davis wrote:
 [...]

 This is great news! Have you run any benchmarks to see how it
 performs?

 Kind of. I did some benchmarking to see if some code changes
 would improve performance, but I haven't tried benchmarking it
 against any other XML libraries. That would take a fair bit of
 time and effort, and IMHO, that would be better spent finishing
 the library first. Also, ldc's latest release is only up to dmd
 2.077.1, and dxml needs an improvement that got added to
 byCodeUnit in 2.078.0, so any benchmarking that wants to do
 something like compare dxml with a C/C++ parsing library while
 taking the optimizer out of the equation isn't going to work
 yet unless I fork byCodeUnit for dxml until we get another
 release of ldc.

 ldc master uses the latest 2.078.2 frontend and stdlib, you could
 always build it yourself:

 https://github.com/ldc-developers/ldc/blob/master/CMakeLists.txt#L54
 https://wiki.dlang.org/Building_LDC_from_source

That's good to know. Thanks.

If I get to the point where I want to do more benchmarking before ldc does
another release, I'll build it myself, though depending on when I reach that
point and when ldc plans to do another release, it may or may not end up
being necessary.

- Jonathan M Davis

Feb 10 2018

Jacob Carlborg <doob me.com> writes:

On 2018-02-10 19:57, Jonathan M Davis wrote:

 Kind of. I did some benchmarking to see if some code changes would improve
 performance, but I haven't tried benchmarking it against any other XML
 libraries.

Ok, I see.

 That would take a fair bit of time and effort, and IMHO, that
 would be better spent finishing the library first.

Fair enough.

-- 
/Jacob Carlborg

Feb 11 2018

bauss <jj_1337 live.dk> writes:

On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis 
wrote:
 I have multiple projects that need an XML parser, and 
 std_experimental_xml is clearly going nowhere, with the guy who 
 wrote it having disappeared into the ether, so I decided to 
 break down and write one. I've kind of wanted to for years, but 
 I didn't want to spend the time on it. However, sometime last 
 year I finally decided that I had to, and it's been what I've 
 been working on in my free time for a while now. And it's 
 finally reached the point when it makes sense to release it - 
 hence this post.

 Currently, dxml contains only a range-based StAX / pull parser 
 and related helper functions, but the plan is to add a DOM 
 parser as well as two writers - one which is the writer 
 equivalent of a StAX parser, and one which is DOM-based. 
 However, in theory, the StAX parser is complete and quite 
 useable as-is - though I expect that I'll be adding more helper 
 functions to make it easier to use, and if you find that you're 
 doing a particular operation with it frequently and that that 
 operation is overly verbose, please point it out so that maybe 
 a helper function can be added to improve that use case - e.g. 
 I'm thinking of adding a function similar to std.getopt.getopt 
 for handling attributes, because I personally find that dealing 
 with those is more verbose than I'd like. Obviously, some stuff 
 is just going to do better with a DOM parser, but thus far, 
 I've found that a StAX parser has suited my needs quite well. I 
 have no plans to add a SAX parser, since as far as I can tell, 
 SAX parsers are just plain worse than StAX parsers, and the 
 StAX approach is quite well-suited to ranges.

 Of note, dxml does not support the DTD section beyond what is 
 required to parse past it, since supporting it would make it 
 impossible for the parser to return slices of the original 
 input beyond the case where strings are used (and it would be 
 forced to allocate strings in some cases, whereas dxml does 
 _very_ minimal heap allocation right now), and parsing the DTD 
 section significantly increases the complexity of the parser in 
 order to support something that I honestly don't think should 
 ever have been part of the XML standard and is unnecessary for 
 many, many XML documents. So, if you're dealing with XML 
 documents that contain entity references that are declared in 
 the DTD section and then used outside of the DTD section, then 
 dxml will not support them, but it will work just fine if a DTD 
 section is there so long as it doesn't declare any entity 
 references that are then referenced in the document proper.

 Hopefully, the documentation is clear enough, but obviously, 
 I'm not the best judge of that. So, have at it.

 Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
 Github: https://github.com/jmdavis/dxml
 Dub: http://code.dlang.org/packages/dxml

 - Jonathan M Davis

This is going to be really useful for people like me who work 
with web services using SOAP.

Thanks for the great work.

Feb 10 2018

Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:

On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis 
wrote:

 Hopefully, the documentation is clear enough, but obviously, 
 I'm not the best judge of that. So, have at it.

 Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
 Github: https://github.com/jmdavis/dxml
 Dub: http://code.dlang.org/packages/dxml

 - Jonathan M Davis

This looks so nice.

I can understand the concerns of the DTD, and it doesn't look 
like you needed to do anything special for namespaces with this 
parser.

Feb 10 2018

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Saturday, February 10, 2018 19:53:48 Jesse Phillips via Digitalmars-d-announce wrote:
 On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis wrote:
 Hopefully, the documentation is clear enough, but obviously,
 I'm not the best judge of that. So, have at it.

 Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
 Github: https://github.com/jmdavis/dxml
 Dub: http://code.dlang.org/packages/dxml

 - Jonathan M Davis

 This looks so nice.

 I can understand the concerns of the DTD, and it doesn't look
 like you needed to do anything special for namespaces with this
 parser.

I confess that I haven't looked into namespaces in detail, but from what I
understand about them, I don't see any reason to do anything beyond treating
them as part of the name. If the application wants to do something special
with them, then it's free to do so. Key goals of this parser were to make it
fast and simple to use for the typical use case. As much as possible, I'd
like to keep the complicated stuff out of it.
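
As a rough sketch of what "free to do so" could look like, an application
that receives a qualified name such as "soap:Envelope" as one string can split
it with Phobos alone. QName and splitQName are hypothetical names for this
illustration and are not part of dxml.

```d
import std.algorithm.searching : findSplit;

// Hypothetical pair of prefix and local name, for illustration only.
struct QName
{
    string prefix;
    string local;
}

// Splits "prefix:local" at the first ':'; a name without ':' has no prefix.
QName splitQName(string name)
{
    auto parts = name.findSplit(":");
    // parts[1] is the matched separator and is empty when no ':' was found.
    return parts[1].length ? QName(parts[0], parts[2]) : QName(null, name);
}

void main()
{
    auto q = splitQName("soap:Envelope");
    assert(q.prefix == "soap");
    assert(q.local == "Envelope");
    assert(splitQName("root").local == "root");
}
```

Both halves are slices of the original name, so this does not allocate either.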

Personally, I see XML only as data just like JSON is only data, and I think
that the complications in the XML spec come from trying to treat it as more
than that.

I had originally intended to provide at least minimal DTD support but leave
most of it to some kind of helper functionality (e.g. have a helper function
which took the DTD data and then validated the rest of the XML using it).
However, as I got farther along, it became clear that that wasn't going to
work without giving up on being able to just slice the input, and I wasn't
willing to give up on that, especially when I don't see handling the DTD as
valuable for anything but dealing with overly complicated XML that is
outside of the programmer's control or to simply be able to say that I
completely implemented the XML spec.

Slicing is part of why parsers written in D should tend to be inherently
fast in comparison to those written in languages like C++, and I want to
take advantage of that. In principle, something like an XML parser should be
able to be a showcase for why D is great. Tango's was, but Phobos' hasn't
been, and I'd like for dxml to be able to be that regardless of whether it
eventually replaces std.xml or not.

- Jonathan M Davis

Feb 10 2018

Cym13 <cpicard openmailbox.org> writes:

On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis 
wrote:
 [...]
 Of note, dxml does not support the DTD section beyond what is 
 required to parse past it
 [...]
 - Jonathan M Davis

Fun fact, since the most common security vulnerability associated 
with XML (XXE [1]) is based on exploiting the fact that most 
libraries parse in-line DTDs by default, this makes dxml immune 
to such attacks. Given how often this vulnerability is found in 
the wild it sounds like a very good thing to me :D

[1]: 
https://www.owasp.org/index.php/XML_External_Entity_(XXE)_Processing

Feb 11 2018
