www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Replacing std.xml

reply "w0rp" <devw0rp gmail.com> writes:
Hello everybody. I've been wondering, what are the current plans 
to replace std.xml? I'd like to help with the effort to get a 
final XML library in phobos. So, I have a few questions.

First, and most importantly, what do we except out of a D XML 
library? I'd really like to have a discussion of the form, "Here 
is exactly the interface the structs/classes need to implement, 
go forth and implement." The general idea in my mind is 
"something SAX-like, with something a little DOM-like." I'm aware 
that std.xml has some issues support different encodings, so 
obvious that's included.

Second, is there an existing library that has gotten close to 
meeting whatever we need for the first point? If so, how far away 
is it from being able to meet all of the requirements and become 
the standard library version?
Aug 29 2013
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 29, 2013 09:25:35 w0rp wrote:
 Hello everybody. I've been wondering, what are the current plans
 to replace std.xml? I'd like to help with the effort to get a
 final XML library in phobos. So, I have a few questions.

Someone needs to step forward, write it, and get it through the review process. A while back, someone was working on a possible new version of std.xml, but they disappeared. No one has stepped up since. I'd love to do it if I had time, but I don't. There are probably several others around here in the same boat, but until someone who has the time and skill does do it, we won't have a new std.xml.
 First, and most importantly, what do we except out of a D XML
 library? I'd really like to have a discussion of the form, "Here
 is exactly the interface the structs/classes need to implement,
 go forth and implement." 

Except that that's really the task of the person creating the new std.xml. Generally what happens is that the person writing the module comes up with an API and then presents it rather than asking others to come up with ideas to design it for them. Obviously, ideas can be discussed, but design-by-committee is arguably a bad idea. And it just works better to have a concrete design to discuss.
 The general idea in my mind is
 "something SAX-like, with something a little DOM-like."

What I personally think would be best is to have multiple parsers. First you have something STAX-like (or maybe even lower level - I don't recall exactly what STAX gives you at the moment) that basically tokenizes the XML and returns a range of that. Then SAX and DOM parsers can be built on top of that. That way, you get the fastest parser possible as well as higher level, more functional parsers. But two of the biggest points of the design are that it's going to have to be range-based, and it's going to need to be able to take full advantage of slices (when used with any strings or random-access ranges) in order to avoid copying any of the data. That's the key design point which will allow a D parser to be extremely fast in comparison to parsers in most other languages.
 I'm aware
 that std.xml has some issues support different encodings, so
 obvious that's included.

Personally, I would have just said use ranges of dchar and be done with it without worrying about character encodings at all, but I don't remember what all the XML standard does with encodings.
 Second, is there an existing library that has gotten close to
 meeting whatever we need for the first point? If so, how far away
 is it from being able to meet all of the requirements and become
 the standard library version?

There are several D XML libraries floating around, but no one has taken the time to get any of the prepared for the Phobos review queue, and I suspect that very few of them are range-based like the Phobos XML solution needs to be, but I don't know. - Jonathan M Davis
Aug 29 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-29 09:47, Jonathan M Davis wrote:

 Personally, I would have just said use ranges of dchar and be done with it
 without worrying about character encodings at all, but I don't remember what
 all the XML standard does with encodings.

Won't that have the same problem as we talked about in of the threads about a D lexer? That is, doing unnecessary en/decoding. -- /Jacob Carlborg
Aug 29 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-08-29 11:23, Jonathan M Davis wrote:

 IIRC, everything in XML is
 ASCII anyway, with stuff like HTML codes to indicate Unicode characters. And if
 that's the case, avoiding unnecessary decoding is trivial when operating on
 strings.

What! I hardly believe that. That might be the case for HTML but I don't think it is for XML. There are many file formats that are based on XML. I don't think all those use HTML codes. This is what W3 Schools says: "XML documents can contain non ASCII characters, like Norwegian æ ø å , or French ê è é. To avoid errors, specify the XML encoding, or save XML files as Unicode.". -- /Jacob Carlborg
Aug 29 2013
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-29 16:07, Chris wrote:

 And while we're at it, what about YAML? It's a subset of JSON which
 means the new json.d module will handle it, I suppose.

YAML is a super set of JSON, not the other way around. But yes, I would like to have YAML support as well. -- /Jacob Carlborg
Aug 29 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-29 19:38, Jonathan M Davis wrote:

 However, because all of the XML special symbols should be ASCII, you should
 still be able to avoid decoding characters for the most part. It's only when
 you have to actually look at the content that Unicode would potentially
 matter. So, the performance hit of decoding Unicode characters should mostly
 be able to be avoided.

I don't understand. If use a range of dchar and call "front" and "popFront" won't it do decoding then? -- /Jacob Carlborg
Aug 29 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-29 20:57, H. S. Teoh wrote:

 XML files can have *any* valid encoding, including nastiness like
 windows-1252 and relics like iso-8859-1.

Actually, does the encoding really matters (as long as it's compatible with ASCII). Just use a range of ubytes, the parser will only be looking for characters in the ASCII table anyway. -- /Jacob Carlborg
Aug 29 2013
prev sibling parent Michel Fortin <michel.fortin michelf.ca> writes:
On 2013-08-29 17:38:23 +0000, "Jonathan M Davis" <jmdavisProg gmx.com> said:

 Well, as I said, I couldn't remember exactly what the XML standard said about
 encodings, but if it can contain non-ASCII characters, then my first
 inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on the
 fact that that's what we support in the language and in Phobos (as I
 understand it, std.encodings is a bit of a joke that needs to be rethought and
 replaced, but regardless, it's the only Phobos module supporting any non-
 Unicode encodings).

The XML standard says that an XML parser MUST support UTF-8 and UTF-16, and MAY support other encodings. Supporting non-UTF-8 encodings is a separate problem from parsing XML, and proper code for that would have much broader applications. Keep in mind that the more encoding you support, the more bloat you add to the executable, so there's a tradeoff to be made. In many cases, UTF-8 is enough, while in many others it's not. (My XML implementation has a function that parses the XML prolog and tells you the encoding so you can take the appropriate code path before feeding the parser. A higher level API could handle encodings automatically based on that that. )
 However, because all of the XML special symbols should be ASCII, you should
 still be able to avoid decoding characters for the most part. It's only when
 you have to actually look at the content that Unicode would potentially
 matter. So, the performance hit of decoding Unicode characters should mostly
 be able to be avoided.

Just like my XML implementation does. (I made frontUnit/popFrontUnit functions I'm using when decoding code points is unnecessary.) -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Aug 29 2013
prev sibling parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2013-08-29 07:47:17 +0000, Jonathan M Davis <jmdavisProg gmx.com> said:

 On Thursday, August 29, 2013 09:25:35 w0rp wrote:
 The general idea in my mind is
 "something SAX-like, with something a little DOM-like."

What I personally think would be best is to have multiple parsers. First you have something STAX-like (or maybe even lower level - I don't recall exactly what STAX gives you at the moment) that basically tokenizes the XML and returns a range of that. Then SAX and DOM parsers can be built on top of that. That way, you get the fastest parser possible as well as higher level, more functional parsers. But two of the biggest points of the design are that it's going to have to be range-based, and it's going to need to be able to take full advantage of slices (when used with any strings or random-access ranges) in order to avoid copying any of the data. That's the key design point which will allow a D parser to be extremely fast in comparison to parsers in most other languages.

I wrote something like that a while ago. It only accepted arrays as input because of the lack of a "buffered range" concept that'd allow lookahead and efficient slicing from any kind of range, but that could be retrofitted in. It implements pretty much all of the XML spec, except for documents having an internal subset (which is something a little arcane). It does not deal with namespaces either, I feel like that should be done a layer above, but I'm not entirely sure. Lower-level parser: http://michelf.ca/docs/d/mfr/xmltok.html Higher-level parser built on the first one: http://michelf.ca/docs/d/mfr/xml.html The code: http://michelf.ca/docs/d/mfr-xml-2010-10-19.zip That code hasn't been compiled in a while, but it used to work very well for me. Feel free to use as a starting point. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Aug 29 2013
parent Michel Fortin <michel.fortin michelf.ca> writes:
On 2013-09-03 16:11:37 +0000, "ilya-stromberg" 
<ilya-stromberg-2009 yandex.ru> said:

 On Thursday, 29 August 2013 at 16:14:28 UTC, Michel Fortin wrote:
 
 I wrote something like that a while ago.
 
 It only accepted arrays as input because of the lack of a "buffered 
 range" concept that'd allow lookahead and efficient slicing from any 
 kind of range, but that could be retrofitted in. It implements pretty 
 much all of the XML spec, except for documents having an internal 
 subset (which is something a little arcane). It does not deal with 
 namespaces either, I feel like that should be done a layer above, but 
 I'm not entirely sure.
 
 Lower-level parser:
 http://michelf.ca/docs/d/mfr/xmltok.html
 
 Higher-level parser built on the first one:
 http://michelf.ca/docs/d/mfr/xml.html
 
 The code:
 http://michelf.ca/docs/d/mfr-xml-2010-10-19.zip
 
 That code hasn't been compiled in a while, but it used to work very 
 well for me. Feel free to use as a starting point.

Can you push it to the github, please?

Good idea. http://github.com/michelf/mfr-xml-d Feel free to send pull requests if you want. I should be able to review them. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Sep 03 2013
prev sibling next sibling parent reply Johannes Pfau <nospam example.com> writes:
Am Thu, 29 Aug 2013 09:25:35 +0200
schrieb "w0rp" <devw0rp gmail.com>:

 Hello everybody. I've been wondering, what are the current plans 
 to replace std.xml? I'd like to help with the effort to get a 
 final XML library in phobos. So, I have a few questions.
 
 First, and most importantly, what do we except out of a D XML 
 library? I'd really like to have a discussion of the form, "Here 
 is exactly the interface the structs/classes need to implement, 
 go forth and implement." The general idea in my mind is 
 "something SAX-like, with something a little DOM-like." I'm aware 
 that std.xml has some issues support different encodings, so 
 obvious that's included.

I most points here also apply to std.xml: http://wiki.dlang.org/Wish_list/std.json Those are not strict requirements though, I just summarized what I remembered from old discussions.
 Second, is there an existing library that has gotten close to 
 meeting whatever we need for the first point? If so, how far away 
 is it from being able to meet all of the requirements and become 
 the standard library version?

There's a std.xml2 in the review queue: http://wiki.dlang.org/Review_Queue
Aug 29 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-08-29 10:15, Robert Schadek wrote:


 I think, this even extends to access to all semi- and structured-data.
 Think csv, sql nosql, you name it. Something which deserves a name like
 Uniform Access. I don't want to care if data is laid out differently. I
 want to define my struct or class mark the members to fill a pass it to
 somebodies code and don't want to care if its xml, sql or whatever.

So you want serialization :). Which we currently are reviewing. Unfortunately there might be too many changes needed to get it in Phobos this time. -- /Jacob Carlborg
Aug 29 2013
prev sibling next sibling parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
On Thursday, 29 August 2013 at 07:25:36 UTC, w0rp wrote:
 Hello everybody. I've been wondering, what are the current 
 plans to replace std.xml? I'd like to help with the effort to 
 get a final XML library in phobos. So, I have a few questions.

 First, and most importantly, what do we except out of a D XML 
 library? I'd really like to have a discussion of the form, 
 "Here is exactly the interface the structs/classes need to 
 implement, go forth and implement." The general idea in my mind 
 is "something SAX-like, with something a little DOM-like." I'm 
 aware that std.xml has some issues support different encodings, 
 so obvious that's included.

 Second, is there an existing library that has gotten close to 
 meeting whatever we need for the first point? If so, how far 
 away is it from being able to meet all of the requirements and 
 become the standard library version?

There is http://dsource.org/projects/xmlp, which at some point has been proposed for std.xml2. But that stalled for some time now.
Aug 29 2013
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-31 17:43, ilya-stromberg wrote:

 Also, we have Tango Xml:
 https://github.com/SiegeLord/Tango-D2/tree/d2port/tango/text/xml

 It's the fastest Xml parser in the world, so may be you can find it useful:
 dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/

 dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/

Unfortunately the Tango XML package will never end up in Phobos due to licensing issues. -- /Jacob Carlborg
Aug 31 2013
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2013-08-31 15:43:00 +0000, "ilya-stromberg" 
<ilya-stromberg-2009 yandex.ru> said:

 On Thursday, 29 August 2013 at 07:53:46 UTC, Tobias Pankrath wrote:
 There is http://dsource.org/projects/xmlp, which at some point has been 
 proposed for std.xml2. But that stalled for some time now.

Also, we have Tango Xml: https://github.com/SiegeLord/Tango-D2/tree/d2port/tango/text/xml It's the fastest Xml parser in the world, so may be you can find it useful: dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/ dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/

Someone should benchmark it against the XML implementation I made. It has many of the same characteristics. For instance, Tango's SaxParser is based on its PullParser. This design requires the use a dynamic array to maintain a stack of opened elements. While not a huge performance hit, you don't need that if you use recursion, which you can do with my implementation. You can do that even though you can also use it as a pull tokenizer[^1] when needed (recursion is optional on a token-by-token basis). [^1]: IMHO, PullParser isn't a really good term for something that does not conform to the requirements of a parser in the XML spec. Tokenizer is a better term. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Aug 31 2013
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-08-31 20:53, Michel Fortin wrote:

 [^1]: IMHO, PullParser isn't a really good term for something that does
 not conform to the requirements of a parser in the XML spec. Tokenizer
 is a better term.

I guess "Pull" is the key here. That it is the client's responsibility to fetch the next token, not the other way around. -- /Jacob Carlborg
Aug 31 2013
prev sibling parent Michel Fortin <michel.fortin michelf.ca> writes:
On 2013-09-02 13:34:18 +0000, "qznc" <qznc web.de> said:

 On Saturday, 31 August 2013 at 18:53:42 UTC, Michel Fortin wrote:
 For instance, Tango's SaxParser is based on its PullParser. This design 
 requires the use a dynamic array to maintain a stack of opened 
 elements. While not a huge performance hit, you don't need that if you 
 use recursion, which you can do with my implementation. You can do that 
 even though you can also use it as a pull tokenizer[^1] when needed 
 (recursion is optional on a token-by-token basis).

Recursion means you use the call stack instead of stack object on the heap. Be careful about nesting deepness. There are XML documents out there with thousands and more nested elements. With recursion on a 32bit machine you might get a stack overflow, but a heap-stack could handle a million nested elements.

Good point about caring for pathological cases. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Sep 02 2013
prev sibling parent Richard Webb <richard.webb boldonjames.com> writes:
On 31/08/2013 16:43, ilya-stromberg wrote:
 It's the fastest Xml parser in the world, so may be you can find it useful:
 dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/

 dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/

Has anyone done any benchmarks recently to see if that is still the case? I did some (admitedly brief) tests last year and found that xmlp was actually faster at building large XML docs into a DOM. There have been lots of changes since then, so i don't know if that is still the case.
Sep 02 2013
prev sibling next sibling parent Robert Schadek <realburner gmx.de> writes:
On 08/29/2013 09:51 AM, Johannes Pfau wrote:
 I most points here also apply to std.xml:
 http://wiki.dlang.org/Wish_list/std.json Those are not strict
 requirements though, I just summarized what I remembered from old
 discussions.

Think csv, sql nosql, you name it. Something which deserves a name like Uniform Access. I don't want to care if data is laid out differently. I want to define my struct or class mark the members to fill a pass it to somebodies code and don't want to care if its xml, sql or whatever.
Aug 29 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 29, 2013 11:08:18 Jacob Carlborg wrote:
 On 2013-08-29 09:47, Jonathan M Davis wrote:
 Personally, I would have just said use ranges of dchar and be done with it
 without worrying about character encodings at all, but I don't remember
 what all the XML standard does with encodings.

Won't that have the same problem as we talked about in of the threads about a D lexer? That is, doing unnecessary en/decoding.

Possibly, but then all you have to do is make it so that it treats strings as ranges of code units (and possibly support ranges of char and wchar), and you can avoid the unnecessary decoding. But aside from possibly support ranges of char or wchar, that would be completely internal to the parser, and the caller wouldn't care. An alternative would be to specifically support ranges of ubyte instead of strings, though given that XML is usually treated as a string, that would arguably be a bit odd. Regardless, as far as strings go, it's easy enough to avoid decoding in the implementation. IIRC, everything in XML is ASCII anyway, with stuff like HTML codes to indicate Unicode characters. And if that's the case, avoiding unnecessary decoding is trivial when operating on strings. - Jonathan M Davis
Aug 29 2013
prev sibling next sibling parent "Joakim" <joakim airpost.net> writes:
On Thursday, 29 August 2013 at 07:47:35 UTC, Jonathan M Davis 
wrote:
 There are several D XML libraries floating around, but no one 
 has taken the
 time to get any of the prepared for the Phobos review queue, 
 and I suspect
 that very few of them are range-based like the Phobos XML 
 solution needs to
 be, but I don't know.

nobody using D would use a dumb tech like XML. Let's keep it that way. :)
Aug 29 2013
prev sibling next sibling parent Robert Schadek <realburner gmx.de> writes:
On 08/29/2013 11:09 AM, Jacob Carlborg wrote:
 So you want serialization :). Which we currently are reviewing.
 Unfortunately there might be too many changes needed to get it in
 Phobos this time.

more transparent interface and I want to define join results types at compile time and and and
Aug 29 2013
prev sibling next sibling parent maarten van damme <maartenvd1994 gmail.com> writes:
--047d7bb03946fbf0c304e51385d9
Content-Type: text/plain; charset=ISO-8859-1

and imagine someone forced to use xml who reads this answer from the
community :p

std.xml is a must, no doubt.


2013/8/29 Joakim <joakim airpost.net>

 On Thursday, 29 August 2013 at 07:47:35 UTC, Jonathan M Davis wrote:

 There are several D XML libraries floating around, but no one has taken
 the
 time to get any of the prepared for the Phobos review queue, and I suspect
 that very few of them are range-based like the Phobos XML solution needs
 to
 be, but I don't know.

using D would use a dumb tech like XML. Let's keep it that way. :)

--047d7bb03946fbf0c304e51385d9 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr">and imagine someone forced to use xml who reads this answe= r from the community :p<div><br></div><div>std.xml is a must, no doubt.</di= v></div><div class=3D"gmail_extra"><br><br><div class=3D"gmail_quote">2013/= 8/29 Joakim <span dir=3D"ltr">&lt;<a href=3D"mailto:joakim airpost.net" tar= get=3D"_blank">joakim airpost.net</a>&gt;</span><br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"><div class=3D"im">On Thursday, 29 August 201= 3 at 07:47:35 UTC, Jonathan M Davis wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> There are several D XML libraries floating around, but no one has taken the= <br> time to get any of the prepared for the Phobos review queue, and I suspect<= br> that very few of them are range-based like the Phobos XML solution needs to= <br> be, but I don&#39;t know.<br> </blockquote></div> I think it&#39;s great that there&#39;s no std.xml, as it implies that nobo= dy using D would use a dumb tech like XML. =A0Let&#39;s keep it that way. := )<br> </blockquote></div><br></div> --047d7bb03946fbf0c304e51385d9--
Aug 29 2013
prev sibling next sibling parent "Chris" <wendlec tcd.ie> writes:
On Thursday, 29 August 2013 at 09:24:31 UTC, Joakim wrote:
 I think it's great that there's no std.xml, as it implies that 
 nobody using D would use a dumb tech like XML.  Let's keep it 
 that way. :)

No way around XML. A must have, as has been said in this thread. But what would you suggest as a better alternative to XML. It might be worth creating modules for alternative too (like JSON).
Aug 29 2013
prev sibling next sibling parent "Chris" <wendlec tcd.ie> writes:
On Thursday, 29 August 2013 at 13:20:40 UTC, Jacob Carlborg wrote:
 On 2013-08-29 11:23, Jonathan M Davis wrote:

 IIRC, everything in XML is
 ASCII anyway, with stuff like HTML codes to indicate Unicode 
 characters. And if
 that's the case, avoiding unnecessary decoding is trivial when 
 operating on
 strings.

What! I hardly believe that. That might be the case for HTML but I don't think it is for XML. There are many file formats that are based on XML. I don't think all those use HTML codes. This is what W3 Schools says: "XML documents can contain non ASCII characters, like Norwegian æ ø å , or French ê è é. To avoid errors, specify the XML encoding, or save XML files as Unicode.".

And while we're at it, what about YAML? It's a subset of JSON which means the new json.d module will handle it, I suppose.
Aug 29 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 29, 2013 at 01:14:19PM +0200, Chris wrote:
 On Thursday, 29 August 2013 at 09:24:31 UTC, Joakim wrote:
I think it's great that there's no std.xml, as it implies that
nobody using D would use a dumb tech like XML.  Let's keep it that
way. :)

No way around XML. A must have, as has been said in this thread. But what would you suggest as a better alternative to XML. It might be worth creating modules for alternative too (like JSON).

While I do agree that in the current state of affairs, XML support is a must, I also think that XML is just way overengineered, IMNSHO. It has adds too much overhead and therefore requires compression to be efficient, and it is needlessly complex for what it does (tag attributes, all the different cases of CDATA / non-CDATA, etc.). This complexity makes it impractical to edit by hand, relegating it to machine reading/writing only, which then begs the question of why a binary format wasn't chosen instead. And don't get me started on DTDs, which are incredibly convoluted and can't even express certain things that one might want to express in an automatic validation system. Or that 17-headed monster called XSLT, which, thankfully, is fading into the obscurity of time. JSON is a nicer, simpler alternative, though there may be limitations with it that I don't know about. Word on the street is that many people are abandoning XML for JSON due to lower maintenance overhead (and this includes one of my friends, who was a hardcore XML fanatic -- I was frankly quite surprised when he told me he was considering migrating to JSON, since the original reason he chose XML was so that his data will future-proof... well, so much for *that*). But all of this is irrelevant... it doesn't alleviate the need for a std.xml replacement, since we have to live in the real world where XML exists and must be supported. :) T -- Life would be easier if I had the source code. -- YHL
Aug 29 2013
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/29/13 12:25 AM, w0rp wrote:
 Hello everybody. I've been wondering, what are the current plans to
 replace std.xml? I'd like to help with the effort to get a final XML
 library in phobos. So, I have a few questions.

 First, and most importantly, what do we except out of a D XML library?
 I'd really like to have a discussion of the form, "Here is exactly the
 interface the structs/classes need to implement, go forth and
 implement." The general idea in my mind is "something SAX-like, with
 something a little DOM-like." I'm aware that std.xml has some issues
 support different encodings, so obvious that's included.

 Second, is there an existing library that has gotten close to meeting
 whatever we need for the first point? If so, how far away is it from
 being able to meet all of the requirements and become the standard
 library version?

I don't know much about XML, but I noticed there are a few popular libraries and models. I'd expect a replacement for std.xml would choose one of these popular models that is most appropriate for D. Andrei
Aug 29 2013
prev sibling next sibling parent "Joakim" <joakim airpost.net> writes:
On Thursday, 29 August 2013 at 11:14:21 UTC, Chris wrote:
 On Thursday, 29 August 2013 at 09:24:31 UTC, Joakim wrote:
 I think it's great that there's no std.xml, as it implies that 
 nobody using D would use a dumb tech like XML.  Let's keep it 
 that way. :)

No way around XML. A must have, as has been said in this thread. But what would you suggest as a better alternative to XML. It might be worth creating modules for alternative too (like JSON).

be great for Phobos to nudge users in better directions, by having a std.json but no std.xml. There'll always be outside libraries to process XML, for those who can't go without, perhaps a list of XML libraries can be added to the wiki: http://wiki.dlang.org/Libraries_and_Frameworks I see no use for XML, as it's a horrible solution in search of a problem, but for those who must use it, they can always get it outside Phobos. Just a suggestion.
Aug 29 2013
prev sibling next sibling parent "Chris" <wendlec tcd.ie> writes:
On Thursday, 29 August 2013 at 15:43:36 UTC, H. S. Teoh wrote:
 While I do agree that in the current state of affairs, XML 
 support is a
 must, I also think that XML is just way overengineered, IMNSHO. 
 It has
 adds too much overhead and therefore requires compression to be
 efficient, and it is needlessly complex for what it does (tag
 attributes, all the different cases of CDATA / non-CDATA, 
 etc.). This
 complexity makes it impractical to edit by hand, relegating it 
 to
 machine reading/writing only, which then begs the question of 
 why a
 binary format wasn't chosen instead. And don't get me started 
 on DTDs,
 which are incredibly convoluted and can't even express certain 
 things
 that one might want to express in an automatic validation 
 system. Or
 that 17-headed monster called XSLT, which, thankfully, is 
 fading into
 the obscurity of time.

 JSON is a nicer, simpler alternative, though there may be 
 limitations
 with it that I don't know about. Word on the street is that 
 many people
 are abandoning XML for JSON due to lower maintenance overhead 
 (and this
 includes one of my friends, who was a hardcore XML fanatic -- I 
 was
 frankly quite surprised when he told me he was considering 
 migrating to
 JSON, since the original reason he chose XML was so that his 
 data will
 future-proof... well, so much for *that*).

 But all of this is irrelevant... it doesn't alleviate the need 
 for a
 std.xml replacement, since we have to live in the real world 
 where XML
 exists and must be supported. :)


 T

I am moving away from XML too. Wanted to use it for a private project. But I soon realized the madness of it, especially when there are people involved who are not programmers and have no clue whatsoever about markup languages, data storage formats etc. I think JSON and YAML are good candidates for the private project which revolves around collecting words and phrases and archiving them. I don't know exactly what I will use, but XML definitely won't get the job. DTD sounds too much like DDT!
Aug 29 2013
prev sibling next sibling parent "Jason den Dulk" <public2 jasondendulk.com> writes:
On Thursday, 29 August 2013 at 15:43:36 UTC, H. S. Teoh wrote:
 JSON is a nicer, simpler alternative, though there may be 
 limitations
 with it that I don't know about.

The main disavantage of JSON vs XML is lack of validation. Whenever I write code that works with JSON (or any data format), I have to write extra code to perform validation. If there was a validation addon for JSON, you could nix XML for good. Regards Jason
Aug 29 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 29, 2013 15:20:39 Jacob Carlborg wrote:
 On 2013-08-29 11:23, Jonathan M Davis wrote:
 IIRC, everything in XML is
 ASCII anyway, with stuff like HTML codes to indicate Unicode characters.
 And if that's the case, avoiding unnecessary decoding is trivial when
 operating on strings.

What! I hardly believe that. That might be the case for HTML but I don't think it is for XML. There are many file formats that are based on XML. I don't think all those use HTML codes. This is what W3 Schools says: "XML documents can contain non ASCII characters, like Norwegian æ ø å , or French ê è é. To avoid errors, specify the XML encoding, or save XML files as Unicode.".

Well, as I said, I couldn't remember exactly what the XML standard said about encodings, but if it can contain non-ASCII characters, then my first inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on the fact that that's what we support in the language and in Phobos (as I understand it, std.encodings is a bit of a joke that needs to be rethought and replaced, but regardless, it's the only Phobos module supporting any non- Unicode encodings). However, because all of the XML special symbols should be ASCII, you should still be able to avoid decoding characters for the most part. It's only when you have to actually look at the content that Unicode would potentially matter. So, the performance hit of decoding Unicode characters should mostly be able to be avoided. - Jonathan M Davis
Aug 29 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 29, 2013 12:14:28 Michel Fortin wrote:
 On 2013-08-29 07:47:17 +0000, Jonathan M Davis <jmdavisProg gmx.com> said:
 On Thursday, August 29, 2013 09:25:35 w0rp wrote:
 The general idea in my mind is
 "something SAX-like, with something a little DOM-like."

What I personally think would be best is to have multiple parsers. First you have something STAX-like (or maybe even lower level - I don't recall exactly what STAX gives you at the moment) that basically tokenizes the XML and returns a range of that. Then SAX and DOM parsers can be built on top of that. That way, you get the fastest parser possible as well as higher level, more functional parsers. But two of the biggest points of the design are that it's going to have to be range-based, and it's going to need to be able to take full advantage of slices (when used with any strings or random-access ranges) in order to avoid copying any of the data. That's the key design point which will allow a D parser to be extremely fast in comparison to parsers in most other languages.

It only accepted arrays as input because of the lack of a "buffered range" concept that'd allow lookahead and efficient slicing from any kind of range, but that could be retrofitted in. It implements pretty much all of the XML spec, except for documents having an internal subset (which is something a little arcane). It does not deal with namespaces either, I feel like that should be done a layer above, but I'm not entirely sure. Lower-level parser: http://michelf.ca/docs/d/mfr/xmltok.html Higher-level parser built on the first one: http://michelf.ca/docs/d/mfr/xml.html The code: http://michelf.ca/docs/d/mfr-xml-2010-10-19.zip That code hasn't been compiled in a while, but it used to work very well for me. Feel free to use as a starting point.

Cool. I started looking at implementing something like that a while back but really didn't have time to get very far. But if we really care about efficiency, I think that that's the basic approach that we need to take. However, the trick as always is someone having the time to do it. Maybe one of us can take what you did and start from there or at least use it is an example to start from. - Jonathan M Davis
Aug 29 2013
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Thursday, 29 August 2013 at 17:38:43 UTC, Jonathan M Davis 
wrote:
 Well, as I said, I couldn't remember exactly what the XML 
 standard said about
 encodings, but if it can contain non-ASCII characters, then my 
 first
 inclination is to say that it has to be UTF-8, UTF-16, or 
 UTF-32 based on the
 fact that that's what we support in the language and in Phobos 
 (as I
 understand it, std.encodings is a bit of a joke that needs to 
 be rethought and
 replaced, but regardless, it's the only Phobos module 
 supporting any non-
 Unicode encodings).

 However, because all of the XML special symbols should be 
 ASCII, you should
 still be able to avoid decoding characters for the most part. 
 It's only when
 you have to actually look at the content that Unicode would 
 potentially
 matter. So, the performance hit of decoding Unicode characters 
 should mostly
 be able to be avoided.

 - Jonathan M Davis

You just specify the encoding in the root element. <?xml version="1.0" encoding="us-ascii"?> <?xml version="1.0" encoding="windows-1252"?> <?xml version="1.0" encoding="ISO-8859-1"?> <?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-16"?> UTF-8 is the default in lieu of a BOM saying otherwise.
Aug 29 2013
prev sibling next sibling parent "w0rp" <devw0rp gmail.com> writes:
On Thursday, 29 August 2013 at 09:24:31 UTC, Joakim wrote:
 I think it's great that there's no std.xml, as it implies that 
 nobody using D would use a dumb tech like XML.  Let's keep it 
 that way. :)

JSON is better than XML in every way I can think of. Easier to map to data structures in whichever language you're using, much smaller in size, less corner cases, etc. However, just saying XML is dumb isn't a useful policy. You need ways of parsing XML on hand until people stop using it. On Thursday, 29 August 2013 at 08:15:39 UTC, Robert Schadek wrote:
 On 08/29/2013 09:51 AM, Johannes Pfau wrote:
 I most points here also apply to std.xml:
 t Those are not strict
 requirements though, I just summarized what I remembered from 
 old
 discussions.

structured-data. Think csv, sql nosql, you name it. Something which deserves a name like Uniform Access. I don't want to care if data is laid out differently. I want to define my struct or class mark the members to fill a pass it to somebodies code and don't want to care if its xml, sql or whatever.

I'm really not so sure about that kind of approach. Automatic serialisation I think works one of two ways. Either you have control over the data you're pulling in, and you can change it to map more easily to your data structures, or you don't and you have to make your data structures more ugly to fit the data you're pulling in. I prefer just writing functions that take format X and give you in-memory representation Y over automatic serialisation stuff. I know it's boring and easy to write functions like that, but why can't some things just be boring and easy? This looks like a really popular topic, and it's cool that there seem to be quite a few implementations that are close to being what we want. I think we're probably not far off just lining up a few different implementations and reviewing them all for possible inclusion in phobos.
Aug 29 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 29, 2013 at 01:38:23PM -0400, Jonathan M Davis wrote:
[...]
 Well, as I said, I couldn't remember exactly what the XML standard said about 
 encodings, but if it can contain non-ASCII characters, then my first 
 inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on the 
 fact that that's what we support in the language and in Phobos

Take a look here: http://www.w3schools.com/xml/xml_encoding.asp XML files can have *any* valid encoding, including nastiness like windows-1252 and relics like iso-8859-1. Unfortunately, I don't think we have a way around this, since existing XML files out there probably already have all of these encodings are more, and std.xml is gonna hafta support 'em all. Otherwise we're gonna get irate users complaining "why can't std.xml parse my oddly-encoded-but-standards-compliant XML file?!"
 (as I understand it, std.encodings is a bit of a joke that needs to be
 rethought and replaced, but regardless, it's the only Phobos module
 supporting any non- Unicode encodings).

No kidding! I was trying to write a program that navigates a website automatically using std.net.curl, and I'm running into all sorts of silly roadblocks, including std.encoding not supporting iso-8859-* encodings. The good news is that on Linux, there's a handy utility called 'recode', which comes with a library called 'librecode', that supports converting between a huge number of different encodings -- many more than probably you or I have imagined existed -- including to/from Unicode. I know we don't like including external libraries in Phobos, but I honestly don't see any justification for reinventing the wheel by writing (and maintaining!) our own equivalent to librecode, unless licensing issues prevents us from including librecode in Phobos, nicely wrapped in a modern range-based D API.
 However, because all of the XML special symbols should be ASCII, you
 should still be able to avoid decoding characters for the most part.
 It's only when you have to actually look at the content that Unicode
 would potentially matter. So, the performance hit of decoding Unicode
 characters should mostly be able to be avoided.

One way is to write the core code of std.xml in such a way that it handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit encodings) so that it's encoding-independent. Then on top of this core, write some convenience wrappers that casts/converts to string, wstring, dstring. As an initial stab, we could support only UTF-8, UTF-16, UTF-32 if the user asks for string/wstring/dstring, and leave XML in other encodings up to the user to decode manually. This way, at least the user can get the data out of the file. Later on, once we've gotten our act together with std.encoding, we can hook it up to std.xml to provide autoconversion. T -- Almost all proofs have bugs, but almost all theorems are true. -- Paul Pedersen
Aug 29 2013
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Thursday, 29 August 2013 at 18:58:57 UTC, H. S. Teoh wrote:
 No kidding! I was trying to write a program that navigates a 
 website
 automatically using std.net.curl, and I'm running into all 
 sorts of
 silly roadblocks, including std.encoding not supporting 
 iso-8859-*
 encodings.

It doesn't look like adding the rest of the ISO-8859 encodings would be all that difficult if you used the existing ISO-8859-1 (Latin1) as a base. I don't quite understand where and how transcoding is done though.
 The good news is that on Linux, there's a handy utility called 
 'recode',
 which comes with a library called 'librecode', that supports 
 converting
 between a huge number of different encodings -- many more than 
 probably
 you or I have imagined existed -- including to/from Unicode.  I 
 know we
 don't like including external libraries in Phobos, but I 
 honestly don't
 see any justification for reinventing the wheel by writing (and
 maintaining!) our own equivalent to librecode, unless licensing 
 issues
 prevents us from including librecode in Phobos, nicely wrapped 
 in a
 modern range-based D API.


 However, because all of the XML special symbols should be 
 ASCII, you
 should still be able to avoid decoding characters for the most 
 part.
 It's only when you have to actually look at the content that 
 Unicode
 would potentially matter. So, the performance hit of decoding 
 Unicode
 characters should mostly be able to be avoided.

One way is to write the core code of std.xml in such a way that it handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit encodings) so that it's encoding-independent. Then on top of this core, write some convenience wrappers that casts/converts to string, wstring, dstring. As an initial stab, we could support only UTF-8, UTF-16, UTF-32 if the user asks for string/wstring/dstring, and leave XML in other encodings up to the user to decode manually. This way, at least the user can get the data out of the file. Later on, once we've gotten our act together with std.encoding, we can hook it up to std.xml to provide autoconversion. T

Aug 29 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 29, 2013 21:28:09 Jacob Carlborg wrote:
 On 2013-08-29 19:38, Jonathan M Davis wrote:
 However, because all of the XML special symbols should be ASCII, you
 should
 still be able to avoid decoding characters for the most part. It's only
 when you have to actually look at the content that Unicode would
 potentially matter. So, the performance hit of decoding Unicode
 characters should mostly be able to be avoided.

I don't understand. If use a range of dchar and call "front" and "popFront" won't it do decoding then?

Any decent parser is going to special-case strings (especially if it's using slicing), in which case, it won't call front unless it needs to decode. The only real question is whether generic char and wchar ranges should be supported, because then you could avoid the decoding for ranges that aren't strings, but strings are already covered simply by special casing. You really can't afford to not special-case for strings for algorithms in general if efficiency is a high priority. - Jonathan M Davis
Aug 29 2013
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Thursday, 29 August 2013 at 08:15:39 UTC, Robert Schadek wrote:
 On 08/29/2013 09:51 AM, Johannes Pfau wrote:
 I most points here also apply to std.xml:
 http://wiki.dlang.org/Wish_list/std.json Those are not strict
 requirements though, I just summarized what I remembered from 
 old
 discussions.

structured-data. Think csv, sql nosql, you name it. Something which deserves a name like Uniform Access. I don't want to care if data is laid out differently. I want to define my struct or class mark the members to fill a pass it to somebodies code and don't want to care if its xml, sql or whatever.

That's a really great point. All of these modules that can't know the types and structure in advance should probably all use the same techniques for handling the situation. Perhaps a new module to unify all this stuff is in order. I seem to recall Adam D. Ruppe's "Is this D or is this Javascript?" thread[1] having some nice tricks to deal with dynamically typed data. 1. http://forum.dlang.org/thread/kuxfkakrgjaofkrdvgmx forum.dlang.org
Aug 29 2013
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Thursday, 29 August 2013 at 19:40:08 UTC, Brad Anderson wrote:
 That's a really great point.  All of these modules that can't 
 know the types and structure in advance should probably all use 
 the same techniques for handling the situation.  Perhaps a new 
 module to unify all this stuff is in order.

 I seem to recall Adam D. Ruppe's "Is this D or is this 
 Javascript?" thread[1] having some nice tricks to deal with 
 dynamically typed data.

 1. 
 http://forum.dlang.org/thread/kuxfkakrgjaofkrdvgmx forum.dlang.org

(or maybe just improve Variant)
Aug 29 2013
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
On Aug 29, 2013, at 11:57 AM, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:
=20
 One way is to write the core code of std.xml in such a way that it
 handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit
 encodings) so that it's encoding-independent. Then on top of this =

 write some convenience wrappers that casts/converts to string, =

 dstring. As an initial stab, we could support only UTF-8, UTF-16, =

 if the user asks for string/wstring/dstring, and leave XML in other
 encodings up to the user to decode manually. This way, at least the =

 can get the data out of the file.
=20
 Later on, once we've gotten our act together with std.encoding, we can
 hook it up to std.xml to provide autoconversion.

As long autoconversion is optional. When parsing XML or JSON or = whatever, I generally only care about specific strings, and sometimes = don't want anything decoded at all. Having decoding done automatically = before the event fires is a huge and potentially unnecessary performance = hit. Not doing this decoding automatically is what makes the Tango XML = parser so fast.=
Aug 29 2013
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Thursday, 29 August 2013 at 20:08:10 UTC, Sean Kelly wrote:
 As long autoconversion is optional.  When parsing XML or JSON 
 or whatever, I generally only care about specific strings, and 
 sometimes don't want anything decoded at all.  Having decoding 
 done automatically before the event fires is a huge and 
 potentially unnecessary performance hit.  Not doing this 
 decoding automatically is what makes the Tango XML parser so 
 fast.

This makes me wonder what kind of optimizations a hypothetical ctXml could perform.
Aug 29 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 29, 2013 at 12:41:16PM -0700, Sean Kelly wrote:
 On Aug 29, 2013, at 11:57 AM, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:
 
 One way is to write the core code of std.xml in such a way that it
 handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit
 encodings) so that it's encoding-independent. Then on top of this
 core, write some convenience wrappers that casts/converts to string,
 wstring, dstring. As an initial stab, we could support only UTF-8,
 UTF-16, UTF-32 if the user asks for string/wstring/dstring, and
 leave XML in other encodings up to the user to decode manually. This
 way, at least the user can get the data out of the file.
 
 Later on, once we've gotten our act together with std.encoding, we
 can hook it up to std.xml to provide autoconversion.

As long autoconversion is optional. When parsing XML or JSON or whatever, I generally only care about specific strings, and sometimes don't want anything decoded at all. Having decoding done automatically before the event fires is a huge and potentially unnecessary performance hit. Not doing this decoding automatically is what makes the Tango XML parser so fast.

Right, that's why I said the core of std.xml should handle everything as bytes, only specially treating the ASCII values of <, >, &, and other metacharacters. The tagname and tag body should just be a range over segments of the input. T -- What are you when you run out of Monet? Baroque.
Aug 29 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 29, 2013 14:27:22 H. S. Teoh wrote:
 Right, that's why I said the core of std.xml should handle everything as
 bytes, only specially treating the ASCII values of <, >, &, and other
 metacharacters. The tagname and tag body should just be a range over
 segments of the input.

That works especially well with how Michel and I were thinking it should be split up with a core that essentially just gives you a range of XML tokens/tags. You then have separate SAX and/or DOM parsers on top of that (which also should minimize decoding, but they actually have to care about decoding in some cases in order to do stuff like check matching tags). - Jonathan M Davis
Aug 29 2013
prev sibling next sibling parent "Chris" <wendlec tcd.ie> writes:
On Thursday, 29 August 2013 at 19:26:21 UTC, Jacob Carlborg wrote:
 On 2013-08-29 16:07, Chris wrote:

 And while we're at it, what about YAML? It's a subset of JSON 
 which
 means the new json.d module will handle it, I suppose.

YAML is a super set of JSON, not the other way around. But yes, I would like to have YAML support as well.

Yes of course, you are right. I found this on the internet. Seems to be abandoned. https://github.com/kiith-sa/D-YAML
Aug 29 2013
prev sibling next sibling parent "Kiith-Sa" <kiithsacmp gmail.com> writes:
On Thursday, 29 August 2013 at 22:56:36 UTC, Chris wrote:
 On Thursday, 29 August 2013 at 19:26:21 UTC, Jacob Carlborg 
 wrote:
 On 2013-08-29 16:07, Chris wrote:

 And while we're at it, what about YAML? It's a subset of JSON 
 which
 means the new json.d module will handle it, I suppose.

YAML is a super set of JSON, not the other way around. But yes, I would like to have YAML support as well.

Yes of course, you are right. I found this on the internet. Seems to be abandoned. https://github.com/kiith-sa/D-YAML

It's not really abandoned, I keep updating it with compatibility fixes for new DMD releases as my other projects depend on it. Its API does not fit into Phobos, however (not range-based), and it won't unless I find a few weeks/months to work on it exclusively, which is unlikely in the near future. It also only supports YAML 1.1 at the moment, and recursive data structures are not yet supported.
Aug 29 2013
prev sibling next sibling parent "ilya-stromberg" <ilya-stromberg-2009 yandex.ru> writes:
On Thursday, 29 August 2013 at 07:53:46 UTC, Tobias Pankrath 
wrote:
 There is http://dsource.org/projects/xmlp, which at some point 
 has been proposed for std.xml2. But that stalled for some time 
 now.

Also, we have Tango Xml: https://github.com/SiegeLord/Tango-D2/tree/d2port/tango/text/xml It's the fastest Xml parser in the world, so may be you can find it useful: dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/ dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/
Aug 31 2013
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/29/2013 12:25 AM, w0rp wrote:
 Hello everybody. I've been wondering, what are the current plans to replace
 std.xml? I'd like to help with the effort to get a final XML library in phobos.
 So, I have a few questions.

 First, and most importantly, what do we except out of a D XML library? I'd
 really like to have a discussion of the form, "Here is exactly the interface
the
 structs/classes need to implement, go forth and implement." The general idea in
 my mind is "something SAX-like, with something a little DOM-like." I'm aware
 that std.xml has some issues support different encodings, so obvious that's
 included.

 Second, is there an existing library that has gotten close to meeting whatever
 we need for the first point? If so, how far away is it from being able to meet
 all of the requirements and become the standard library version?

The Tango implementation of XML has been very well received. I haven't looked at it, but it was designed to do no memory allocation - it just did slices over the input. I don't believe it should make any attempt at decoding. Decoding entails both performance loss and memory consumption. If the user wants to do decoding, they can layer it on the output. And lastly, it should of course sport a range interface.
Aug 31 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 29 August 2013 at 18:58:57 UTC, H. S. Teoh wrote:
 On Thu, Aug 29, 2013 at 01:38:23PM -0400, Jonathan M Davis 
 wrote:
 [...]
 Well, as I said, I couldn't remember exactly what the XML 
 standard said about encodings, but if it can contain non-ASCII 
 characters, then my first inclination is to say that it has to 
 be UTF-8, UTF-16, or UTF-32 based on the fact that that's what 
 we support in the language and in Phobos

Take a look here: http://www.w3schools.com/xml/xml_encoding.asp XML files can have *any* valid encoding, including nastiness like windows-1252 and relics like iso-8859-1. Unfortunately, I don't think we have a way around this, since existing XML files out there probably already have all of these encodings are more, and std.xml is gonna hafta support 'em all. Otherwise we're gonna get irate users complaining "why can't std.xml parse my oddly-encoded-but-standards-compliant XML file?!"

As this is not the first time I see it used as a reliable source, no, w3school is full of shit. Don't use that website when looking for precise high quality information.
Aug 31 2013
prev sibling next sibling parent "ilya-stromberg" <ilya-stromberg-2009 yandex.ru> writes:
On Saturday, 31 August 2013 at 18:03:10 UTC, Jacob Carlborg wrote:
 Unfortunately the Tango XML package will never end up in Phobos 
 due to licensing issues.

Yes, but we can always learn source code and put attention to the design solutions.
Sep 01 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, September 01, 2013 10:02:50 ilya-stromberg wrote:
 On Saturday, 31 August 2013 at 18:03:10 UTC, Jacob Carlborg wrote:
 Unfortunately the Tango XML package will never end up in Phobos
 due to licensing issues.

Yes, but we can always learn source code and put attention to the design solutions.

Not really. Looking at the source code effectively taints you. By doing so, you run the risk of being accused of copying if anything you do is similar enough. It's just safer to never look at source code when the license is going to make it so that you can't use that code. - Jonathan M Davis
Sep 01 2013
prev sibling next sibling parent "qznc" <qznc web.de> writes:
On Saturday, 31 August 2013 at 18:53:42 UTC, Michel Fortin wrote:
 On 2013-08-31 15:43:00 +0000, "ilya-stromberg" 
 <ilya-stromberg-2009 yandex.ru> said:

 On Thursday, 29 August 2013 at 07:53:46 UTC, Tobias Pankrath 
 wrote:
 There is http://dsource.org/projects/xmlp, which at some 
 point has been proposed for std.xml2. But that stalled for 
 some time now.

Also, we have Tango Xml: https://github.com/SiegeLord/Tango-D2/tree/d2port/tango/text/xml It's the fastest Xml parser in the world, so may be you can find it useful: dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/ dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/

Someone should benchmark it against the XML implementation I made. It has many of the same characteristics. For instance, Tango's SaxParser is based on its PullParser. This design requires the use a dynamic array to maintain a stack of opened elements. While not a huge performance hit, you don't need that if you use recursion, which you can do with my implementation. You can do that even though you can also use it as a pull tokenizer[^1] when needed (recursion is optional on a token-by-token basis).

Recursion means you use the call stack instead of stack object on the heap. Be careful about nesting deepness. There are XML documents out there with thousands and more nested elements. With recursion on a 32bit machine you might get a stack overflow, but a heap-stack could handle a million nested elements.
Sep 02 2013
prev sibling next sibling parent reply Lionello Lunesu <lionello lunesu.remove.com> writes:
On 8/29/13 15:25, w0rp wrote:
 Hello everybody. I've been wondering, what are the current plans to
 replace std.xml? I'd like to help with the effort to get a final XML
 library in phobos. So, I have a few questions.

 First, and most importantly, what do we except out of a D XML library?
 I'd really like to have a discussion of the form, "Here is exactly the
 interface the structs/classes need to implement, go forth and
 implement." The general idea in my mind is "something SAX-like, with
 something a little DOM-like." I'm aware that std.xml has some issues
 support different encodings, so obvious that's included.

 Second, is there an existing library that has gotten close to meeting
 whatever we need for the first point? If so, how far away is it from
 being able to meet all of the requirements and become the standard
 library version?

Having been the lead programmer on the Microsoft XML team for three years, I can easily say that the most popular XML API [on MS stack] is the XmlReader and XLinq in .NET. (This has nothing to do with LINQ, by the way.) I'd be willing to help make D versions of that, but my time is limited. But as usual, I don't think it's the actual coding that will take time. Designing a good interface is the hardest part, and I'd consider that part done. L.
Sep 02 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/2/13 7:40 PM, Lionello Lunesu wrote:
 Having been the lead programmer on the Microsoft XML team for three
 years, I can easily say that the most popular XML API [on MS stack] is
 the XmlReader and XLinq in .NET. (This has nothing to do with LINQ, by
 the way.)

 I'd be willing to help make D versions of that, but my time is limited.
 But as usual, I don't think it's the actual coding that will take time.
 Designing a good interface is the hardest part, and I'd consider that
 part done.

This is great info, thanks. Andrei
Sep 03 2013
prev sibling next sibling parent Peter Williams <pwil3058 bigpond.net.au> writes:
On 03/09/13 12:40, Lionello Lunesu wrote:
 On 8/29/13 15:25, w0rp wrote:
 Hello everybody. I've been wondering, what are the current plans to
 replace std.xml? I'd like to help with the effort to get a final XML
 library in phobos. So, I have a few questions.

 First, and most importantly, what do we except out of a D XML library?
 I'd really like to have a discussion of the form, "Here is exactly the
 interface the structs/classes need to implement, go forth and
 implement." The general idea in my mind is "something SAX-like, with
 something a little DOM-like." I'm aware that std.xml has some issues
 support different encodings, so obvious that's included.

 Second, is there an existing library that has gotten close to meeting
 whatever we need for the first point? If so, how far away is it from
 being able to meet all of the requirements and become the standard
 library version?

Having been the lead programmer on the Microsoft XML team for three years, I can easily say that the most popular XML API [on MS stack] is the XmlReader and XLinq in .NET. (This has nothing to do with LINQ, by the way.) I'd be willing to help make D versions of that, but my time is limited. But as usual, I don't think it's the actual coding that will take time. Designing a good interface is the hardest part, and I'd consider that part done. L.

For whoever ends up doing std.xml's replacement, it would be good if some of the lower level interfaces such as encode/decode (for escaping/unescaping within text) were exposed. I'm finding the ones in std.xml useful for implementing markup in label widgets during my investigation into reimplementing the GTK+ (modified) interface in D. Of course, there's always the chance that the new D xml API provides enough to make my markup code redundant. It's possible that the current high level APIs in std.xml also provides enough make my work redundant but I decided not to investigate this possibility after I saw the "will be replaced by something different" warning. Cheers Peter
Sep 02 2013
prev sibling parent "ilya-stromberg" <ilya-stromberg-2009 yandex.ru> writes:
On Thursday, 29 August 2013 at 16:14:28 UTC, Michel Fortin wrote:
 I wrote something like that a while ago.

 It only accepted arrays as input because of the lack of a 
 "buffered range" concept that'd allow lookahead and efficient 
 slicing from any kind of range, but that could be retrofitted 
 in. It implements pretty much all of the XML spec, except for 
 documents having an internal subset (which is something a 
 little arcane). It does not deal with namespaces either, I feel 
 like that should be done a layer above, but I'm not entirely 
 sure.

 Lower-level parser:
 http://michelf.ca/docs/d/mfr/xmltok.html

 Higher-level parser built on the first one:
 http://michelf.ca/docs/d/mfr/xml.html

 The code:
 http://michelf.ca/docs/d/mfr-xml-2010-10-19.zip

 That code hasn't been compiled in a while, but it used to work 
 very well for me. Feel free to use as a starting point.

Can you push it to the github, please?
Sep 03 2013