digitalmars.D - Status of std.xml (D2/Phobos)

reply Justin Johansson <no spam.com> writes:
May I ask is anybody working on redeveloping std.xml in the D2/Phobos 
library?  (Currently it looks like it needs to be started over from scratch)

Also what is the level of interest from library users for decent XML 
support in D2/Phobos?

Cheers
Justin Johansson
Jun 27 2010
next sibling parent Lutger <lutger.blijdestijn gmail.com> writes:
Justin Johansson wrote:

 May I ask is anybody working on redeveloping std.xml in the D2/Phobos
 library?  (Currently it looks like it needs to be started over from scratch)
 
 Also what is the level of interest from library users for decent XML
 support in D2/Phobos?
 
 Cheers
 Justin Johansson

Interested, very much so. I think many people are.
Jun 27 2010
prev sibling next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Justin Johansson <no spam.com> wrote:
 Also what is the level of interest from library users for decent XML  
 support in D2/Phobos?

Absolutely. It is a necessity. -- Simen
Jun 27 2010
prev sibling next sibling parent reply Justin Johansson <no spam.com> writes:
Justin Johansson wrote:
 May I ask is anybody working on redeveloping std.xml in the D2/Phobos 
 library?  (Currently it looks like it needs to be started over from 
 scratch)
 
 Also what is the level of interest from library users for decent XML 
 support in D2/Phobos?
 
 Cheers
 Justin Johansson

Lutger said: Interested, very much so. I think many people are.

Simen said: Absolutely. It is a necessity.

Thanks for the fast replies from Lutger and Simen. Being an XML/W3C addict myself, I concur with Simen's and Lutger's sentiments. However, Lutger simply saying that he "thinks" *many* people are interested is not good enough for me. Would the *many* people also please add a voice along with Simen et al. to inspire me to contribute to the D2 XML effort (of course under a Walter-endorsed style of licence).

Brief about me: I believe I possess the skills and experience in XML/XSLT and other W3C stuff, together with 20+ years as a C++ developer, to contribute good, peer-reviewable work; it's just that I need inspiration to put this task into action. The only other thing is that any offer on my part is conditional upon obtaining a "sabbatical break", hopefully in the next month or so, to be able to put in the time to make it happen.

Cheers
Justin Johansson
Jun 27 2010
next sibling parent reply Lutger <lutger.blijdestijn gmail.com> writes:
Justin Johansson wrote:

 Justin Johansson wrote:
 May I ask is anybody working on redeveloping std.xml in the D2/Phobos
 library?  (Currently it looks like it needs to be started over from
 scratch)
 
 Also what is the level of interest from library users for decent XML
 support in D2/Phobos?
 
 Cheers
 Justin Johansson

Lutger said: Interested, very much so. I think many people are. Simen said: Absolutely. It is a necessity. Thanks for fast replies from Lutger and Simen. Being an XML/W3C addict myself I well concur with Simen and Lutger's sentiments. However, Lutger simply saying that he "thinks" *many* people are interested is not good enough for me.

Well, I dare only speak for myself. But perhaps I can do better: time and again complaints about std.xml surface, so that is one indicator. See this Google query:
http://www.google.nl/search?q=site%3Awww.digitalmars.com%2Fpnews+std.xml

And this proposal to replace std.xml with kxml:
http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D&artnum=109646

Another point to consider is that dsource.org alone contains at least 5 XML projects (that I can see), some long abandoned but some that seem to be active. There is also this code from Adam Ruppe:
http://arsdnet.net/dcode/dom.d
Jun 27 2010
parent reply Justin Johansson <no spam.com> writes:
Lutger wrote:
 Justin Johansson wrote:
 
 Justin Johansson wrote:
 May I ask is anybody working on redeveloping std.xml in the D2/Phobos
 library?  (Currently it looks like it needs to be started over from
 scratch)

 Also what is the level of interest from library users for decent XML
 support in D2/Phobos?

 Cheers
 Justin Johansson

Simen said: Absolutely. It is a necessity. Thanks for fast replies from Lutger and Simen. Being an XML/W3C addict myself I well concur with Simen and Lutger's sentiments. However, Lutger simply saying that he "thinks" *many* people are interested is not good enough for me.

Well I dare only speak for myself. But perhaps I do can better: Time and again complaints about std.xml surface, so that is one indicator. See this google query: http://www.google.nl/search?q=site%3Awww.digitalmars.com%2Fpnews+std.xml And this proposal to replace std.xml with kxml: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D&artnum=109646 Another point to consider is that dsource.org alone contains at least 5 xml projects (that I can see), some long abandoned but some seem to be active. There is also this code from Adam Ruppe: http://arsdnet.net/dcode/dom.d

Yes, understood. One wonders how many people have been put off D (perhaps never to return) given such a large volume of hits under that URL, namely
http://www.google.nl/search?q=site%3Awww.digitalmars.com%2Fpnews+std.xml

On your other point about the 5+ XML projects on dsource.org: in your opinion, which of these has the most promise, or at least offers a good grounding from which to start over?

Naturally I don't expect you to waste your time looking into these 5+ projects; just if you happen to know.
Jun 27 2010
parent Lutger <lutger.blijdestijn gmail.com> writes:
Justin Johansson wrote:
...
 
 On your other point about the 5+ XML projects on dsource.org, in your
 opinion, which of these do you think have the most promise, or at least
 a good grounding from which to start over?
 
 Naturally I don't expect you to waste your time looking into these 5+
 project; just if you happen to know.

Don't know really, sorry. From a quick glance though, I would say to start looking into xmlp and Adam Ruppe's code. xmlp even has conformance tests, perhaps you can work together with the author?

http://www.dsource.org/projects/xmlp
http://www.digitalmars.com/d/archives/digitalmars/D/XMLP_101327.html#N101327
Jun 27 2010
prev sibling next sibling parent reply Justin Johansson <no spam.com> writes:
Adam Ruppe wrote:
 I'm not terribly interested in it because I already wrote my own
 replacement: http://arsdnet.net/dcode/dom.d
 
 Mine is biased toward HTML, doing what I personally find useful, or
 mimicing what javascript in the browser would do instead of following
 the standard, but if there's anything in there that is useful to
 others, you're free to take it.

Thanks Adam for replying. I'm happy to take onboard contra-views such as yours as well. Naturally there is no point in putting in an effort where there is no interest at large. Still, I'll wait for more replies on this ng before making any decision whether or not to commit myself to a new "D2 XML" effort.

btw. I feel it fair to add the conjecture that a DOM implementation is pretty basic stuff and that a complete XML ecosystem is much larger than just this (i.e. an in-memory DOM). There are all sorts of abstractions (Andrei: read ranges) and modeling that would form part of what I believe would be a major work, and one possibly even bigger than what one person like myself could ever hope to achieve. Of course, the mammoth effort by Michael Kay in producing the Saxon (Java-based) XSLT processor is a feat that few others will ever overshadow.

Cheers
Justin Johansson
Jun 27 2010
next sibling parent reply Justin Johansson <no spam.com> writes:
Adam Ruppe wrote:
 On 6/27/10, Justin Johansson <no spam.com> wrote:
 btw. I feel it fair to add conjecture that a DOM implementation
 is pretty basic stuff and that a complete XML ecosystem it much
 larger than just this (i.e. an in-memory DOM).

Yes, it is very simple, but so is all the XML I've ever actually encountered. I've seen ugly, convoluted HTML and I've seen name/value pairs in verbose XML format, but very very little in the middle. (Heck, I just used std.string.indexOf("<tagname") for quite a while.) This is probably due to my observation bias, with all my XML experience coming from working with web services.

Yeah, I understand where you are coming from; sometimes all you need is some simple DOM stuff which you can hack out yourself in a few hours.

OTOH, there are some really significant W3C specs that you may or may not be aware of, and these are really difficult to implement in regular imperative languages like C/C++ and Java. Java, thanks to the size of its following I guess, has had the most success in implementing these specs.

IMHO, the two most fundamental and significant W3C specs that D libraries could well address are as follows. These form a large part of the (formal) XML ecosystem.

XML Schema Part 2: Datatypes Second Edition
http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

and

XQuery 1.0 and XPath 2.0 Data Model (XDM)
http://www.w3.org/TR/2007/REC-xpath-datamodel-20070123/

I can tell you for sure that XPath 2.0, which is the basis for XSLT 2.0 and XQuery 1.0, is truly a challenge to implement in languages like C++ and Java. Others have succeeded with implementations in languages like Eiffel. I would hope, though, that D2 would be up to the task (or is that wishful thinking?).

Cheers
Justin Johansson
Jun 27 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/27/2010 10:16 AM, Justin Johansson wrote:
 OTOH, there are some really significant W3C specs that you may
 or may not be aware of and these are really difficult to implement
 in regular imperative languages like C/C++ and Java. Java,
 being all that is the following of Java I guess, has had the most
 success in implementing these specs.

 IMHO, the two most fundamental and significant W3C specs that
 D libraries could well address are as follows. These form a
 large amount of the (formal) XML ecosystem.

 XML Schema Part 2: Datatypes Second Edition
 http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

 and

 XQuery 1.0 and XPath 2.0 Data Model (XDM)
 http://www.w3.org/TR/2007/REC-xpath-datamodel-20070123/

 I can tell you for sure that XPath 2.0, which is the basis
 for XSLT 2.0 and XQuery 1.0, is truly a challenge to implement
 in languages like C++ and Java. Others have succeeded with
 implementations in languages like Eiffel. I would hope though,
 that D2 would be up to the task (is that is wishful thinking?).

 Cheers
 Justin Johansson

For the sake of us uninformed spectators, could you give a little taste of the challenges to which you refer?
Jun 27 2010
parent reply Justin Johansson <no spam.com> writes:
Ellery Newcomer wrote:
 On 06/27/2010 10:16 AM, Justin Johansson wrote:
 OTOH, there are some really significant W3C specs that you may
 or may not be aware of and these are really difficult to implement
 in regular imperative languages like C/C++ and Java. Java,
 being all that is the following of Java I guess, has had the most
 success in implementing these specs.

 IMHO, the two most fundamental and significant W3C specs that
 D libraries could well address are as follows. These form a
 large amount of the (formal) XML ecosystem.

 XML Schema Part 2: Datatypes Second Edition
 http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

 and

 XQuery 1.0 and XPath 2.0 Data Model (XDM)
 http://www.w3.org/TR/2007/REC-xpath-datamodel-20070123/

 I can tell you for sure that XPath 2.0, which is the basis
 for XSLT 2.0 and XQuery 1.0, is truly a challenge to implement
 in languages like C++ and Java. Others have succeeded with
 implementations in languages like Eiffel. I would hope though,
 that D2 would be up to the task (is that is wishful thinking?).

 Cheers
 Justin Johansson

For the sake of us uninformed spectators, could you give a little taste of the challenges to which you refer?

Writing an XML parser in itself is pretty much basic CS101 stuff. The tough challenges come with implementing the other W3C specs in the XML ecosystem, such as XSchema and XPath 2.0, for the reason that these are such humongous and complex beasts.

An XSchema implementation forms the basis for writing an XML content validator, and that's a pretty important tool to have for a lot of XML processing. An XPath 2.0 implementation forms the core of XSLT 2.0 and XQuery, which are XML transformation languages. Again, these are very useful tools.

The most successful implementations of XSchema and XPath 2.0 are written in Java. This is probably mostly due to the widespread popularity of Java and there being very many open source volunteers to do the grunt work. If you look at any of the Java sources for these XML projects, you will be astounded at just how big they are, like the Saxon Java XSLT processor by Michael Kay for example*. Of course you will be secretly thinking to yourself that the size of these works would be considerably smaller if they were written in D :-)

(*Michael Kay has spent the last ten years working on it.)

In the C++ world of Qt, there is the Qt XmlPatterns library, which implements XPath 2.0. It is also quite sizable and currently incomplete (implementing only about 70% of the W3C spec), and there are a whole bunch of (former TrollTech?) people at Nokia working on it, again demonstrating that implementing these W3C specs is no simple feat.

If you are really interested, try downloading a copy of the Qt source from Nokia and take a look at the C++ code in the XmlPatterns library. From that you will surely get more than just a taste of the challenges, you will get a whole mouthful! :-)

http://qt.nokia.com/downloads/

Cheers
Justin Johansson
Jun 28 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 06/28/2010 08:13 AM, Justin Johansson wrote:
 If you look at any of the Java sources for these XML projects, you will
 be astounded just how big they are, like the Saxon Java XSLT processor
 by Michael Kay for example*. Of course you will be secretly thinking to
 yourself that the size these works would be considerably smaller if they
 were written in D :-)

 (*Michael Kay has spent the last ten years working on it.)

 In the C++ world of Qt, there is the Qt XmlPatterns library which
 implements XPath 2.0 which is also quite sizable and currently
 incomplete (implementing only about 70% of the W3C spec) and there
 are a whole bunch of (former TrollTech?) people at Nokia working on it,
 again demonstrating that implementing these W3C specs is no simple feat.

Sounds ominous. Like you'd need a serious team if you actually wanted to do any of this stuff.
Jun 29 2010
parent Justin Johansson <no spam.com> writes:
Ellery Newcomer wrote:
 On 06/28/2010 08:13 AM, Justin Johansson wrote:
 If you look at any of the Java sources for these XML projects, you will
 be astounded just how big they are, like the Saxon Java XSLT processor
 by Michael Kay for example*. Of course you will be secretly thinking to
 yourself that the size these works would be considerably smaller if they
 were written in D :-)

 (*Michael Kay has spent the last ten years working on it.)

 In the C++ world of Qt, there is the Qt XmlPatterns library which
 implements XPath 2.0 which is also quite sizable and currently
 incomplete (implementing only about 70% of the W3C spec) and there
 are a whole bunch of (former TrollTech?) people at Nokia working on it,
 again demonstrating that implementing these W3C specs is no simple feat.

Sounds ominous. Like you'd need a serious team if you actually wanted to do any of this stuff.

Yep, a serious team armed with a serious programming language that is capable of realizing an event horizon to explode the singularity that is the black hole of the W3C specs, ultimately creating works of sheer beauty et ordo ab chao (and order out of chaos). D2? :-)

Cheers
Justin Johansson
Jun 29 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Justin Johansson wrote:
 Adam Ruppe wrote:
 I'm not terribly interested in it because I already wrote my own
 replacement: http://arsdnet.net/dcode/dom.d

 Mine is biased toward HTML, doing what I personally find useful, or
 mimicing what javascript in the browser would do instead of following
 the standard, but if there's anything in there that is useful to
 others, you're free to take it.

Thanks Adam for replying. I'm happy to take onboard contra-views such as yours as well. Naturally it is no point in putting in an effort wherein there is no interest at large. Still, I'll wait for more replies on this ng before making any decision whether or not to commit myself to a new "D2 XML" effort.

Clearly std.xml can't stay the way it is. I'm even thinking of removing it preemptively in wait for another implementation.

If you want to work on something you enjoy, it seems like std.xml is a good choice. If you want to work on the top most important item, probably networking would come ahead. We badly need http and ftp streaming libraries. I'm thinking libcurl would be a good choice as a backend (not interface). For D integration, it would be great to integrate networking with std.stdio.File - e.g. creating File("http://xyz.org") would just connect to the thing and allow streaming, ranges, everything. Adam Ruppe has a lower-level networking protocol that also hooks into std.stdio.File, which would be very important to have too.

But then it's often better to work on what you like, so don't look for a landslide vote. Ford didn't work on a faster horse etc. Some things that would be good to have in an xml library:

- should work with input ranges (not only strings)

- use aliases as lambdas if needed (std.xml's use of lambdas is nice, just very slow)

- define templates for char, wchar, and dchar and then define one working with ranges of ubyte that dispatches depending on the encoding tag found.

Andrei
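
The last wish-list item (a byte-level front end that dispatches on the encoding found) can be sketched roughly as follows. This is only an illustration in Python rather than D, and the names sniff_encoding and decode_xml are made up for the example. It assumes the declaration, if present, is in an ASCII-compatible encoding; a BOM-less UTF-16 or UTF-32 document would need the extra heuristics from the XML spec, which are left out here.

```python
import codecs
import re

# Byte-order marks, UTF-32 first so that its LE mark (FF FE 00 00)
# is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the
# very start of the input (ASCII-compatible encodings only).
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(data):
    """Return (encoding name, number of BOM bytes to skip)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower(), 0
    # XML's default when there is no BOM and no declared encoding.
    return "utf-8", 0


def decode_xml(data):
    """Dispatch to the right decoder and return the document as text."""
    encoding, skip = sniff_encoding(data)
    return data[skip:].decode(encoding)
```

In a D version the per-encoding work would presumably live in the char/wchar/dchar templates, with only the sniffing step operating on raw ubyte input.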
Jun 27 2010
next sibling parent Justin Johansson <no spam.com> writes:
Andrei Alexandrescu wrote:
 Justin Johansson wrote:
 Adam Ruppe wrote:
 I'm not terribly interested in it because I already wrote my own
 replacement: http://arsdnet.net/dcode/dom.d

 Mine is biased toward HTML, doing what I personally find useful, or
 mimicing what javascript in the browser would do instead of following
 the standard, but if there's anything in there that is useful to
 others, you're free to take it.

Thanks Adam for replying. I'm happy to take onboard contra-views such as yours as well. Naturally it is no point in putting in an effort wherein there is no interest at large. Still, I'll wait for more replies on this ng before making any decision whether or not to commit myself to a new "D2 XML" effort.

Clearly std.xml can't stay the way it is. I'm even thinking of removing it preemptively in wait for another implementation. If you want to work on something you enjoy, it seems like std.xml is a good choice. If you want to work on the top most important item, probably networking would come ahead. We badly need http and ftp streaming libraries. I'm thinking libcurl would be a good choice as a backend (not interface). For D integration, it would be great to integrate networking with std.stdio.File - e.g. creating File("http://xyz.org") would just connect to the thing and allow streaming, ranges, everything. Adam Ruppe has a lower-level networking protocol that also hooks into std.stdio.File, which would be very important to have too. But then it's often better to work on what you like, so don't look for a landslide vote. Ford didn't work on a faster horse etc. Some things that would be good to have in an xml library: - should work with input ranges (not only strings) - use aliases as lambdas if needed (std.xml's use of lambdas is nice, just very slow) - define templates for char, wchar, and dchar and then define one working with ranges of ubyte that dispatches depending on the encoding tag found. Andrei

Thanks Andrei et al. I'll get back to the topic after some sleep and another day at the office tomorrow; it's way after the witching hour now in my neck of the woods.

Cheers, Justin
Jun 27 2010
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
Andrei Alexandrescu Wrote:

 Justin Johansson wrote:
 Adam Ruppe wrote:
 I'm not terribly interested in it because I already wrote my own
 replacement: http://arsdnet.net/dcode/dom.d

 Mine is biased toward HTML, doing what I personally find useful, or
 mimicing what javascript in the browser would do instead of following
 the standard, but if there's anything in there that is useful to
 others, you're free to take it.

Thanks Adam for replying. I'm happy to take onboard contra-views such as yours as well. Naturally it is no point in putting in an effort wherein there is no interest at large. Still, I'll wait for more replies on this ng before making any decision whether or not to commit myself to a new "D2 XML" effort.

Clearly std.xml can't stay the way it is. I'm even thinking of removing it preemptively in wait for another implementation.

I'd like to cast a vote for a SAX-style parser. A DOM parser can be built on top of it, and frankly, a SAX parser is the only kind I'd ever use. I'm either working with large streams where building a tree is impractical, or performance is enough of an issue that, again, building a tree is impractical. I have similar feelings about the JSON parser despite it being a pretty solid implementation otherwise. I'd contribute one if I could, but I did one for work and it just isn't worth the administrative hassle.
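
To make the layering concrete, here is a small sketch using Python's standard xml.sax module (not D, and not a proposal for the Phobos API). TreeBuilder and TagCounter are names invented for the example: the first builds a minimal tree out of SAX callbacks, the second is a streaming consumer that never builds a tree at all.

```python
import xml.sax


class TreeBuilder(xml.sax.ContentHandler):
    """Builds a minimal tree of dicts purely from SAX events."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []  # currently open elements, innermost last

    def startElement(self, name, attrs):
        node = {"name": name, "attrs": dict(attrs.items()),
                "children": [], "text": ""}
        if self._stack:
            self._stack[-1]["children"].append(node)
        else:
            self.root = node
        self._stack.append(node)

    def endElement(self, name):
        self._stack.pop()

    def characters(self, content):
        if self._stack:
            self._stack[-1]["text"] += content


class TagCounter(xml.sax.ContentHandler):
    """Streaming consumer: keeps only a counter per tag name."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def parse_to_tree(data):
    handler = TreeBuilder()
    xml.sax.parseString(data, handler)
    return handler.root


def count_tags(data):
    handler = TagCounter()
    xml.sax.parseString(data, handler)
    return handler.counts
```

Both consumers sit on the same event source; only the first pays for a tree.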
Jun 27 2010
prev sibling parent Justin Johansson <no spam.com> writes:
Andrei Alexandrescu wrote:
 Justin Johansson wrote:
 Clearly std.xml can't stay the way it is. I'm even thinking of removing 
 it preemptively in wait for another implementation.

Others in this thread have also suggested a preemptive removal of the current std.xml incarnation. Please add my vote to this in agreement that it *must* go. Its current state is well beyond absolutely shocking and only serves to bring D into disrepute. It would be much better to say, "sorry, D does not have a standard XML library yet"*** and ask for help rather than leaving things as they are.

***Reminds me of an old saying, "it's better to keep your mouth shut and appear to be an idiot than to open it and remove all doubt". (Of course, I do confess to opening mine once or twice too often ;-) ) Translated to std.xml, this means it is better not to have it at all than to have what we currently have.
 If you want to work on something you enjoy, it seems like std.xml is a 
 good choice. If you want to work on the top most important item, 
 probably networking would come ahead. We badly need http and ftp 
 streaming libraries. I'm thinking libcurl would be a good choice as a 
 backend (not interface). For D integration, it would be great to 
 integrate networking with std.stdio.File - e.g. creating 
 File("http://xyz.org") would just connect to the thing and allow 
 streaming, ranges, everything. Adam Ruppe has a lower-level networking 
 protocol that also hooks into std.stdio.File, which would be very 
 important to have too.

Sure, I do enjoy XML ecosystem stuff, and by that I mean well beyond just simple parsing. OTOH, streaming libraries for http and ftp are an absolute necessity to underpin industrial-strength handling of all things Unicode and XML, so I can see why you put this at the top of the wish list. Robust streaming should support not only all popular network protocols but content (character) encodings as well.

Since this thread has prompted a lot of ideas about std.xml, I think we would do well to start a new thread on streaming.

Cheers
Justin Johansson
 But then it's often better to work on what you like, so don't look for a 
 landslide vote. Ford didn't work on a faster horse etc. Some things that 
 would be good to have in an xml library:
 
 - should work with input ranges (not only strings)
 
 - use aliases as lambdas if needed (std.xml's use of lambdas is nice, 
 just very slow)
 
 - define templates for char, wchar, and dchar and then define one 
 working with ranges of ubyte that dispatches depending on the encoding 
 tag found.
 
 
 Andrei

Jun 29 2010
prev sibling next sibling parent Adam Ruppe <destructionator gmail.com> writes:
On 6/27/10, Justin Johansson <no spam.com> wrote:
 btw. I feel it fair to add conjecture that a DOM implementation
 is pretty basic stuff and that a complete XML ecosystem it much
 larger than just this (i.e. an in-memory DOM).

Yes, it is very simple, but so is all the XML I've ever actually encountered. I've seen ugly, convoluted HTML and I've seen name/value pairs in verbose XML format, but very very little in the middle. (Heck, I just used std.string.indexOf("<tagname") for quite a while.) This is probably due to my observation bias, with all my XML experience coming from working with web services.
Jun 27 2010
prev sibling next sibling parent Jesse Phillips <jessekphillips+D gmail.com> writes:
On Sun, 27 Jun 2010 16:55:56 +0200, Lutger wrote:

 Don't know really, sorry. From a quick glance though, I would say to
 start looking into xmlp and Adam Ruppe's code. xmlp even has conformance
 tests, perhaps you can work together with the author?
 
 http://www.dsource.org/projects/xmlp
 http://www.digitalmars.com/d/archives/digitalmars/D/

I needed a simple library for parsing XML and started using xmlp since I had too many workarounds in std.xml. I emailed the author a while back about some possible changes (it complained about namespaces when reading, I believe, a docx file, so I disabled the check). But I haven't heard back from him.
Jun 27 2010
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"Justin Johansson" <no spam.com> wrote in message 
news:i07jpn$tt$1 digitalmars.com...
 Justin Johansson wrote:
 May I ask is anybody working on redeveloping std.xml in the D2/Phobos 
 library?  (Currently it looks like it needs to be started over from 
 scratch)

 Also what is the level of interest from library users for decent XML 
 support in D2/Phobos?

 Cheers
 Justin Johansson

Lutger said: Interested, very much so. I think many people are. Simen said: Absolutely. It is a necessity. Thanks for fast replies from Lutger and Simen. Being an XML/W3C addict myself I well concur with Simen and Lutger's sentiments. However, Lutger simply saying that he "thinks" *many* people are interested is not good enough for me. Would the *many* people also please add a voice along with Simen et. al. to inspire me to contribute to the D2 XML effort (of course under a Walter-endorsed style of licence).

I'm interested. I've been thinking of porting some of my stuff from D1/Tango to D2/Phobos (*not* because of political reasons or anything against Tango or any Tango team member), so anything that I'm using from Tango that doesn't have a good Phobos equivalent would be a roadblock. XML (reading) is one of those things.
Jun 27 2010
prev sibling next sibling parent Adam Ruppe <destructionator gmail.com> writes:
I'm not terribly interested in it because I already wrote my own
replacement: http://arsdnet.net/dcode/dom.d

Mine is biased toward HTML, doing what I personally find useful, or
mimicking what javascript in the browser would do instead of following
the standard, but if there's anything in there that is useful to
others, you're free to take it.
Jun 27 2010
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2010-06-27 12:34, Justin Johansson wrote:
 May I ask is anybody working on redeveloping std.xml in the D2/Phobos
 library? (Currently it looks like it needs to be started over from scratch)

 Also what is the level of interest from library users for decent XML
 support in D2/Phobos?

 Cheers
 Justin Johansson

I would very much like to have XML support in Phobos; I think all standard libraries should have that. I'm currently working on porting the XML archive of my serialization library to D2, so I need XML support in Phobos.

-- 
/Jacob Carlborg
Jun 27 2010
prev sibling next sibling parent jpf <spam example.com> writes:
On 27.06.2010 12:34, Justin Johansson wrote:
 May I ask is anybody working on redeveloping std.xml in the D2/Phobos
 library?  (Currently it looks like it needs to be started over from
 scratch)
 
 Also what is the level of interest from library users for decent XML
 support in D2/Phobos?
 
 Cheers
 Justin Johansson

I would also like to have a better std.xml. I still have an old D1/Tango project I wanted to port to D2/Phobos but as long as there is no good XML library for D2 (and totally unrelated: a stable network api) that wouldn't make sense.

-- 
Johannes Pfau
Jun 27 2010
prev sibling next sibling parent reply "Yao G." <nospamyao gmail.com> writes:
I did a simple implementation of a pull parser, using this API as  
reference: http://xmlpull.org/

But I used an iterator similar to the one used by Steve (from dcollections)
to parse the doc. It turns out that Tango did something similar first
(using an iterator to parse the document), and seeing the debacle caused by
the Date module, I think it would be a bad idea to release it.


Yao G.


On Sun, 27 Jun 2010 05:34:30 -0500, Justin Johansson <no spam.com> wrote:

 May I ask is anybody working on redeveloping std.xml in the D2/Phobos  
 library?  (Currently it looks like it needs to be started over from  
 scratch)

 Also what is the level of interest from library users for decent XML  
 support in D2/Phobos?

 Cheers
 Justin Johansson

Jun 27 2010
next sibling parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 28/06/2010 13:04, Steven Schveighoffer wrote:
 On Sun, 27 Jun 2010 14:56:21 -0400, Yao G. <nospamyao gmail.com> wrote:

 I did a simple implementation of a pull parser, using this API as
 reference: http://xmlpull.org/

 But I used a iterator similar to the one used by Steve (from
 dcollections) to parse the doc. It turns out that Tango did something
 similar first (using iterator to parse the document), and seeing the
 debacle caused by the Date module, I think it would be a bad idea to
 release it.

Did you look at Tango's code in question, or look at their documentation? If not, then you are fine. I think any implementation is going to have to at least try to use ranges or show why they are not a good idea for xml, since Andrei is set on using ranges for everything. BTW, I've not used std.xml or tango's xml, but I agree that an xml library is a very important part of today's standard libraries. Having xml in the standard allows for so much usage of it in many other places (serialization comes to mind immediately). If std.xml is bad (which I've heard from several independent people), then throw it out and make something new. I myself have tried to think of how xml can be done with ranges, but I believe one of the key elements is it has to parse xml without loading the entire document to be efficient enough for some applications. A DOM style parser which presents a range interface is probably fine, but a lazy interface would be the best. Since XML is a tree style, you need a range which allows moving down the tree. You almost need a stacking range which can move down the tree and also to the next sibling element. Ideally, the library should do as much as possible without allocating anything but buffer space to read data. -Steve

I've not looked at any of the D XML offerings (shame on me?) but I've been having a bit of a look at the types of API that are available in other languages, and there seem to be 3...

Event based a la SAX
Stream based a la StAX
Tree based a la "the" DOM

The simple conclusion that I have drawn is that there is no one-size-fits-all solution, and that it would therefore be a mistake to put all effort into supporting only one. (However, ranges do seem to match up quite nicely with the way that the Stream based APIs operate.)

It would seem to me most logical to consider the many varied use-cases and build a core API upon which all 3 types of XML processor can be built (or at least specify a core set of types to be used by all 3), rather than focus on implementing one particular style. Interoperability of all 3 styles would then be possible and perhaps facilitate the later implementation of higher abstractions (such as XPath and XQuery).

I think it is also important to remember that there are at least 4 different stages to processing XML (reading, validating, mutating, writing) and that many programming tasks allow one or more of these aspects to be ignored. This can mean that one programmer is blinded to the requirements of another in a different domain because the ways in which they work with XML either overlap only partially or not at all.

I've never used anything like SAX myself, though I have used the DOM quite a lot, and spent most of the time wishing it worked a bit more like StAX (even though I hadn't heard of StAX at the time ^^).

Whatever is done for D, it should allow programmers to work with XML in a way that is familiar to them and compatible with what others do. Memory should be used conservatively, and reprocessing (parsing the same portion of a document multiple times) should be minimised. Most importantly, the implementation should be D-ey, rather than the abstraction used in any other language's most favoured solution, shoehorned into a D-shaped box.

A... (whose 2 cents are worth no more or no less than anyone else's.)
Jun 28 2010
parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 28/06/2010 15:11, Steven Schveighoffer wrote:

 Yes, I don't think the phobos solution needs to mimic exactly the API of
 SAX or DOM, the author should be free to use D idioms. But starting with
 a common proven design is probably a good idea.

 -Steve

I've been thinking about it, and while I believe you when you say that SAX can be used to build the DOM, I'm not convinced that SAX is the lowest common abstraction. Michel Fortin's Tokenizer/Range seems much closer to the metal to me. A...
Jun 29 2010
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2010-06-29 04:41:50 -0400, Alix Pexton <alix.DOT.pexton gmail.DOT.com> said:

 On 28/06/2010 15:11, Steven Schveighoffer wrote:
 
 Yes, I don't think the phobos solution needs to mimic exactly the API of
 SAX or DOM, the author should be free to use D idioms. But starting with
 a common proven design is probably a good idea.
 
 -Steve

I've been thinking about it, and while I believe you when you say that SAX can be used to build the DOM, I'm not convinced that SAX is the lowest common abstraction. Michel Fortin's Tokenizer/Range seems much closer to the metal to me.

It is closer to the metal, but there's a catch... One issue with SAX is that you must allocate an array of strings to pass the attributes of an element, which is probably going to need a dynamic allocation at some point. A lower-level abstraction such as mine (or Tango's pull-parser) just returns each attribute as a separate token as it parses them. The downside of the tokenizer interface is that it only checks for a subset of well-formedness: for instance, it doesn't check that tags balance each other correctly or that there are no two attributes with the same name. It's just a "tokenizer" after all; it can't be described as a conformant XML parser by itself. The upper layer parser needs to check for these things. My mini DOM built on this tokenizer does these checks when using the tokenizer, and it's more efficient to do them there because that's where the context information is kept, which is why the tokenizer doesn't do them. Implementing SAX on top of my tokenizer consists mostly of ensuring proper tag balancing, checking for duplicate attributes, and collecting attributes in an array (or another kind of list) you can then give to the openElement SAX callback. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
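To make the layering concrete, here is a minimal sketch of the three jobs listed above (tag balancing, duplicate-attribute checks, collecting attributes before the open-element callback fires). It is not taken from mfr.xmltok: the Token layout, the OpenElement/CloseElement aliases and the driveSax function are all hypothetical names invented for illustration, and the token stream is assumed to arrive as a plain array.

```d
import std.exception : enforce;

enum TokenKind { openTag, attribute, closeTag, text }

// Hypothetical token layout; a real tokenizer may expose something different.
struct Token
{
    TokenKind kind;
    string name;  // tag or attribute name; unused for text
    string value; // attribute value or text content
}

// Hypothetical SAX-style callbacks.
alias OpenElement = void delegate(string name, string[string] attributes);
alias CloseElement = void delegate(string name);

/// Turns a flat token stream into SAX-style events, adding the checks
/// a bare tokenizer leaves out.
void driveSax(Token[] tokens, OpenElement onOpen, CloseElement onClose)
{
    string[] openTags;           // stack of currently open element names
    string pendingName;          // element whose attributes are being collected
    string[string] pendingAttrs; // attributes seen so far for that element
    bool pending;                // true between a start tag and its first child

    // Emit the delayed open-element event once all attributes are known.
    void flush()
    {
        if (!pending) return;
        onOpen(pendingName, pendingAttrs);
        pendingAttrs = null;
        pending = false;
    }

    foreach (tok; tokens)
    {
        final switch (tok.kind)
        {
        case TokenKind.openTag:
            flush();
            openTags ~= tok.name;
            pendingName = tok.name;
            pending = true;
            break;
        case TokenKind.attribute:
            enforce(pending, "attribute outside of a start tag");
            enforce(tok.name !in pendingAttrs, "duplicate attribute: " ~ tok.name);
            pendingAttrs[tok.name] = tok.value;
            break;
        case TokenKind.text:
            flush();
            break;
        case TokenKind.closeTag:
            flush();
            enforce(openTags.length > 0 && openTags[$ - 1] == tok.name,
                    "unbalanced closing tag: " ~ tok.name);
            openTags = openTags[0 .. $ - 1];
            onClose(tok.name);
            break;
        }
    }
    flush();
    // The message is a lazy argument, so openTags[$ - 1] is only read on failure.
    enforce(openTags.length == 0, "unclosed element: " ~ openTags[$ - 1]);
}
```

The associative array is where the dynamic allocation mentioned above shows up: the tokenizer itself never needs it, only the SAX-shaped layer does.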
Jun 29 2010
parent Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 29/06/2010 13:27, Michel Fortin wrote:
 On 2010-06-29 04:41:50 -0400, Alix Pexton
 <alix.DOT.pexton gmail.DOT.com> said:

 On 28/06/2010 15:11, Steven Schveighoffer wrote:

 Yes, I don't think the phobos solution needs to mimic exactly the API of
 SAX or DOM, the author should be free to use D idioms. But starting with
 a common proven design is probably a good idea.

 -Steve

I've been thinking about it, and while I believe you when you say that SAX can be used to build the DOM, I'm not convinced that SAX is the lowest common abstraction. Michel Fortin's Tokenizer/Range seems much closer to the metal to me.

It is closer to the metal, but there's a catch... One issue with SAX is that you must allocate an array of strings to pass the attributes of an element, which is probably going to need a dynamic allocation at some point. A lower-level abstraction such as mine (or Tango's pull-parser) just returns each attribute as a separate token as it parses them. The downside of the tokenizer interface is that it only checks for a subset of well-formedness: for instance, it doesn't check that tags balance each other correctly or that there are no two attributes with the same name. It's just a "tokenizer" after all; it can't be described as a conformant XML parser by itself. The upper layer parser needs to check for these things. My mini DOM built on this tokenizer does these checks when using the tokenizer, and it's more efficient to do them there because that's where the context information is kept, which is why the tokenizer doesn't do them. Implementing SAX on top of my tokenizer consists mostly of ensuring proper tag balancing, checking for duplicate attributes, and collecting attributes in an array (or another kind of list) you can then give to the openElement SAX callback.

My understanding was that SAX _doesn't_ check those things either and that it was up to the code responding to the events to tackle well-formedness. After all, if SAX handled well-formedness, there would be no need for it to pass an argument to closeElement to state what element was being closed. SAX has its place though: when it comes to doing a single pass filter on a stream of XML that can be assumed to be well-formed, its simplicity is admittedly hard to beat. In other applications, however, there is much room for improvement. SAXplus, with built-in element memoisation, an element stack and a used id list sounds quite useful to me, as long as they remain optional of course. Admittedly, my initial disappointment when looking into SAX means that it is something that I have not followed for some time. Hmm, I suddenly just got nostalgic for the days when XML was all shiny and new and everyone was writing their own APIs or butchering old SGML/HTML tech. Makes me want to go look at my old code ^^ A...
Jun 29 2010
prev sibling parent BLS <windevguy hotmail.de> writes:
On 28/06/2010 14:04, Steven Schveighoffer wrote:
 I myself have tried to think of how xml can be done with ranges, but I
 believe one of the key elements is it has to parse xml without loading
 the entire document to be efficient enough for some applications.  A DOM
 style parser which presents a range interface is probably fine, but a
 lazy interface would be the best.  Since XML is a tree style, you need a
 range which allows moving down the tree.  You almost need a stacking
 range which can move down the tree and also to the next sibling
 element.  Ideally, the library should do as much as possible without
 allocating anything but buffer space to read data.

 -Steve

Hi Steve, Philippe Sigaud has written a very interesting lib. called dranges. http://www.dsource.org/projects/dranges/wiki I think treerange.d and graphrange.d are an excellent source of inspiration. Bjoern
Jun 28 2010
prev sibling next sibling parent Bernard Helyer <b.helyer gmail.com> writes:
On Sun, 27 Jun 2010 20:04:30 +0930, Justin Johansson wrote:

 May I ask is anybody working on redeveloping std.xml in the D2/Phobos
 library?  (Currently it looks like it needs to be started over from
 scratch)
 
 Also what is the level of interest from library users for decent XML
 support in D2/Phobos?
 
 Cheers
 Justin Johansson

std.xml needs to be replaced, but I personally don't much care as kxml fits my needs nicely: http://opticron.no-ip.org/svn/branches/kxml
Jun 27 2010
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2010-06-27 07:04:30 -0400, Justin Johansson <no spam.com> said:

 May I ask is anybody working on redeveloping std.xml in the D2/Phobos 
 library?  (Currently it looks like it needs to be started over from 
 scratch)
 
 Also what is the level of interest from library users for decent XML 
 support in D2/Phobos?

I have made my own parser, comprised of a tokenizer and a mini DOM layer. I'm not sure how to qualify the tokenizer: it's mainly based on callbacks like an event parser, but a callback can decide to stop the parsing process and return to the original caller of the tokenizer (which can later restart parsing), it can choose to continue parsing the next token, or to recursively continue to run the parser using a different set of callbacks. From there it's trivial to efficiently implement a pull parser or a SAX parser, but the way callbacks can recursively call the tokenizer allows greater flexibility than those two models. The mini DOM I've made is based on this tokenizer, but is quite ordinary in comparison. Here's the generated documentation: http://michelf.com/docs/d/mfr/xmltok.html http://michelf.com/docs/d/mfr/xml.html I'm slowly revamping it to use ranges instead of strings. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
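A rough sketch of the control flow described above, under heavy assumptions: the Tokenizer struct, the Action enum and the pull helper are hypothetical names, not the mfr.xmltok API, and the "grammar" is reduced to `<...>` tags and text. It only illustrates how a callback's return value can stop the tokenizer so the caller resumes later, and how a pull-style call falls out of that.

```d
import std.string : indexOf;

// What a callback asks the tokenizer to do next ("continue" is a D keyword).
enum Action { proceed, stop }

struct Tokenizer
{
    string rest; // unparsed remainder; kept between calls so parsing can resume

    /// Feeds tokens to the callbacks until the input is exhausted or a
    /// callback returns Action.stop. Returns true when stopped early.
    bool run(scope Action delegate(string tag) onTag,
             scope Action delegate(string text) onText)
    {
        while (rest.length)
        {
            Action act;
            if (rest[0] == '<')
            {
                auto end = rest.indexOf('>');
                if (end < 0) throw new Exception("unterminated tag");
                auto tag = rest[1 .. end];
                rest = rest[end + 1 .. $]; // advance first, so a callback may call run again
                act = onTag(tag);
            }
            else
            {
                auto end = rest.indexOf('<');
                if (end < 0) end = cast(ptrdiff_t) rest.length;
                auto text = rest[0 .. end];
                rest = rest[end .. $];
                act = onText(text);
            }
            if (act == Action.stop) return true;
        }
        return false;
    }
}

struct Pulled { bool isTag; string value; }

/// Pull-style access built on the callbacks: take one token, then stop.
bool pull(ref Tokenizer t, out Pulled result)
{
    Pulled got;
    immutable stopped = t.run(
        (string tag)  { got = Pulled(true, tag);   return Action.stop; },
        (string text) { got = Pulled(false, text); return Action.stop; });
    result = got;
    return stopped;
}

unittest
{
    auto t = Tokenizer("<a>x<b>y</b>z</a>");
    Pulled p;
    assert(pull(t, p) && p == Pulled(true, "a"));  // pull two tokens...
    assert(pull(t, p) && p == Pulled(false, "x"));

    string[] seen; // ...then switch to event style for the remainder
    t.run((string tag)  { seen ~= tag;  return Action.proceed; },
          (string text) { seen ~= text; return Action.proceed; });
    assert(seen == ["b", "y", "/b", "z", "/a"]);
}
```

Because `rest` is advanced before each callback runs, a callback in this sketch could also call `run` on the same tokenizer with a different pair of delegates, which is the recursive case mentioned in the post.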
Jun 28 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Michel Fortin wrote:
 On 2010-06-27 07:04:30 -0400, Justin Johansson <no spam.com> said:
 
 May I ask is anybody working on redeveloping std.xml in the D2/Phobos 
 library?  (Currently it looks like it needs to be started over from 
 scratch)

 Also what is the level of interest from library users for decent XML 
 support in D2/Phobos?

I have made my own parser, comprised of a tokenizer and a mini DOM layer. I'm not sure how to qualify the tokenizer: it's mainly based on callbacks like an event parser, but a callback can decide to stop the parsing process and return to the original caller of the tokenizer (which can later restart parsing), it can choose to continue parsing the next token, or to recursively continue to run the parser using a different set of callbacks. From there it's trivial to efficiently implement a pull parser or a SAX parser, but the way callbacks can recursively call the tokenizer allows greater flexibility than those two models. The mini DOM I've made is based on this tokenizer, but is quite ordinary in comparison. Here's the generated documentation: http://michelf.com/docs/d/mfr/xmltok.html http://michelf.com/docs/d/mfr/xml.html I'm slowly revamping it to use ranges instead of strings.

I think a tokenizer should be a higher-order range that is fed an input range of ubyte, char, wchar, or dchar (so that would be a type parameter) and is itself a range of Tokens that include the token type, token value etc. Andrei
Jun 28 2010
parent Michel Fortin <michel.fortin michelf.com> writes:
On 2010-06-28 14:27:13 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Here's the generated documentation:
 
 http://michelf.com/docs/d/mfr/xmltok.html
 http://michelf.com/docs/d/mfr/xml.html
 
 I'm slowly revamping it to use ranges instead of strings.

I think a tokenizer should be a higher-order range that is fed an input range of ubyte, char, wchar, or dchar (so that would be a type parameter) and is itself a range of Tokens that include the token type, token value etc.

And I've implemented a tokenizer range just like you describe on top of my tokenizer function. Look at the documentation for mfr.xmltok.XMLForwardRange. (I should probably rename it to XMLTokenRange.) Personally, I prefer to use the callback approach which automatically calls the right function according to the token type. But what's nice about my tokenizer is that you can do both callbacks and pull-style tokenization (the latter can be wrapped in a range), and mix these approaches together as needed. What is missing is taking arbitrary ranges as input (it deals with strings currently). Strings are like the optimized case for tokenization because you don't have to dynamically allocate anything: referencing the original string is enough when making substrings. With arbitrary ranges you have to copy the text and tag names to a string one character at a time, which is less efficient. I don't want to write two separate parsers for this, so I'm trying to abstract things at the right level to maximize code reuse while keeping performance optimized for the string-as-input case, but how to do that is not so obvious. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
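For readers who want to see the string-as-input case spelled out, below is a minimal sketch of a token range whose tokens are slices of the original string. The names SliceTokenRange, Token and TokenKind are made up for this example and are not the mfr.xmltok types; it recognises only `<name>`, `</name>` and text, with no attributes, comments or entities. It does not attempt the generic input-range case, which is exactly the hard part described above.

```d
import std.range.primitives : isInputRange;
import std.string : indexOf;

enum TokenKind { openTag, closeTag, text }

struct Token
{
    TokenKind kind;
    string value; // a slice of the input string, never a copy
}

/// Input range of Tokens over a string.
struct SliceTokenRange
{
    private string rest;   // unparsed remainder of the input
    private Token current; // token returned by front
    private bool done;

    this(string input)
    {
        rest = input;
        popFront(); // prime the first token
    }

    @property bool empty() { return done; }
    @property Token front() { return current; }

    void popFront()
    {
        if (rest.length == 0) { done = true; return; }

        if (rest[0] == '<')
        {
            auto end = rest.indexOf('>');
            if (end < 0) throw new Exception("unterminated tag");
            if (end > 1 && rest[1] == '/')
                current = Token(TokenKind.closeTag, rest[2 .. end]);
            else
                current = Token(TokenKind.openTag, rest[1 .. end]);
            rest = rest[end + 1 .. $];
        }
        else
        {
            auto end = rest.indexOf('<');
            if (end < 0) end = cast(ptrdiff_t) rest.length;
            current = Token(TokenKind.text, rest[0 .. end]);
            rest = rest[end .. $];
        }
    }
}

static assert(isInputRange!SliceTokenRange);

unittest
{
    string doc = "<a>hi</a>";
    Token[] tokens;
    foreach (tok; SliceTokenRange(doc))
        tokens ~= tok;
    assert(tokens == [Token(TokenKind.openTag, "a"),
                      Token(TokenKind.text, "hi"),
                      Token(TokenKind.closeTag, "a")]);
    // The tag name points into the original document: nothing was copied.
    assert(tokens[0].value.ptr == doc.ptr + 1);
}
```

A version templated on an arbitrary character range would have to replace the slices with a buffer it fills one character at a time, which is where the extra allocation and copying come from.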
Jun 28 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 27 Jun 2010 14:56:21 -0400, Yao G. <nospamyao gmail.com> wrote:

 I did a simple implementation of a pull parser, using this API as  
 reference: http://xmlpull.org/

 But I used an iterator similar to the one used by Steve (from  
 dcollections) to parse the doc. It turns out that Tango did something  
 similar first (using iterator to parse the document), and seeing the  
 debacle caused by the Date module, I think it would be a bad idea to  
 release it.

Did you look at Tango's code in question, or look at their documentation? If not, then you are fine. I think any implementation is going to have to at least try to use ranges or show why they are not a good idea for xml, since Andrei is set on using ranges for everything. BTW, I've not used std.xml or tango's xml, but I agree that an xml library is a very important part of today's standard libraries. Having xml in the standard allows for so much usage of it in many other places (serialization comes to mind immediately). If std.xml is bad (which I've heard from several independent people), then throw it out and make something new. I myself have tried to think of how xml can be done with ranges, but I believe one of the key elements is it has to parse xml without loading the entire document to be efficient enough for some applications. A DOM style parser which presents a range interface is probably fine, but a lazy interface would be the best. Since XML is a tree style, you need a range which allows moving down the tree. You almost need a stacking range which can move down the tree and also to the next sibling element. Ideally, the library should do as much as possible without allocating anything but buffer space to read data. -Steve
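One possible shape for such a "stacking range", sketched here over an already-built tree purely to show the interface. Node, TreeCursor, descend, ascend and walk are hypothetical names; a lazy version would presumably keep parser state on the stack rather than arrays of nodes, which this sketch does not attempt.

```d
class Node
{
    string name;
    Node[] children;

    this(string name, Node[] children = null)
    {
        this.name = name;
        this.children = children;
    }
}

/// A range over one level of siblings, plus descend/ascend to change level.
struct TreeCursor
{
    // Remaining siblings at each depth; the last entry is the current level.
    private Node[][] stack;

    this(Node root) { stack = [[root]]; }

    @property bool empty() { return stack[$ - 1].length == 0; }
    @property Node front() { return stack[$ - 1][0]; }

    /// Move to the next sibling on the current level.
    void popFront() { stack[$ - 1] = stack[$ - 1][1 .. $]; }

    /// Move down: start iterating the children of the current front.
    void descend() { stack ~= front.children; }

    /// Move back up; the parent is still the front of its own level.
    void ascend()
    {
        assert(stack.length > 1, "already at the root level");
        stack = stack[0 .. $ - 1];
    }

    @property size_t depth() { return stack.length - 1; }
}

/// Depth-first walk written only in terms of the cursor operations.
string[] walk(Node root)
{
    string[] names;
    auto c = TreeCursor(root);
    while (true)
    {
        if (!c.empty)
        {
            names ~= c.front.name;
            c.descend();
        }
        else if (c.depth == 0)
            break;
        else
        {
            c.ascend();
            c.popFront();
        }
    }
    return names;
}

unittest
{
    auto tree = new Node("a", [new Node("b", [new Node("c")]), new Node("d")]);
    assert(walk(tree) == ["a", "b", "c", "d"]);
}
```

At any fixed depth the cursor satisfies the usual empty/front/popFront trio, so generic range algorithms could consume one level of siblings while the caller decides when to go down or up.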
Jun 28 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 28 Jun 2010 09:59:45 -0400, Alix Pexton  
<alix.DOT.pexton gmail.dot.com> wrote:

 I've never used anything like SAX myself, though I have used the DOM  
 quite a lot, and spent most of the time wishing it worked a bit more  
 like StAX (even though I hadn't heard of StAX at the time ^^).

DOM is usually built on top of SAX, so start with the lowest common denominator.
 Whatever is done for D, it should allow programmers to work with XML in  
 a way that is familiar to them and compatible with what others do.  
 Memory should be used conservatively, and reprocessing (parsing the same  
 portion of a document multiple times) should be minimised.

Parsing multiple times should be minimized, but more important than that, allocations should be minimal. Nothing kills a good parsing/input algorithm's performance in D more than overuse of the GC. Tango goes as far as having you pass in stack buffers to avoid even allocating buffers (not sure about its xml lib, but knowing the rest of the lib, probably); I don't think std.xml has to go that far.
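As an illustration of the caller-supplied buffer style (not Tango's actual API; decodeEntities is a hypothetical helper that knows only three entities), a function can return a slice of its input when nothing needs changing and otherwise write into a buffer the caller owns, so the GC is never touched on the success path:

```d
import std.algorithm.searching : startsWith;
import std.string : indexOf;

/// Decodes &lt; &gt; and &amp;. Returns the input slice untouched when it
/// contains no '&'; otherwise writes the decoded text into buf and returns
/// the used part of buf.
const(char)[] decodeEntities(const(char)[] text, char[] buf)
{
    if (text.indexOf('&') < 0)
        return text; // fast path: no copy, no allocation

    size_t n; // number of bytes written to buf so far
    while (text.length)
    {
        char c = text[0];
        size_t used = 1;
        if (text.startsWith("&lt;"))       { c = '<'; used = 4; }
        else if (text.startsWith("&gt;"))  { c = '>'; used = 4; }
        else if (text.startsWith("&amp;")) { c = '&'; used = 5; }

        if (n == buf.length)
            throw new Exception("buffer too small");
        buf[n++] = c;
        text = text[used .. $];
    }
    return buf[0 .. n];
}

unittest
{
    char[64] scratch; // lives on the stack
    string plain = "hello";
    assert(decodeEntities(plain, scratch[]).ptr is plain.ptr);
    assert(decodeEntities("a &lt; b &amp;&amp; c", scratch[]) == "a < b && c");
}
```

The cost of this style is that the result is only valid until the buffer is reused, which is part of why it may be more than a standard library needs to ask of its users.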
 Most importantly, the implementation should be D-ey, rather than the  
 abstraction used in any other language's most favoured solution,  
 shoehorned into a D-shaped box.

Yes, I don't think the phobos solution needs to mimic exactly the API of SAX or DOM, the author should be free to use D idioms. But starting with a common proven design is probably a good idea. -Steve
Jun 28 2010
prev sibling parent lurker <lurker mailinator.com> writes:
I'm very interested.

Tango's XML code was very good and damn fast. Maybe license issues can be worked
out for that part at least?
Jun 28 2010