digitalmars.D - The XML module in Phobos

llee <llee jhsph.edu> writes:

The std.xml module contains several bugs that need to be fixed. The most
important one is that the parser fails to parse empty elements (i.e. elements
that use the <tag name="value" /> format). I'd like to report this bug to the
module's maintainer, but I don't know whom to contact. (This is an old bug;
it's been around for at least a year, and I'm surprised that it has not been
fixed.)

Jul 27 2009

llee <llee jhsph.edu> writes:

llee Wrote:

 The std.xml module contains several bugs that need to be fixed. The most
important one is that the parser fails to parse empty elements (IE elements
that use the <tag name="value" /> format). I'd like to report this bug to the
modules' maintainer, but I don't know who to contact. (This is an old bug -
it's been around for at least a year and I'm surprised that it has not been
fixed).

I should also point out that the std.xml module does not have support for
namespaces.

Jul 27 2009

Kagamin <spam here.lot> writes:

llee Wrote:

 The std.xml module contains several bugs that need to be fixed. The most
important one is that the parser fails to parse empty elements (IE elements
that use the <tag name="value" /> format). I'd like to report this bug to the
modules' maintainer, but I don't know who to contact. (This is an old bug -
it's been around for at least a year and I'm surprised that it has not been
fixed).

The developer and maintainer was Janice Caron, but she(?) has been inactive
for some time.

Jul 27 2009

"Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:

llee wrote:
 The std.xml module contains several bugs that need to be fixed. The most
important one is that the parser fails to parse empty elements (IE elements
that use the <tag name="value" /> format). I'd like to report this bug to the
modules' maintainer, but I don't know who to contact. (This is an old bug -
it's been around for at least a year and I'm surprised that it has not been
fixed).

According to Andrei, std.xml is in line for a complete rewrite.

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=93547

-Lars

Jul 28 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Lars T. Kyllingstad wrote:
 llee wrote:
 The std.xml module contains several bugs that need to be fixed. The 
 most important one is that the parser fails to parse empty elements 
 (IE elements that use the <tag name="value" /> format). I'd like to 
 report this bug to the modules' maintainer, but I don't know who to 
 contact. (This is an old bug - it's been around for at least a year 
 and I'm surprised that it has not been fixed).

 
 According to Andrei, std.xml is in line for a complete rewrite.
 
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=93547

Yes, I think that will be necessary. Any volunteers? :o)

Andrei

Jul 28 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:

Andrei Alexandrescu wrote:
 Lars T. Kyllingstad wrote:
 llee wrote:
 The std.xml module contains several bugs that need to be fixed. The
 most important one is that the parser fails to parse empty elements
 (IE elements that use the <tag name="value" /> format). I'd like to
 report this bug to the modules' maintainer, but I don't know who to
 contact. (This is an old bug - it's been around for at least a year
 and I'm surprised that it has not been fixed).

 According to Andrei, std.xml is in line for a complete rewrite.

 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=93547

 
 
 Yes, I think that will be necessary. Any volunteers? :o)
 
 Andrei

There is already a high-performance one in Tango.  There must be some
way to avoid duplicating effort.

Jul 28 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Daniel Keep wrote:
 
 Andrei Alexandrescu wrote:
 Lars T. Kyllingstad wrote:
 llee wrote:
 The std.xml module contains several bugs that need to be fixed. The
 most important one is that the parser fails to parse empty elements
 (IE elements that use the <tag name="value" /> format). I'd like to
 report this bug to the modules' maintainer, but I don't know who to
 contact. (This is an old bug - it's been around for at least a year
 and I'm surprised that it has not been fixed).

 According to Andrei, std.xml is in line for a complete rewrite.

 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=93547

 Yes, I think that will be necessary. Any volunteers? :o)

 Andrei

 
 There is already a high-performance one in Tango.  There must be some
 way to avoid duplicating effort.

Unfortunately I'm not seeing any. Besides, it would be great if phobos' 
xml would work with ranges.

Andrei

Jul 28 2009

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:
 Unfortunately I'm not seeing any.

There's a simple solution: future D2 programmers will have both libs installed,
so they will just use the Tango XML reader. So the best solution is to remove
the XML reader from Phobos. The idea is to remove most inter-library
redundancy. The sooner this problem is solved, the better it will be for D2.

 Besides, it would be great if phobos' xml would work with ranges.

Translated, it becomes:
"Besides, it would be great if Tango's xml would work with ranges." I guess
Tango2 will use ranges.

Bye,
bearophile

Jul 28 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-07-28 10:09:04 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Yes, I think that will be necessary. Any volunteers? :o)
 
 Andrei

 
 There is already a high-performance one in Tango.  There must be some
 way to avoid duplicating effort.

 
 Unfortunately I'm not seeing any. Besides, it would be great if phobos' 
 xml would work with ranges.

I started something which could be used as a replacement of std.xml. 
Once it's a little more ready, I could contribute it to Phobos. It 
includes two tokenizer APIs (a templated callback tokenizer and a range 
tokenizer), and one DOM (not really based on the W3C DOM). The goal is 
to support all of XML at the tokenizer level, but skipping the internal 
subset of the doctype in the DOM.

Currently, the range API would be more useful if there was a way to 
switch on the type of an Algebraic.
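
For what it's worth, later versions of Phobos provide `std.variant.visit`, which gives roughly this kind of type switch. What follows is only a minimal sketch: `StartTag`, `Text`, `Token`, and `describe` are hypothetical names made up for illustration, not the types of the tokenizer described above.

```d
import std.stdio : writeln;
import std.variant : Algebraic, visit;

// Hypothetical token payloads, for illustration only.
struct StartTag { string name; }
struct Text     { string data; }

alias Token = Algebraic!(StartTag, Text);

// visit dispatches to the handler whose parameter type matches the
// type currently stored in the Algebraic.
string describe(Token tok)
{
    return tok.visit!(
        (StartTag t) => "start:" ~ t.name,
        (Text t)     => "text:" ~ t.data
    )();
}

void main()
{
    writeln(describe(Token(StartTag("root"))));
    writeln(describe(Token(Text("hello"))));
}
```

Because `visit` requires a handler for every type in the `Algebraic`, forgetting a token kind is caught at compile time rather than at run time.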

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jul 28 2009

Ary Borenszweig <ary esperanto.org.ar> writes:

Michel Fortin wrote:
 On 2009-07-28 10:09:04 -0400, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> said:
 
 Yes, I think that will be necessary. Any volunteers? :o)

 Andrei

 There is already a high-performance one in Tango.  There must be some
 way to avoid duplicating effort.

 Unfortunately I'm not seeing any. Besides, it would be great if 
 phobos' xml would work with ranges.

 
 I started something which could be used as a replacement of std.xml. 
 Once it's a little more ready, I could contribute it to Phobos. It 
 includes two tokenizer APIs (a templated callback tokenizer and a range 
 tokenizer), and one DOM (not really based on the W3C DOM). The goal is 
 to support all of XML at the tokenizer level, but skipping the internal 
 subset of the doctype in the DOM.
 
 Currently, the range API would be more useful if there was a way to 
 switch on the type of an Algebraic.

But *why* use or make another one when the Tango one is already 
excellent? :(

Jul 28 2009

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Tue, Jul 28, 2009 at 12:23:50PM -0300, Ary Borenszweig wrote:
 But *why* use or make another one when the Tango one is already 
 excellent? :(

Copyright.

-- 
Adam D. Ruppe
http://arsdnet.net

Jul 28 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-07-28 11:38:36 -0400, "Adam D. Ruppe" <destructionator gmail.com> said:

 On Tue, Jul 28, 2009 at 12:23:50PM -0300, Ary Borenszweig wrote:
 But *why* use or make another one when the Tango one is already
 excellent? :(

 
 Copyright.

That, and because there's some fun in doing it. Anyway, this is just 
practice before writing an HTML5 parser. ;-)

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jul 28 2009

language_fan <foo bar.com.invalid> writes:

Tue, 28 Jul 2009 11:38:36 -0400, Adam D. Ruppe thusly wrote:

 On Tue, Jul 28, 2009 at 12:23:50PM -0300, Ary Borenszweig wrote:
 But *why* use or make another one when the Tango one is already
 excellent? :(

 
 Copyright.

There are most likely several issues that prevent the reuse of that code. 
First, the indentation, module boundaries, and naming conventions may 
differ (tabs vs spaces, 4 vs 8 spaces, camelCase vs foo_bar etc.).

Next, does it use the slow object-oriented approach like the rest of
Tango (unlike Phobos, which uses a very lightweight procedural
model)? Are there any benchmark results that show the approach Tango uses
is any good, i.e. more performant than the ones for Java and C++ (even
with larger XML documents)? If so, then the idea can be copied to
Phobos as well.

Finally, the copyright is a problem unless it is handed over to 
digitalmars. Otherwise it might get troublesome to sell D later for 
commercial use when Phobos becomes the Standard library for D 2.0.

Jul 28 2009

Ary Borenszweig <ary esperanto.org.ar> writes:

language_fan wrote:
 Tue, 28 Jul 2009 11:38:36 -0400, Adam D. Ruppe thusly wrote:
 
 On Tue, Jul 28, 2009 at 12:23:50PM -0300, Ary Borenszweig wrote:
 But *why* use or make another one when the Tango one is already
 excellent? :(

 Copyright.

 
 There are most likely several issues that prevent the reuse of that code. 
 First, the indentation, module boundaries, and naming conventions may 
 differ (tabs vs spaces, 4 vs 8 spaces, camelCase vs foo_bar etc.).
 
 Next, does it use the slow object oriented approach like the rest of 
 Tango (and unlike Phobos, which uses a very lightweight procedural 
 model). Are there any benchmark results that show the approach Tango uses 
 is any good, i.e. more performant than the ones for Java and C++ (even 
 with larger xml documents). If it is, then the idea can be copied to 
 Phobos as well.

Yes, there are:

http://dotnot.org/blog/archives/2008/02/

And you can see they are pretty good. The object oriented approach is 
not a problem.

Jul 28 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-07-28 12:03:47 -0400, Ary Borenszweig <ary esperanto.org.ar> said:

 language_fan wrote:
 Tue, 28 Jul 2009 11:38:36 -0400, Adam D. Ruppe thusly wrote:
 
 Are there any benchmark results that show the approach Tango uses is 
 any good, i.e. more performant than the ones for Java and C++ (even 
 with larger xml documents). If it is, then the idea can be copied to 
 Phobos as well.

 
 Yes, there are:
 
 http://dotnot.org/blog/archives/2008/02/
 
 And you can see they are pretty good. The object oriented approach is 
 not a problem.

That's true, Tango's parser is simple and well done, and it's using
final (thus non-virtual) functions. Its being object-oriented only has a
negligible impact when you instantiate the parser.

I'm not writing my own parser because of any flaw in the Tango parser.
I'm aiming at providing some features not found in Tango (like optional
checking for well-formedness) without compromising on performance when
you don't need them (templates are good for that). I'll also try to
outperform Tango with callback parsing, but I expect it can only be
done by a tiny margin, if at all.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jul 28 2009

Lutger <lutger.blijdestijn gmail.com> writes:

language_fan wrote:

 Tue, 28 Jul 2009 11:38:36 -0400, Adam D. Ruppe thusly wrote:
 
 On Tue, Jul 28, 2009 at 12:23:50PM -0300, Ary Borenszweig wrote:
 But *why* use or make another one when the Tango one is already
 excellent? :(

 
 Copyright.

 
 There are most likely several issues that prevent the reuse of that code.
 First, the indentation, module boundaries, and naming conventions may
 differ (tabs vs spaces, 4 vs 8 spaces, camelCase vs foo_bar etc.).

Tango's naming conventions are quite similar to the style guidelines that
Walter Bright has written, probably closer than Phobos'. As for formatting,
you know, there are tools for that, and Descent even has the best formatter
ever.
 
 Next, does it use the slow object oriented approach like the rest of
 Tango (and unlike Phobos, which uses a very lightweight procedural
 model). Are there any benchmark results that show the approach Tango uses
 is any good, i.e. more performant than the ones for Java and C++ (even
 with larger xml documents). If it is, then the idea can be copied to
 Phobos as well.

Object-oriented does not mean slow. Tango's XML library outperforms the 
fastest C++ libraries, here are some benchmarks:
http://dotnot.org/blog/archives/2008/03/10/xml-benchmarks-updated-graphs-with-rapidxml/
 
 Finally, the copyright is a problem unless it is handed over to
 digitalmars. Otherwise it might get troublesome to sell D later for
 commercial use when Phobos becomes the Standard library for D 2.0.

I don't think (and I hope not) that Walter Bright & co. will sell the standard
library commercially, if that's even possible with the current copyright owners.
All that is needed is a license Walter Bright can live with, such as the
Boost one.

Seems like an excellent opportunity for leveraging open source, no?

Jul 28 2009

Kagamin <spam here.lot> writes:

Daniel Keep Wrote:

 There is already a high-performance one in Tango.  There must be some
 way to avoid duplicating effort.

Isn't it high-performance at the cost of not complying with the DOM
specification?

Jul 28 2009

Michael Rynn <michaelrynn optushome.com.au> writes:

On Mon, 27 Jul 2009 20:15:46 -0400, llee <llee jhsph.edu> wrote:

The std.xml module contains several bugs that need to be fixed. The most
important one is that the parser fails to parse empty elements (IE elements
that use the <tag name="value" /> format). I'd like to report this bug to the
modules' maintainer, but I don't know who to contact. (This is an old bug -
it's been around for at least a year and I'm surprised that it has not been
fixed).

I did look at the code for the xml module, and posted a suggested bug
fix to the empty elements problem. I do not have access rights to
updating the source repository, and at the time was too busy for this.

Now that my work position has recently been made redundant, I would
like to be considered for improving the std.xml module, or at least
find out what I would have to do to be up to scratch for this.

There are other possibilities of course, if you want a quick and ready
xml parser.

I had little trouble in compiling a static library version of
Expat 2.01 (probably oldish now), and linking D code to this.

I also made an attempt at creating D interfaces to the libxml Windows
DLLs, but when I run CodeBlocks in debug mode, with ddbg, there
is a crash if another DLL is linked in via an import module.

If not the xml module, then perhaps some lesser D library project to
get warmed up on.

I will have some time on my hands now, but perhaps not as much as I
might want to think, because my post-redundancy workshops and resume
preparation are telling me that finding a job is a full-time occupation.

In any case I could do something with std.xml (for D 2.0).

Pity I haven't seen any jobs on offer yet for D language programmers.

Jul 30 2009

Michael Rynn <michaelrynn optushome.com.au> writes:

On Thu, 30 Jul 2009 18:03:39 +1000, Michael Rynn
<michaelrynn optushome.com.au> wrote:
corrections..
I had little trouble in compiling a static library version of the
Expat 2.01

Whoops, I used an import library to the LibExpat.dll.
Pity I seen  jobs yet offering for D language programmers.

Any jobs in Sydney Australia for D language programmers..?

Jul 30 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Michael Rynn wrote:
 On Mon, 27 Jul 2009 20:15:46 -0400, llee <llee jhsph.edu> wrote:
 
 The std.xml module contains several bugs that need to be fixed. The most
important one is that the parser fails to parse empty elements (IE elements
that use the <tag name="value" /> format). I'd like to report this bug to the
modules' maintainer, but I don't know who to contact. (This is an old bug -
it's been around for at least a year and I'm surprised that it has not been
fixed).

 
 I did look at the code for the xml module, and posted a suggested bug
 fix to the empty elements problem. I do not have access rights to
 updating the source repository, and at the time was too busy for this.
 
 Now I am a state of my work position had been recently  made
 redundant, and I would like to be considered for improving the std xml
 module, or at least find out what I would have to do to be up to
 scratch for this.
 
 There are other possibilities of course, if you want a quick and ready
 xml parser.
 
 I had little trouble in compiling a static library version of the
 Expat 2.01 ( probably oldish now), and linking D code to this.
 
 I also made an attempt at creating D interfaces to the libxml windows
 DLLs.  Because when I run CodeBlocks in debug mode, with ddbg, their
 is a crash if another DLL is linked in via an import module.  
 
 If not the xml module, then perhaps some lesser D library project to
 get warmed up on.
 
 I will have some time on my hands now,but perhaps  not as much as I
 might want to think, because my post-redundacy workshops and resume
 preparation are telling me that finding a job is full time occupation.
 
 In any case I could do something with std.xml  ( for D2.0 ).
 
 Pity I seen  jobs yet offering for D language programmers.

It would be great if you could contribute to Phobos. Two things I hope
for from any replacement: (a) it works with ranges and ideally outputs
ranges, and (b) it uses alias functions instead of delegates if necessary.
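
To make point (b) concrete, here is a minimal sketch of the two callback styles. The names `eachWordDg`, `eachWord`, and `countWords` are hypothetical and only illustrate the idea; they are not an existing or proposed Phobos API.

```d
import std.array : split;
import std.stdio : writeln;

// Delegate style: the callback is a runtime value and is called indirectly.
void eachWordDg(string text, void delegate(string) sink)
{
    foreach (w; text.split())
        sink(w);
}

// Alias style: the callback is a compile-time parameter, so the compiler
// is free to inline it into the loop.
void eachWord(alias sink)(string text)
{
    foreach (w; text.split())
        sink(w);
}

// Both styles can capture local state such as a counter.
size_t countWords(string text)
{
    size_t n;
    eachWord!((string w) { ++n; })(text);
    return n;
}

void main()
{
    eachWordDg("alias versus delegate", (string w) { writeln(w); });
    writeln(countWords("alias versus delegate"));
}
```

The trade-off is that each distinct callback passed by alias produces a separate template instance, so the gain in speed may come with some growth in code size.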

Best of luck with your job search. As one looking for a job myself, 
yeah, it's a lot of work.


Andrei

Jul 30 2009

Benji Smith <dlanguage benjismith.net> writes:

 Michael Rynn wrote:
 I did look at the code for the xml module, and posted a suggested bug
 fix to the empty elements problem. I do not have access rights to
 updating the source repository, and at the time was too busy for this.


Andrei Alexandrescu wrote:
 It would be great if you could contribute to Phobos. Two things I hope 
 from any replacement (a) works with ranges and ideally outputs ranges, 
 (b) uses alias functions instead of delegates if necessary.

Interesting. Most XML parsers either produce a "Document" object, or 
they just execute SAX callbacks. If an XML parser returned a range 
object, how would you use it?

Usually, I use something like XPath to extract information from an XML
doc. Something like this:

    auto doc = parser.parse(xml);
    auto nodes = doc.select("/root//whatever[0][@id]");

I can see how you might do depth-first or breadth-first traversal of the
DOM tree, or inorder traversal of the SAX events, with a range. But
that's not how most people use XML. Are there other range tricks up
your sleeve that would support a DOM or XPath kind of model?
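
One possible answer, as a rough sketch only: a DOM can expose its nodes as a depth-first input range, and a simple query then becomes a filter over that range. `Node`, `DepthFirst`, and `selectByName` below are hypothetical names, not part of std.xml or Tango, and this covers only a `//name`-style lookup, not XPath in general.

```d
import std.algorithm.iteration : filter, map;
import std.array : array;
import std.stdio : writeln;

// Hypothetical minimal DOM node, for illustration only.
class Node
{
    string name;
    Node[] children;

    this(string name, Node[] children = null)
    {
        this.name = name;
        this.children = children;
    }
}

// Input range visiting a tree in document (depth-first, pre-order) order.
struct DepthFirst
{
    private Node[] stack;

    this(Node root) { if (root !is null) stack = [root]; }

    @property bool empty() const { return stack.length == 0; }
    @property Node front() { return stack[$ - 1]; }

    void popFront()
    {
        auto node = stack[$ - 1];
        stack = stack[0 .. $ - 1];
        // Push children right to left so the leftmost is visited next.
        foreach_reverse (child; node.children)
            stack ~= child;
    }
}

// Rough stand-in for a "//name" query: filter the traversal by name.
Node[] selectByName(Node root, string name)
{
    return DepthFirst(root).filter!(n => n.name == name).array;
}

void main()
{
    auto root = new Node("root", [new Node("a", [new Node("c")]), new Node("b")]);
    writeln(DepthFirst(root).map!(n => n.name));
    writeln(selectByName(root, "c").length);
}
```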

--benji

Jul 30 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:

 Andrei Alexandrescu wrote:
 It would be great if you could contribute to Phobos. Two things I hope
 from any replacement (a) works with ranges and ideally outputs ranges,
 (b) uses alias functions instead of delegates if necessary.


There's really only one sane way to map XML parsing to ranges: pull
parsing, which is more or less already a range.  For those unfamiliar
with it, this is how you use Tango's pull parser right now:

auto pp = new PullParser!(char)(xmlSource);

for( auto tt = pp.next; tt != XmlTokenType.Done; tt = pp.next )
{
    switch( tt )
    {
        case XmlTokenType.Attribute: ... break;
        case XmlTokenType.CData: ... break;
        case XmlTokenType.Comment: ... break;
        case XmlTokenType.Data: ... break;
        ...
        case XmlTokenType.StartElement: ... break;
        default: assert(false, "wtf?");
    }
}

This would fairly naturally map to a range of parsing events and look
something like:

foreach( event ; new PullParser!(char)(xmlSource) )
{
    switch( event.type )
    {
        /* again with the cases */
    }
}
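
As a minimal sketch of what such a range of parsing events could look like, here is a toy input range over a tiny XML subset (elements and text only; no attributes, comments, or entities). It is not Tango's PullParser: `TokenType`, `Event`, `EventRange`, and `elementNames` are hypothetical names used only to show the `empty`/`front`/`popFront` shape.

```d
import std.stdio : writeln;
import std.string : indexOf;

enum TokenType { startElement, endElement, data }

struct Event
{
    TokenType type;
    string    text;  // slice of the input, no copying
}

// Toy pull parser exposed as an input range of events.
struct EventRange
{
    private string src;
    private Event  cur;
    private bool   done;

    this(string s) { src = s; popFront(); }

    @property bool  empty() const { return done; }
    @property Event front() const { return cur; }

    void popFront()
    {
        if (src.length == 0) { done = true; return; }
        if (src[0] == '<')
        {
            auto end = src.indexOf('>');
            if (end < 0) { done = true; return; }  // malformed input: stop
            immutable isEnd = end > 1 && src[1] == '/';
            cur = Event(isEnd ? TokenType.endElement : TokenType.startElement,
                        src[(isEnd ? 2 : 1) .. end]);
            src = src[end + 1 .. $];
        }
        else
        {
            auto lt = src.indexOf('<');
            immutable n = lt < 0 ? src.length : cast(size_t) lt;
            cur = Event(TokenType.data, src[0 .. n]);
            src = src[n .. $];
        }
    }
}

// Collects the names of all start tags by iterating the event range.
string[] elementNames(string xml)
{
    string[] names;
    foreach (ev; EventRange(xml))
        if (ev.type == TokenType.startElement)
            names ~= ev.text;
    return names;
}

void main()
{
    writeln(elementNames("<a><b>hi</b></a>"));
}
```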

Of course, most people HATE this method because it requires you to write
mountains of boilerplate code.  Pity, then, it's also the fastest and
most flexible.  :P  (It's a pity D doesn't have extension methods since
then you could probably do something along the lines of LINQ to make the
whole thing utterly painless... but then, I've given up on waiting for
that.)

This is basically the only way to map xml parsing to ranges.  As for
CONSUMING ranges, I think that'd be a bad idea for the same reason
basing IO entirely on ranges is a bad idea.

The only other use for ranges I can think of is one already mentioned by
Benji: traversal of a DOM.  Ranges don't apply to SAX because that's
what pull parsing is. :D

To Andrei: I sometimes worry that your... enthusiasm for ranges is going
to leave us with range-based APIs that don't make any sense or are
horribly slow (IO in particular has me worried).  But then, I suppose
that also makes you the perfect person to figure out where they CAN be used.

Plus, that way it's your fault if it doesn't work out.  :P

Jul 30 2009

zsxxsz <zhengshuxin hexun.com> writes:

== Quote from Daniel Keep (daniel.keep.lists gmail.com)'s article

 This is basically the only way to map xml parsing to ranges.  As for
 CONSUMING ranges, I think that'd be a bad idea for the same reason
 basing IO entirely on ranges is a bad idea.
 The only other use for ranges I can think of is one already mentioned by
 Benji: traversal of a DOM.  Ranges don't apply to SAX because that's
 what pull parsing is. :D

I agree. Network IO may block often, and then the program will pause
until some data arrives. And when I want to write a non-blocking HTTP
server running in a single thread, range IO will block it, so the other
sockets with data ready will be blocked as well.

Jul 30 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Daniel Keep wrote:
 
 Andrei Alexandrescu wrote:
 It would be great if you could contribute to Phobos. Two things I hope
 from any replacement (a) works with ranges and ideally outputs ranges,
 (b) uses alias functions instead of delegates if necessary.


 
 There's really only one sane way to map XML parsing to ranges: pull
 parsing, which is more or less already a range.  For those unfamiliar
 with it, this is how you use Tango's pull parser right now:

[snip]
 This would fairly naturally map to a range of parsing events and look
 something like:
 
 foreach( event ; new PullParser!(char)(xmlSource) )
 {
     switch( event.type )
     {
         /* again with the cases */
     }
 }

Looks great. The network I/O could run separately too, in a
consumer/producer fashion.

 Of course, most people HATE this method because it requires you to write
 mountains of boilerplate code.  Pity, then, it's also the fastest and
 most flexible.  :P  (It's a pity D doesn't have extension methods since
 then you could probably do something along the lines of LINQ to make the
 whole thing utterly painless... but then, I've given up on waiting for
 that.)
 
 This is basically the only way to map xml parsing to ranges.  As for
 CONSUMING ranges, I think that'd be a bad idea for the same reason
 basing IO entirely on ranges is a bad idea.

Interesting. Could you please give more details about this? Why is 
range-based I/O a bad idea, and what can we do to make it a better one?

And what's the way that avoids writing boilerplate code but is slower? 
Is that the method that calls virtual functions (or delegates) upon each 
element received?

 The only other use for ranges I can think of is one already mentioned by
 Benji: traversal of a DOM.  Ranges don't apply to SAX because that's
 what pull parsing is. :D

Yah, I was thinking of the DOM traversal too. Yum.

 To Andrei: I sometimes worry that your... enthusiasm for ranges is going
 to leave us with range-based APIs that don't make any sense or are
 horribly slow (IO in particular has me worried).  But then, I suppose
 that also makes you the perfect person to figure out where they CAN be used.
 
 Plus, that way it's your fault if it doesn't work out.  :P

I don't think you need to worry about my doing anything that's 
inherently slow. Performance is a big concern where I'm coming from (and 
also where I'm going to; incidentally all job offers I've been getting 
are in high-performance computing). As for excessive enthusiasm, that's 
always a risk but I'm sure I'll hear from you all if I lose my bearings.


Andrei

Jul 30 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:

Andrei Alexandrescu wrote:
 Daniel Keep wrote:
 ...


 Of course, most people HATE this method because it requires you to write
 mountains of boilerplate code.  Pity, then, it's also the fastest and
 most flexible.  :P  (It's a pity D doesn't have extension methods since
 then you could probably do something along the lines of LINQ to make the
 whole thing utterly painless... but then, I've given up on waiting for
 that.)

 This is basically the only way to map xml parsing to ranges.  As for
 CONSUMING ranges, I think that'd be a bad idea for the same reason
 basing IO entirely on ranges is a bad idea.

 
 Interesting. Could you please give more details about this? Why is
 range-based I/O a bad idea, and what can we do to make it a better one?

(A clarification: I *should* have said "...basing IO entirely on ranges
is -probably- a bad idea".)

<rambling>

My concern is the interface.

Let's take a hypothetical input range that reads from a file.  Since
we're parsing XML, we want it to be character data.  So the interface
might look something like:

struct Stream(T)
{
    T front();
    bool empty();
    void next();
}

(I realise I probably got at least one name wrong; I can't be bothered
digging up the exact names, and it's irrelevant anyway :P)

My concern is that front returns T: a single character.

I wrote an archival tool many, many years ago in VB.  It worked by
reading and writing a single byte at a time, and naturally performed
shockingly.  I knew there had to be a faster way since other programs
didn't crawl like mine was and discovered that reading/writing in larger
blocks gave significantly better performance. [1]

Much of the performance of Tango's IO system (and of the XML parsing
code, too) comes from operating on big arrays wherever it can.  Hell, the
pull parser is, as far as anyone is able to tell, faster than every
other XML parser in existence specifically because it reads the whole
file in one IO operation and then just deals with slices and array access.

That's one half of my worry with this: that the range interface
specifically precludes efficient batch operations.
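
One way around this, sketched here under the assumption that a range's element type may itself be a block, is to iterate over slices rather than single characters. The example uses `std.range.chunks` on an in-memory buffer; for files, Phobos's `std.stdio.File.byChunk` follows the same pattern. `countOpenBrackets` is a hypothetical helper for illustration.

```d
import std.range : chunks;
import std.stdio : writeln;

// Counts '<' bytes by walking the input block by block. Each front is a
// slice of up to blockSize bytes, so the range overhead is paid once per
// block instead of once per character.
size_t countOpenBrackets(const(ubyte)[] data, size_t blockSize = 4096)
{
    size_t total;
    foreach (block; data.chunks(blockSize))
        foreach (b; block)
            if (b == '<')
                ++total;
    return total;
}

void main()
{
    auto data = cast(const(ubyte)[]) "<a><b/></a>";
    writeln(countOpenBrackets(data, 4));
}
```

Whether this is enough for a parser is a separate question, since tokens can straddle block boundaries and the consumer has to handle that itself.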

Another, somewhat smaller concern, is that the range interface is
back-to-front for IO.

Consider a stream: you don't know if the stream is empty until you
attempt to read past the end of it.  Standard input does this, network
sockets do this... probably others.

But the range interface asks "is this empty?", which you can't answer
until you attempt to read from it.  So to implement .empty for a
hypothetical stdin range, you'd need to try reading past the current
location.  If you get a character, you've just modified the underlying
stream.

(Actually, this is more of a concern for me in any situation where
computing the next element of a range is an expensive operation, or an
operation with side-effects.  I had the same issue when attempting to
bind coroutines to the opApply interface.  You had to eagerly compute
the next value in order to answer the question: is there a next element?)
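
A minimal sketch of the read-ahead this implies: the wrapper below turns a "read one item, or report end" source into an input range by caching one element, and it defers the read until `empty` or `front` is first asked. `ReadAhead` and the `bool delegate(out T)` source signature are assumptions made for illustration, not an existing API.

```d
import std.stdio : writeln;

// Wraps a source that returns false at end of input as an input range.
struct ReadAhead(T)
{
    private bool delegate(out T) read;  // hypothetical source
    private T    slot;                  // the one cached element
    private bool primed, atEnd;

    this(bool delegate(out T) read) { this.read = read; }

    // Performs the (possibly expensive or side-effecting) read at most
    // once per element.
    private void prime()
    {
        if (primed) return;
        atEnd  = !read(slot);
        primed = true;
    }

    @property bool empty() { prime(); return atEnd; }
    @property T    front() { prime(); return slot; }
    void popFront()        { prime(); primed = false; }
}

void main()
{
    int[] data = [1, 2, 3];
    auto r = ReadAhead!int((out int v) {
        if (data.length == 0) return false;
        v = data[0];
        data = data[1 .. $];
        return true;
    });
    foreach (x; r)
        writeln(x);
}
```

The cost is exactly the one described above: asking `empty` consumes an element from the underlying source, so anything else reading from that source will no longer see it.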

Maybe these won't turn out to be problems in practice.  But my gut
feeling is that IO would be better served by a Tango-style interface
(putting the emphasis on efficient block transfers), with ranges
wrapping that if you're willing to maybe take a performance hit.

</rambling>

Just my exceedingly verbose AU$0.02.

 And what's the way that avoids writing boilerplate code but is slower?
 Is that the method that calls virtual functions (or delegates) upon each
 element received?

(Deleted lots of rambling)

The problem with calling a delegate for every element received is that
all the interfaces that do this suck.  SAX is the prime example of this.

Looking at stuff like Rx
(http://themechanicalbride.blogspot.com/2009/07/introducing-rx-linq-to-events.html),
I'm convinced there must be a way of doing it WELL.  I just don't know
what it is yet.

 ...


[1] I learned so much more back then when I had NO idea what I was
doing, and thus made lots of mistakes.  Sadly, I have a strong physical
aversion to making mistakes, so now I don't take risks.  And because I
know I know I don't like taking risks, I can't trick myself into taking
them.  Curse my endlessly recursive consciousness!

Jul 30 2009

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 31 Jul 2009 02:26:34 -0400, Daniel Keep  
<daniel.keep.lists gmail.com> wrote:

 Another, somewhat smaller concern, is that the range interface is
 back-to-front for IO.

 Consider a stream: you don't know if the stream is empty until you
 attempt to read past the end of it.  Standard input does this, network
 sockets do this... probably others.

 But the range interface asks "is this empty?", which you can't answer
 until you attempt to read from it.  So to implement .empty for a
 hypothetical stdin range, you'd need to try reading past the current
 location.  If you get a character, you've just modified the underlying
 stream.


These problems have been discussed before, I hope they can be solved.  I  
agree with you that streams do not fit the range interface particularly  
well, I think the best solution might be to NOT use ranges to implement  
streams, but allow applying ranges on top of streams for interfacing with  
other ranges.

Here is one past discussion you may find interesting:  
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=90971
This is the first posting, but towards the end is where I came to the  
realization that ranges and streams don't fit together perfectly.

-Steve

Jul 31 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Daniel Keep wrote:
 Andrei Alexandrescu wrote:
 Interesting. Could you please give more details about this? Why is
 range-based I/O a bad idea, and what can we do to make it a better one?

 
 (A clarification: I *should* have said "...basing IO entirely on ranges
 is -probably- a bad idea".)
 
 <rambling>
 
 My concern is the interface.
 
 Let's take a hypothetical input range that reads from a file.  Since
 we're parsing XML, we want it to be character data.  So the interface
 might look something like:
 
 struct Stream(T)
 {
     T front();
     bool empty();
     void next();
 }
 
 (I realise I probably got at least one name wrong; I can't be bothered
 digging up the exact names, and it's irrelevant anyway :P)

Yah, we had to choose popFront instead of the shorter next because there 
was no obvious corresponding "txen" to extract the last element.

 My concern is that front returns T: a single character.
 
 I wrote an archival tool many, many years ago in VB.  It worked by
 reading and writing a single byte at a time, and naturally performed
 shockingly.  I knew there had to be a faster way since other programs
 didn't crawl like mine was and discovered that reading/writing in larger
 blocks gave significantly better performance. [1]

I see, and I'm glad to dissipate this concern. There are three 
interfaces that Phobos will define: byChar, byLine, and byBlock. So you 
get to choose the transfer unit and transfer mechanism. (byLine allows 
you to choose the separator too.) Nowadays I use text files often so I 
use byLine. It's very rare that you want to process input one character 
at a time, and indeed it would suck if the infrastructure would insist 
that that's the unit of transfer.
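
(A minimal sketch of the "choose your transfer unit" idea, not the planned Phobos API. It assumes an in-memory array stands in for the file and uses std.range.chunks so that each front is a whole block rather than a single element. The function name sumByBlock is made up for illustration.)

```d
import std.range : chunks;

/// Sums bytes block by block: each element of the chunked range is a
/// slice of up to blockSize bytes, so the range primitives are called
/// once per block instead of once per byte.
ulong sumByBlock(const(ubyte)[] data, size_t blockSize)
{
    ulong total = 0;
    foreach (block; data.chunks(blockSize))
    {
        // `block` is a slice into `data`; no copy is made here.
        foreach (b; block)
            total += b;
    }
    return total;
}

unittest
{
    ubyte[] data = [1, 2, 3, 4, 5];
    assert(sumByBlock(data, 2) == 15);
}
```

The inner loop works on a plain array slice, which is the same shape of code a hand-written block reader would use.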

 Much of the performance of Tango's IO system (and from the XML parsing
 code, too) is that it operates on big arrays wherever it can.  Hell, the
 pull parser is, as far as anyone is able to tell, faster than every
 other XML parser in existence specifically because it reads the whole
 file in one IO operation and then just deals with slices and array access.

(That's great, but isn't the file sometimes a socket stream?)

I don't see this approach clashing with ranges because arrays are ranges 
so this setup is very natural to implement with ranges.

 That's one half of my worry with this: that the range interface
 specifically precludes efficient batch operations.

Hope this went away.

 Another, somewhat smaller concern, is that the range interface is
 back-to-front for IO.
 
 Consider a stream: you don't know if the stream is empty until you
 attempt to read past the end of it.  Standard input does this, network
 sockets do this... probably others.

 But the range interface asks "is this empty?", which you can't answer
 until you attempt to read from it.  So to implement .empty for a
 hypothetical stdin range, you'd need to try reading past the current
 location.  If you get a character, you've just modified the underlying
 stream.

Yah, however note that if you subsequently copy the range, the 
already-read front is also copied so there's no loss. Problems appear if 
you create e.g. two input ranges from the same FILE* or socket or whatnot.

Walter and I discussed this problem for a long time. I also discussed 
the problem in the newsgroup. I argued that the simplest and most 
natural interface for a pure input stream has only one function getNext 
which at the same time gets the element and bumps the stream. 
Unfortunately, since all forward ranges are also input ranges, that 
interface must also work well for all other ranges (e.g. arrays), in 
which case it would be contorted. We decided to define what we now have.

 (Actually, this is more of a concern for me in any situation where
 computing the next element of a range is an expensive operation, or an
 operation with side-effects.  I had the same issue when attempting to
 bind coroutines to the opApply interface.  You had to eagerly compute
 the next value in order to answer the question: is there a next element?)

Yah but you can always cache the result of the computation. The 
remaining annoyance is that the side effect occurs earlier than you'd 
expect.
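
(A rough sketch of the caching approach, under stated assumptions: the source is modeled as a hypothetical getNext delegate that writes the next element and returns false at end of stream, and the adapter name CachedInput is invented here. The read counter in the usage shows the side effect happening as soon as empty is asked.)

```d
import std.range.primitives : isInputRange;

/// Hypothetical adapter: exposes a getNext-style source through
/// empty/front/popFront by reading one element ahead and caching it.
struct CachedInput(T)
{
    private bool delegate(out T) getNext;
    private T cache;
    private bool primed;     // has the first read been attempted yet?
    private bool exhausted;  // did the last read hit end of stream?

    this(bool delegate(out T) getNext)
    {
        this.getNext = getNext;
    }

    // Performs the first read lazily, on first use of the range.
    private void prime()
    {
        if (!primed)
        {
            primed = true;
            exhausted = !getNext(cache);
        }
    }

    @property bool empty()
    {
        prime();
        return exhausted;
    }

    @property T front()
    {
        prime();
        assert(!exhausted);
        return cache;
    }

    void popFront()
    {
        prime();
        assert(!exhausted);
        exhausted = !getNext(cache);
    }
}

static assert(isInputRange!(CachedInput!char));

unittest
{
    string data = "ab";
    size_t pos = 0;
    int reads = 0;
    auto r = CachedInput!char((out char c) {
        ++reads;
        if (pos == data.length) return false;
        c = data[pos++];
        return true;
    });
    assert(reads == 0);   // nothing is read at construction
    assert(!r.empty);
    assert(reads == 1);   // asking "empty?" already consumed one element
    assert(r.front == 'a');
}
```

Copying the struct copies the cached element too, but two adapters built over the same underlying source would still steal elements from each other.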

 Maybe these won't turn out to be problems in practice.  But my gut
 feeling is that IO would be better served by a Tango-style interface
 (putting the emphasis on efficient block transfers), with ranges
 wrapping that if you're willing to maybe take a performance hit.

I think we can do better by defining a general interface that will work 
for arrays as well as if hand-written.

The MSB is that ranges and block transfer are not at all in conflict.



Andrei

Jul 31 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-07-30 22:42:29 -0400, Benji Smith <dlanguage benjismith.net> said:

 Michael Rynn wrote:
 I did look at the code for the xml module, and posted a suggested bug
 fix to the empty elements problem. I do not have access rights to
 updating the source repository, and at the time was too busy for this.


 
 Andrei Alexandrescu wrote:
 It would be great if you could contribute to Phobos. Two things I hope 
 from any replacement (a) works with ranges and ideally outputs ranges, 
 (b) uses alias functions instead of delegates if necessary.

 
 Interesting. Most XML parsers either produce a "Document" object, or 
 they just execute SAX callbacks. If an XML parser returned a range 
 object, how would you use it?
 
 Usually, I use something like XPath to extract information from an XML 
 doc. Something like this:
 
     auto doc = parser.parse(xml);
     auto nodes = doc.select("/root//whatever[0][@id]");
 
 I can see how you might do depth-first or breadth-first traversal of 
 the DOM tree, or inorder traversal of the SAX events, with a range. But 
 that's not how most people use XML. Are there other range tricks up 
 your sleeve that would support a DOM or XPath kind of model?

A range is mostly a list of things. In the example above, doc.select 
could return a range to lazily evaluate the query instead of computing 
the whole query and returning all the elements. This way, if you only 
care about the first result you just take the first and don't have to 
compute them all.
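
(A small sketch of a lazily evaluated query, assuming a flat array of nodes in place of a real document. Node and selectByName are hypothetical stand-ins for the doc.select call above; std.algorithm's filter does the lazy part.)

```d
import std.algorithm.iteration : filter;

/// Hypothetical, heavily simplified node type.
struct Node
{
    string name;
    string id;
}

/// Stand-in for a query: returns a lazy range over the matching nodes.
/// Nodes past the one currently at the front are not examined until
/// the caller advances the range.
auto selectByName(Node[] nodes, string name)
{
    return nodes.filter!(n => n.name == name);
}

unittest
{
    auto nodes = [Node("item", "a"), Node("other", "b"), Node("item", "c")];
    auto hits = selectByName(nodes, "item");
    assert(hits.front.id == "a"); // only the first match is needed here
}
```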

Ranges can be used everywhere there are lists, and are especially 
useful for lazy lists that compute things as you go. I made an XML 
tokenizer (similar to Tango's pull parser) with a range API. Basically, 
you iterate over various kinds of token made available through an 
Algebraic, and as you advance it parses the document to get you the 
next token. (It'd be more useful if you could switch on various kinds 
of tokens with an Algebraic -- right now you need to use "if 
(token.peek!OpenElementToken)" -- but that's a problem with Algebraic 
that should get fixed I believe, or else I'll have to use something 
else.)
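
(For illustration only, a sketch of the peek-based dispatch described above using std.variant's Algebraic. OpenElementToken is the name from the post; the other token types and the describe function are assumptions made up for the example, not the actual tokenizer API.)

```d
import std.variant : Algebraic;

// Hypothetical token types; only OpenElementToken is named in the post.
struct OpenElementToken  { string name; }
struct CloseElementToken { string name; }
struct TextToken         { string text; }

alias Token = Algebraic!(OpenElementToken, CloseElementToken, TextToken);

/// Dispatches on the token kind with a chain of peek calls.
/// peek!T returns a pointer to the stored value, or null if the
/// Algebraic currently holds a different type.
string describe(Token token)
{
    if (auto open = token.peek!OpenElementToken)
        return "open " ~ open.name;
    if (auto close = token.peek!CloseElementToken)
        return "close " ~ close.name;
    if (auto text = token.peek!TextToken)
        return "text " ~ text.text;
    return "unknown";
}

unittest
{
    assert(describe(Token(OpenElementToken("root"))) == "open root");
}
```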

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jul 31 2009

Benji Smith <dlanguage benjismith.net> writes:

Michel Fortin wrote:
 Benji Smith wrote:
 Usually, I use something like XPath to extract information from an XML 
 doc. Something like this:

     auto doc = parser.parse(xml);
     auto nodes = doc.select("/root//whatever[0][@id]");

 I can see how you might do depth-first or breadth-first traversal of 
 the DOM tree, or inorder traversal of the SAX events, with a range. 
 But that's not how most people use XML. Are there other range 
 tricks up your sleeve that would support a DOM or XPath kind of 
 model?

 
 A range is mostly a list of things. In the example above, doc.select 
 could return a range to lazily evaluate the query instead of computing 
 the whole query and returning all the elements. This way, if you only 
 care about the first result you just take the first and don't have to 
 compute them all.
 
 Ranges can be used everywhere there are lists, and are especially 
 useful for lazy lists that compute things as you go. I made an XML 
 tokenizer (similar to Tango's pull parser) with a range API. Basically, 
 you iterate over various kinds of token made available through an 
 Algebraic, and as you advance it parses the document to get you the next 
 token. (It'd be more useful if you could switch on various kinds of 
 tokens with an Algebraic -- right now you need to use "if 
 (token.peek!OpenElementToken)" -- but that's a problem with Algebraic 
 that should get fixed I believe, or else I'll have to use something else.)

But XML documents aren't really lists. They're trees.

Do ranges provide an abstraction for working with trees (other than the 
obvious flattening algorithms, like breadth-first or depth-first traversal)?

--benji

Jul 31 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-08-01 00:04:01 -0400, Benji Smith <dlanguage benjismith.net> said:

 But XML documents aren't really lists. They're trees.
 
 Do ranges provide an abstraction for working with trees (other than the 
 obvious flattening algorithms, like breadth-first or depth-first 
 traversal)?

Well, it depends at what level you look. An XML document you read is 
first a list of bytes, then a list of Unicode characters, then you 
convert those characters to a list of tokens -- the Tango pull-parser 
sees each tag and each attribute as a token, SAX defines each tag 
(including attributes) as a token and calls it an event -- and from 
that list of tokens you can construct a tree.

The tree isn't a list though, and a range is a unidimensional list of 
something. You need another interface to work with the tree.

But then, from the tree, create a list in one way or another 
(flattening, or performing an XPath query for instance) and then you 
can have a range representing the list of subtrees for the query if you 
want. That's pretty good since with a range you can lazily iterate over 
the results.
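
(A minimal sketch of the flattening case, assuming a toy Element class rather than any real DOM. DepthFirst is a hypothetical input range that walks the tree in pre-order with an explicit stack, producing nodes only as the caller advances.)

```d
import std.range.primitives : isInputRange;

/// Hypothetical, minimal tree node.
class Element
{
    string name;
    Element[] children;

    this(string name, Element[] children = null)
    {
        this.name = name;
        this.children = children;
    }
}

/// Lazy pre-order (depth-first) traversal as an input range.
struct DepthFirst
{
    private Element[] stack; // nodes still to visit; top is the last entry

    this(Element root)
    {
        if (root !is null)
            stack = [root];
    }

    @property bool empty() const { return stack.length == 0; }

    @property Element front() { return stack[$ - 1]; }

    void popFront()
    {
        auto node = stack[$ - 1];
        stack = stack[0 .. $ - 1];
        // Push children in reverse so the first child is visited first.
        foreach_reverse (child; node.children)
            stack ~= child;
    }
}

static assert(isInputRange!DepthFirst);

unittest
{
    auto tree = new Element("root",
        [new Element("a", [new Element("c")]), new Element("b")]);
    string[] names;
    foreach (e; DepthFirst(tree))
        names ~= e.name;
    assert(names == ["root", "a", "c", "b"]);
}
```

The tree itself still needs its own interface (children, parent, attributes); the range only covers the one-dimensional walk over it.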


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Aug 01 2009

Benji Smith <dlanguage benjismith.net> writes:

Michel Fortin wrote:
 On 2009-08-01 00:04:01 -0400, Benji Smith <dlanguage benjismith.net> said:
 
 But XML documents aren't really lists. They're trees.

 Do ranges provide an abstraction for working with trees (other than 
 the obvious flattening algorithms, like breadth-first or depth-first 
 traversal)?

 
 Well, it depends at what level you look. An XML document you read is 
 first a list of bytes, then a list of Unicode characters, then you 
 convert those characters to a list of tokens -- the Tango pull-parser 
 sees each tag and each attribute as a token, SAX defines each tag 
 (including attributes) as a token and calls it an event -- and from that 
 list of tokens you can construct a tree.
 
 The tree isn't a list though, and a range is a unidimensional list of 
 something. You need another interface to work with the tree.
 
 But then, from the tree, create a list in one way or another 
 (flattening, or performing an XPath query for instance) and then you can 
 have a range representing the list of subtrees for the query if you 
 want. That's pretty good since with a range you can lazily iterate over 
 the results.

Oh sure. I agree that a range-based way of iterating over tokens is 
cool. And a range-based API for walking through the results of an XPath 
query would be great. But the real meat and potatoes of an XML API would 
need to be something more DOM-like, with a tree structure.

The only reason I chimed in, in the first place, was Andrei's post 
saying that a replacement XML parser "ideally outputs ranges".

I don't think that's right. Ideally, an XML parser outputs a tree structure.

Though a range-based mechanism for traversing that tree would be nice too.

--benji

Aug 01 2009

Michael Rynn <michaelrynn optushome.com.au> writes:

On Sun, 02 Aug 2009 00:25:20 -0400, Benji Smith
<dlanguage benjismith.net> wrote:


An interface using D ranges for the parser?  First you get the parser.

There is a lot of prior art, especially java based open source.
For instance see http://www.xmlpull.org/history/index.html.

Existing XML parsers vary and trade off various features: high-level
interface flexibility vs speed vs memory vs validation against schemas
and DTD vs random access vs single pass.

There are different  flavours of XML parsers. 
They are like sets of spanners and tools of varying shapes and
capacities, to match up to the job criteria.

All methods have to parse through the XML (until completion or search
criteria satisfied ) and ensure valid XML, UTF conversion, Entity
translation.  

Then there is namespace support, and now 2 versions of XPath
documented by w3c  mixed up with XQuery.

It would be nice  to have well defined interfaces for  DOM, SAX and
PULL parsers which share some of the base parsing code. The DOM can be
partial,  as node sets returned from XPath query. Nice how the phobos
parser can make a full DOM or just the bits required.

The current Tango and Phobos parsers are interesting in having their
own special D personality and features, and are reasonably self
contained.  

It's so nice to have choice, and it would be nice to have XML parsers of
varying features that are usable with both standard libraries and
both D1 and D2, and have adequate documentation.   But to achieve just
1 version would be a start.

I would hope for a few different core parser interfaces in
different modules, and to compare versioned features on them.

So the idea of 1 standard xml parser is a bit limiting.
There is still a need to continue to support and enhance existing
std.xml , even if a more compelling candidate emerges, to replace or
to add in parallel.

Who is using what for XML parsing in native D now?

Aug 04 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-08-04 10:01:51 -0400, Michael Rynn <michaelrynn optushome.com.au> said:

 It would be nice  to have well defined interfaces for  DOM, SAX and
 PULL parsers which share some of the base parsing code. The DOM can be
 partial,  as node sets returned from XPath query. Nice how the phobos
 parser can make a full DOM or just the bits required.

Exactly what I've been working on:

Tokenizer part: http://michelf.com/docs/d/mfr/xmltok.html
DOM part:       http://michelf.com/docs/d/mfr/xml.html

Note that it's still a work in progress. Here are some things I'd like to do:

tokenizer: add specialized exception classes to better report various 
problems, add better checks for valid characters (should be optional), 
better support for ranges (currently only string because I rely on 
"a.before(b)" to avoid dynamic allocation), also add support for the 
internal subset in the doctype (but that's low priority).

Writer: replace by a simple template function and a toString function 
defined for each token type? or a writeTo function (to avoid creating an 
intermediary string)?

XMLForwardRange: allow a template parameter specifying the token types 
you want to see, skipping all others. This could be done by passing a 
custom Algebraic type instead of the provided one that can contain all 
tokens.

DOM classes: it's mostly experimental for now.

There's no SAX yet, although it should be trivial to add over the 
existing callback tokenizer.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Aug 04 2009
