digitalmars.D - Phobos Proposal: replace std.xml with kxml.

Bernard Helyer <b.helyer gmail.com> writes:

When I first started using D, one of the things I needed quite early on 
was a way of writing and reading XML. Naturally, when I saw std.xml in 
Phobos, I was quite pleased.

That was, of course, until I started to use it.

http://d.puremagic.com/issues/show_bug.cgi?id=3088
http://d.puremagic.com/issues/show_bug.cgi?id=4069
http://d.puremagic.com/issues/show_bug.cgi?id=3201

I vented my frustration on IRC, where opticron mentioned he had an XML 
library of his own. I find it superior to std.xml, especially 
considering how it actually works, and is maintained.

http://opticron.no-ip.org/svn/branches/kxml/

It is already under the Boost License, and opticron has said

"<opticron> ... if they really want to snag mine and clean it up for use 
in phobos, that's fine
<opticron> I'd even relicense the code if I have to"

I'm going to keep on using kxml regardless, but I thought it would be 
nice if Phobos had a working XML library. What say you?



-Bernard.

May 03 2010

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Tue, May 04, 2010 at 09:18:46AM +1200, Bernard Helyer wrote:
 I'm going to keep on using kxml regardless, but I thought it would be 
 nice if Phobos had a working XML library. What say you?

I've got something almost ready to be thrown into the ring too, my
DOM lib.: http://arsdnet.net/dcode/dom.d

My initial goal here was to mimic Javascript in the browser, but it has
since grown to have a bunch of extensions too.

An important aspect of mimicking js is I don't really care about the xml
standard; it tries to figure out whatever ugly crap you throw its way
and makes a few assumptions for html. But, it can be used for generic xml
stuff too.


-- 
Adam D. Ruppe
http://arsdnet.net

May 03 2010

BCS <none anon.com> writes:

Hello Adam,

 An important aspect of mimicking js is I don't really care about the xml
 standard; it tries to figure out whatever ugly crap you throw its way
 and makes a few assumptions for html. But, it can be used for generic xml
 stuff too.

That can be handy but can also lead to problems.

-- 
... <IXOYE><

May 04 2010

Graham Fawcett <fawcett uwindsor.ca> writes:

On Tue, 04 May 2010 09:18:46 +1200, Bernard Helyer wrote:

 When I first started using D, one of the things I needed quite early on
 was a way of writing and reading XML. Naturally, when I saw std.xml in
 Phobos, I was quite pleased.
 
 That was of course, until I started to use it.
 
 http://d.puremagic.com/issues/show_bug.cgi?id=3088
 http://d.puremagic.com/issues/show_bug.cgi?id=4069
 http://d.puremagic.com/issues/show_bug.cgi?id=3201
 
 I vented my frustration on IRC, where opticron mentioned he had an XML
 library of his own. I find it superior to std.xml, especially
 considering how it actually works, and is maintained.
 
 http://opticron.no-ip.org/svn/branches/kxml/
 
 It is already under the Boost License, and opticron has said
 
 "<opticron> ... if they really want to snag mine and clean it up for use
 in phobos, that's fine
 <opticron> I'd even relicense the code if I have to"
 
 I'm going to keep on using kxml regardless, but I thought it would be
 nice if Phobos had a working XML library. What say you?

I haven't looked at kxml -- but why not just wrap libxml2? It's widely 
regarded as a fast, stable, portable and *correct* XML library. I wrote a 
partial libxml2 wrapper (mostly the tree.h stuff, and some libxslt) in 
under an hour as a learning exercise; someone with real D chops could 
turn out a polished interface in short time.

The fact that libxml2/libxslt support not only XML parsing and DOM 
building, but also XSLT, XPath, XPointer, XInclude, RelaxNG, etc., means 
that any homegrown library will be hard-pressed to cover the same range 
of tools and features.

There are too many half-baked XML libraries in the world. No disrespect 
intended to opticron or anyone else; it just doesn't make a lot of sense 
to reinvent such a complex wheel (and believing that XML processing isn't 
complex is a sure sign that your homegrown library's design is 
incomplete!).

Graham

May 03 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Graham Fawcett wrote:
 On Tue, 04 May 2010 09:18:46 +1200, Bernard Helyer wrote:
 
 When I first started using D, one of the things I needed quite early on
 was a way of writing and reading XML. Naturally, when I saw std.xml in
 Phobos, I was quite pleased.

 That was of course, until I started to use it.

 http://d.puremagic.com/issues/show_bug.cgi?id=3088
 http://d.puremagic.com/issues/show_bug.cgi?id=4069
 http://d.puremagic.com/issues/show_bug.cgi?id=3201

 I vented my frustration on IRC, where opticron mentioned he had an XML
 library of his own. I find it superior to std.xml, especially
 considering how it actually works, and is maintained.

 http://opticron.no-ip.org/svn/branches/kxml/

 It is already under the Boost License, and opticron has said

 "<opticron> ... if they really want to snag mine and clean it up for use
 in phobos, that's fine
 <opticron> I'd even relicense the code if I have to"

 I'm going to keep on using kxml regardless, but I thought it would be
 nice if Phobos had a working XML library. What say you?

 
 I haven't looked at kxml -- but why not just wrap libxml2? It's widely 
 regarded as a fast, stable, portable and *correct* XML library. I wrote a 
 partial libxml2 wrapper (mostly the tree.h stuff, and some libxslt) in 
 under an hour as a learning exercise; someone with real D chops could 
 turn out a polished interface in short time.
 
 The fact that libxml2/libxslt support not only XML parsing and DOM 
 building, but also XSLT, XPath, XPointer, XInclude, RelaxNG, etc., means 
 that any homegrown library will be hard-pressed to cover the same range 
 of tools and features.
 
 There are too many half-baked XML libraries in the world. No disrespect 
 intended to opticron or anyone else; it just doesn't make a lot of sense 
 to reinvent such a complex wheel (and believing that XML processing isn't 
 complex is a sure sign that your homegrown library's design is 
 incomplete!).
 
 Graham

I think what we need for the standard library is to take a solid XML 
library licensed generously and adapt it to work with arbitrary ranges.

Andrei

May 03 2010

Bernard Helyer <b.helyer gmail.com> writes:

On 04/05/10 11:01, Andrei Alexandrescu wrote:
 I think what we need for the standard library is to take a solid XML
 library licensed generously and adapt it to work with arbitrary ranges.

 Andrei

Care to give an example?

May 03 2010

Ellery Newcomer <ellery-newcomer utulsa.edu> writes:

On 05/03/2010 06:24 PM, Bernard Helyer wrote:
 On 04/05/10 11:01, Andrei Alexandrescu wrote:
 I think what we need for the standard library is to take a solid XML
 library licensed generously and adapt it to work with arbitrary ranges.

 Andrei

 Care to give an example?

I was curious about this too.

When I looked around, I saw

TinyXML (zlib)
POCO xml (boost)

but I've never used either and couldn't say whether either is solid.

May 03 2010

Richard Webb <webby beardmouse.co.uk> writes:

RapidXML also uses the Boost license (it's included as part of the Boost
PropertyTree library).
I haven't used it though, so I can't say how it compares to the others.

May 04 2010

Graham Fawcett <fawcett uwindsor.ca> writes:

On Mon, 03 May 2010 16:01:30 -0700, Andrei Alexandrescu wrote:

 Graham Fawcett wrote:
 The fact that libxml2/libxslt support not only XML parsing and DOM
 building, but also XSLT, XPath, XPointer, XInclude, RelaxNG, etc.,
 means that any homegrown library will be hard-pressed to cover the same
 range of tools and features.
 
 There are too many half-baked XML libraries in the world. No disrespect
 intended to opticron or anyone else; it just doesn't make a lot of
 sense to reinvent such a complex wheel (and believing that XML
 processing isn't complex is a sure sign that your homegrown library's
 design is incomplete!).
 
 Graham

 
 I think what we need for the standard library is to take a solid XML
 library licensed generously and adapt it to work with arbitrary ranges.

By "adapt" do you mean writing a wrapper for an existing library, or 
translating the source code of the library into D? 

What constitutes a "generous license" in this context? (For what it's 
worth, libxml2 is under the MIT License.)

Graham

May 04 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Graham Fawcett wrote:
 On Mon, 03 May 2010 16:01:30 -0700, Andrei Alexandrescu wrote:
 
 Graham Fawcett wrote:
 The fact that libxml2/libxslt support not only XML parsing and DOM
 building, but also XSLT, XPath, XPointer, XInclude, RelaxNG, etc.,
 means that any homegrown library will be hard-pressed to cover the same
 range of tools and features.

 There are too many half-baked XML libraries in the world. No disrespect
 intended to opticron or anyone else; it just doesn't make a lot of
 sense to reinvent such a complex wheel (and believing that XML
 processing isn't complex is a sure sign that your homegrown library's
 design is incomplete!).

 Graham

 I think what we need for the standard library is to take a solid XML
 library licensed generously and adapt it to work with arbitrary ranges.

 
 By "adapt" do you mean writing a wrapper for an existing library, or 
 translating the source code of the library into D? 
 
 What constitutes a "generous license" in this context? (For what it's 
 worth, libxml2 is under the MIT License.)
 
 Graham

We'd need to modify the code. I haven't looked into available xml 
libraries so I don't know which would be eligible.

Andrei

May 04 2010

Graham Fawcett <fawcett uwindsor.ca> writes:

On Tue, 04 May 2010 09:09:29 -0700, Andrei Alexandrescu wrote:

 Graham Fawcett wrote:
 On Mon, 03 May 2010 16:01:30 -0700, Andrei Alexandrescu wrote:
 
 Graham Fawcett wrote:
 The fact that libxml2/libxslt support not only XML parsing and DOM
 building, but also XSLT, XPath, XPointer, XInclude, RelaxNG, etc.,
 means that any homegrown library will be hard-pressed to cover the
 same range of tools and features.

 There are too many half-baked XML libraries in the world. No
 disrespect intended to opticron or anyone else; it just doesn't make
 a lot of sense to reinvent such a complex wheel (and believing that
 XML processing isn't complex is a sure sign that your homegrown
 library's design is incomplete!).

 Graham

 I think what we need for the standard library is to take a solid XML
 library licensed generously and adapt it to work with arbitrary
 ranges.

 
 By "adapt" do you mean writing a wrapper for an existing library, or
 translating the source code of the library into D?
 
 What constitutes a "generous license" in this context? (For what it's
 worth, libxml2 is under the MIT License.)
 
 Graham

 
 We'd need to modify the code. I haven't looked into available xml
 libraries so I don't know which would be eligible.

I think I understand your motivations: this is the standard library, and
so you want to minimize dependencies. But from a maintenance
perspective, it seems a bad idea to translate a complex library into D
code that few people will actively maintain -- whereas writing a
wrapper (and introducing a library dependency) would keep the codebase
small, let you share maintenance costs with the third-party library's
developers, and (arguably) increase the stability and quality of the
stdlib?

I am not pushing for libxml2 as The Answer. I'm just questioning the
motivation to translate other people's code to D, when the D platform
excels at library integration. (Although I agree with your suggestion
to borrow inspiration/code from Boost for datetime and other features;
that's different, since Boost cannot feasibly be wrapped.)

Best,
Graham

May 04 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Graham Fawcett wrote:
 On Tue, 04 May 2010 09:09:29 -0700, Andrei Alexandrescu wrote:
 
 Graham Fawcett wrote:
 On Mon, 03 May 2010 16:01:30 -0700, Andrei Alexandrescu wrote:

 Graham Fawcett wrote:
 The fact that libxml2/libxslt support not only XML parsing and DOM
 building, but also XSLT, XPath, XPointer, XInclude, RelaxNG, etc.,
 means that any homegrown library will be hard-pressed to cover the
 same range of tools and features.

 There are too many half-baked XML libraries in the world. No
 disrespect intended to opticron or anyone else; it just doesn't make
 a lot of sense to reinvent such a complex wheel (and believing that
 XML processing isn't complex is a sure sign that your homegrown
 library's design is incomplete!).

 Graham

 I think what we need for the standard library is to take a solid XML
 library licensed generously and adapt it to work with arbitrary
 ranges.

 By "adapt" do you mean writing a wrapper for an existing library, or
 translating the source code of the library into D?

 What constitutes a "generous license" in this context? (For what it's
 worth, libxml2 is under the MIT License.)

 Graham

 We'd need to modify the code. I haven't looked into available xml
 libraries so I don't know which would be eligible.

 
 I think I understand your motivations: this is standard library, and
 so you want to minimize dependencies. But from a maintenance
 perspective, it seems a bad idea to translate a complex library into D
 code that few people will actively maintain -- whereas writing a
 wrapper (and introducing a library dependency) would keep the codebase
 small, let you share maintenance costs with the third-party library's
 developers, and (arguably) increase the stability and quality of the
 stdlib?
 
 I am not pushing for libxml2 as The Answer. I'm just questioning the
 motivation to translate other people's code to D, when the D platform
 excels at library integration. (Although I agree with your suggestion
 to borrow inspiration/code from Boost for datetime and other features;
 that's different, since Boost cannot feasibly be wrapped.)
 
 Best,
 Graham

My concern is purely technical - a library we just link to would force a 
number of choices, such as input representation (e.g. arrays of char). 
Ideally we should be able to change the library to accept any compatible 
range of any compatible characters.

As a simple example, consider std.algorithm.levenshteinDistance. There 
are plenty of good implementations and initially I just wrote one almost 
identical to the Web lore. However, later I needed to compute 
Levenshtein distances between strings stored in lists (tries, actually). 
Well that doesn't work because the implementation at that time used 
random access s[i] and t[i] all over the place. But it wasn't difficult 
to change the algorithm to work with forward ranges. So now we have one 
of the few Levenshtein distance implementations that work with other 
inputs than arrays. In particular, we work correctly with UTF inputs 
without needing to copy the input, something that I haven't seen 
anywhere else. If you google for ``levenshtein utf'' Google will even 
think the query has a typo. Search results include an OCaml 
implementation that copies the input 
(http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#OCaml) 
and a Ruby implementation that also copies the input 
(http://rubyforge.org/frs/?group_id=2080&release_id=7389). By using the 
range abstraction, we get to support UTF Levenshtein without significant 
additional implementation effort - the code is very similar to the one 
using indices throughout.
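
A minimal sketch of the behaviour described above, assuming a current Phobos in which the function lives in `std.algorithm.comparison` (at the time of this thread it was reached through `std.algorithm`). The `notSpace` predicate is a hypothetical helper, used only to build a lazy range that is not an array.

```d
import std.algorithm.comparison : levenshteinDistance;
import std.algorithm.iteration : filter;

// Hypothetical predicate, only here to produce a lazy forward range.
bool notSpace(dchar c) { return c != ' '; }

unittest
{
    // Plain arrays of char: one substitution.
    assert(levenshteinDistance("cat", "rat") == 1);

    // UTF-8 input is walked by code point, without copying:
    // "ï" takes two bytes but counts as a single edit.
    assert(levenshteinDistance("naïve", "naive") == 1);

    // Lazy filtered ranges are forward ranges, not arrays, and still work.
    auto a = "k i t t e n".filter!notSpace;
    auto b = "s i t t i n g".filter!notSpace;
    assert(levenshteinDistance(a, b) == 3);
}
```

The checks can be run with something like `dmd -unittest -main -run sketch.d`, where the file name is arbitrary.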



Andrei

May 04 2010

Graham Fawcett <fawcett uwindsor.ca> writes:

On Tue, 04 May 2010 11:56:31 -0700, Andrei Alexandrescu wrote:

 Graham Fawcett wrote:
 On Tue, 04 May 2010 09:09:29 -0700, Andrei Alexandrescu wrote:
 
 Graham Fawcett wrote:
 On Mon, 03 May 2010 16:01:30 -0700, Andrei Alexandrescu wrote:

 Graham Fawcett wrote:
 The fact that libxml2/libxslt support not only XML parsing and DOM
 building, but also XSLT, XPath, XPointer, XInclude, RelaxNG, etc.,
 means that any homegrown library will be hard-pressed to cover the
 same range of tools and features.

 There are too many half-baked XML libraries in the world. No
 disrespect intended to opticron or anyone else; it just doesn't
 make a lot of sense to reinvent such a complex wheel (and believing
 that XML processing isn't complex is a sure sign that your
 homegrown library's design is incomplete!).

 Graham

 I think what we need for the standard library is to take a solid XML
 library licensed generously and adapt it to work with arbitrary
 ranges.

 By "adapt" do you mean writing a wrapper for an existing library, or
 translating the source code of the library into D?

 What constitutes a "generous license" in this context? (For what it's
 worth, libxml2 is under the MIT License.)

 Graham

 We'd need to modify the code. I haven't looked into available xml
 libraries so I don't know which would be eligible.

 
 I think I understand your motivations: this is standard library, and so
 you want to minimize dependencies. But from a maintenance perspective,
 it seems a bad idea to translate a complex library into D code that few
 people will actively maintain -- whereas writing a wrapper (and
 introducing a library dependency) would keep the codebase small, let
 you share maintenance costs with the third-party library's developers,
 and (arguably) increase the stability and quality of the stdlib?
 
 I am not pushing for libxml2 as The Answer. I'm just questioning the
 motivation to translate other people's code to D, when the D platform
 excels at library integration. (Although I agree with your suggestion
 to borrow inspiration/code from Boost for datetime and other features;
 that's different, since Boost cannot feasibly be wrapped.)
 
 Best,
 Graham

 
 My concern is purely technical - a library we just link to would force a
 number of choices, such as input representation (e.g. arrays of char).
 Ideally we should be able to change the library to accept any compatible
 range of any compatible characters.
 
 As a simple example, consider std.algorithm.levenshteinDistance. There
 are plenty of good implementations and initially I just wrote one almost
 identical to the Web lore. However, later I needed to compute
 Levenshtein distances between strings stored in lists (tries, actually).
 Well that doesn't work because the implementation at that time used
 random access s[i] and t[i] all over the place. But it wasn't difficult
 to change the algorithm to work with forward ranges. So now we have one
 of the few Levenshtein distance implementations that work with other
 inputs than arrays. In particular, we work correctly with UTF inputs
 without needing to copy the input, something that I haven't seen
 anywhere else. If you google for ``levenshtein utf'' Google will even
 think the query has a typo. Search results include an OCaml
 implementation that copies the input
 (http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#OCaml)
 and a Ruby implementation that also copies the input
 (http://rubyforge.org/frs/?group_id=2080&release_id=7389). By using the
 range abstraction, we get to support UTF Levenshtein without significant
 additional implementation effort - the code is very similar to the one
 using indices throughout.

That's a strong argument -- thank you for taking the time to respond.

Regards,
Graham

May 04 2010

Michel Fortin <michel.fortin michelf.com> writes:

On 2010-05-04 12:09:29 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Graham Fawcett wrote:
 By "adapt" do you mean writing a wrapper for an existing library, or 
 translating the source code of the library into D?
 What constitutes a "generous license" in this context? (For what it's 
 worth, libxml2 is under the MIT License.)
 
 Graham

 
 We'd need to modify the code. I haven't looked into available xml 
 libraries so I don't know which would be eligible.

I think if you wanted to port an XML library to make use of ranges, the 
only viable option is probably to find one based on C++ iterators. 
Otherwise it'll look more like a rewrite than a port, and at this point 
why not write one from scratch?

Anyway, just in case, would you be interested in an XML tokenizer and 
simple DOM following this model?

	http://michelf.com/docs/d/mfr/xmltok.html
	http://michelf.com/docs/d/mfr/xml.html

At the base is a pull parser and an event parser mixed in the same 
function template: "tokenize", allowing you to alternate between 
event-based and pull-parsing at will. I'm using it, but its development 
is on hold at this time; I'm just maintaining it so it compiles on the 
newest versions of DMD.

The only thing it doesn't parse at this time is inline DTDs inside the doctype.

Also, it currently works only with strings, for simplicity and 
performance. There is one issue about non-string parsing: when parsing 
a string, it's easy to just slice the string and move it around, but if 
you're parsing from a generic input range, you basically have to copy 
characters one by one, which is much less efficient. So ideally the 
algorithm should use slices whenever it can (when the input is a 
string).

I'm not sure yet how to attack this problem, but I'm thinking that 
perhaps parsing primitives should be "part of" the range interface. I 
say this in the sense that a range should provide a specialized 
implementation of a primitive when it can implement it more efficiently 
(like by slicing). You wrote a while ago about designing parsing 
primitives; is this part of Phobos now?

Anyway, the problem above is probably the one reason we might want to 
write the parser from scratch: it needs to bind to specializable 
higher-level parsing functions to take advantage of the performance 
characteristics of certain ranges, such as those you can slice.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

May 04 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Michel Fortin wrote:
 On 2010-05-04 12:09:29 -0400, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> said:
 
 Graham Fawcett wrote:
 By "adapt" do you mean writing a wrapper for an existing library, or 
 translating the source code of the library into D?
 What constitutes a "generous license" in this context? (For what it's 
 worth, libxml2 is under the MIT License.)

 Graham

 We'd need to modify the code. I haven't looked into available xml 
 libraries so I don't know which would be eligible.

 
 I think if you wanted to port an XML library to make use of ranges, the 
 only viable option is probably to find one based on C++ iterators. 
 Otherwise it'll look more like a rewrite than a port, and at this point 
 why not write one from scratch?

Design is also a considerable time expense, though I agree that use of 
ranges may actually improve the design too.

 Anyway, just in case, would you be interested in an XML tokenizer and 
 simple DOM following this model?
 
     http://michelf.com/docs/d/mfr/xmltok.html
     http://michelf.com/docs/d/mfr/xml.html
 
 At the base is a pull parser and an event parser mixed in the same 
 function template: "tokenize", allowing you to alternate between 
 event-based and pull-parsing at will. I'm using it, but its development 
 is on hold at this time; I'm just maintaining it so it compiles on the 
 newest versions of DMD.

Sounds great, but I need to defer XML expertise to others.

 The only thing it doesn't parse at this time is inline DTDs inside the 
 doctype.
 
 Also, it currently works only with strings, for simplicity and 
 performance. There is one issue about non-string parsing: when parsing a 
 string, it's easy to just slice the string and move it around, but if 
 you're parsing from a generic input range, you basically have to copy 
 characters one by one, which is much less efficient. So ideally the 
 algorithm should use slices whenever it can (when the input is a string).
 
 I'm not sure yet how to attack this problem, but I'm thinking that 
 perhaps parsing primitives should be "part of" the range interface. I 
 say this in the sense that a range should provide specialized 
 implementation of primitive when it can implement them more efficiently 
 (like by slicing). You wrote a while ago about designing parsing 
 primitives, is this part of Phobos now?
 
 Anyway, the problem above is probably the one reason we might want to 
 write the parser from scratch: it needs to bind to specializable 
 higher-level parsing functions to take advantage of the performance 
 characteristics of certain ranges, such as those you can slice.

There are a number of issues. One is that you should allow wchar and 
dchar in addition to char as basic character types (probably ubyte too 
for exotic encodings). In essence the char type should be a template 
parameter. The other is that perhaps you could use zero-based 
slices, i.e. s[0 .. i] as opposed to arbitrary slices s[i .. j]. A 
zero-based slice can be supported better than an arbitrary one.
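
As a rough sketch of both suggestions, the code below makes the character type a template parameter and only ever returns a zero-based slice `s[0 .. i]`. The function name `leadingName` and its ASCII-only notion of a name character are assumptions made for illustration; real XML names allow many more characters.

```d
import std.ascii : isAlphaNum;
import std.traits : isSomeChar;

/// Hypothetical helper: returns the leading run of name-like code units
/// as a zero-based slice. Works for char, wchar and dchar arrays alike.
Char[] leadingName(Char)(Char[] s)
if (isSomeChar!Char)
{
    size_t i = 0;
    // Simplified, ASCII-only test for "may appear in a name".
    while (i < s.length &&
           (isAlphaNum(s[i]) || s[i] == '_' || s[i] == '-' ||
            s[i] == ':' || s[i] == '.'))
        ++i;
    return s[0 .. i]; // always starts at 0, never s[i .. j]
}

unittest
{
    assert(leadingName("xml:lang=\"en\"") == "xml:lang"); // char
    assert(leadingName("id='x'"w) == "id"w);              // wchar
    assert(leadingName("=oops"d).length == 0);            // dchar, empty prefix
}
```

Because the scan stops before the first non-ASCII code unit, the returned slice never cuts through a multi-unit sequence in UTF-8 or UTF-16.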


Andrei

May 04 2010

Michel Fortin <michel.fortin michelf.com> writes:

On 2010-05-04 19:41:39 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Anyway, just in case, would you be interested in an XML tokenizer and 
 simple DOM following this model?
 
     http://michelf.com/docs/d/mfr/xmltok.html
     http://michelf.com/docs/d/mfr/xml.html
 
 At the base is a pull parser and an event parser mixed in the same 
 function template: "tokenize", allowing you to alternate between 
 event-based and pull-parsing at will. I'm using it, but its development 
 is on hold at this time; I'm just maintaining it so it compiles on the 
 newest versions of DMD.

 
 Sounds great, but I need to defer XML expertise to others.

If someone else wants to use it, I offer it. Otherwise I'll surely 
continue working on it eventually.


 Anyway, the problem above is probably the one reason we might want to 
 write the parser from scratch: it needs to bind to specializable 
 higher-level parsing functions to take advantage of the performance 
 characteristics of certain ranges, such as those you can slice.

 
 There are a number of issues. One is that you should allow wchar and 
 dchar in addition to char as basic character types (probably ubyte too 
 for exotic encodings). In essence the char type should be a template 
 parameter.

I totally agree about wchar and dchar... and you also need ubyte with 
an encoding detection system (checking the encoding in the xml prolog) 
to correctly parse XML files in any encoding. I think I have a ubyte 
parser, but it currently just accepts UTF-8 and then branches to the 
"string" version. (UTF-16 would be needed to implement the XML spec 
correctly.)
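
One way the detection step could look, as a sketch only: it follows the byte patterns listed in the non-normative "Autodetection of Character Encodings" appendix of the XML 1.0 specification, but covers UTF-8 and UTF-16 only. The names `XmlEncoding` and `sniffEncoding` are hypothetical. UTF-32 and EBCDIC are deliberately left out, so a UTF-32LE byte order mark would be misreported as UTF-16LE here.

```d
/// Result of inspecting the first bytes of a document (illustrative names).
enum XmlEncoding
{
    utf8,     // UTF-8 BOM, or no BOM and no declaration (the XML default)
    utf16le,
    utf16be,
    declared, // starts with "<?xm": ASCII-compatible, so the
              // encoding="..." value in the prolog must be read next
}

XmlEncoding sniffEncoding(const(ubyte)[] data)
{
    static immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] bomUtf16le = [0xFF, 0xFE];
    static immutable ubyte[] bomUtf16be = [0xFE, 0xFF];
    static immutable ubyte[] rawUtf16be = [0x00, 0x3C, 0x00, 0x3F]; // "<?" in UTF-16BE
    static immutable ubyte[] rawUtf16le = [0x3C, 0x00, 0x3F, 0x00]; // "<?" in UTF-16LE
    static immutable ubyte[] rawAscii   = [0x3C, 0x3F, 0x78, 0x6D]; // "<?xm"

    static bool hasPrefix(const(ubyte)[] d, const(ubyte)[] prefix)
    {
        return d.length >= prefix.length && d[0 .. prefix.length] == prefix;
    }

    // Byte order marks take priority over everything else.
    if (hasPrefix(data, bomUtf8))    return XmlEncoding.utf8;
    if (hasPrefix(data, bomUtf16le)) return XmlEncoding.utf16le;
    if (hasPrefix(data, bomUtf16be)) return XmlEncoding.utf16be;

    // No BOM: look at how the start of "<?xml" is laid out in memory.
    if (hasPrefix(data, rawUtf16be)) return XmlEncoding.utf16be;
    if (hasPrefix(data, rawUtf16le)) return XmlEncoding.utf16le;
    if (hasPrefix(data, rawAscii))   return XmlEncoding.declared;

    // Neither BOM nor XML declaration: the spec's default is UTF-8.
    return XmlEncoding.utf8;
}

unittest
{
    ubyte[] le = [0xFF, 0xFE, 0x3C, 0x00];
    assert(sniffEncoding(le) == XmlEncoding.utf16le);
    assert(sniffEncoding(cast(const(ubyte)[]) "<?xml version=\"1.0\"?>")
           == XmlEncoding.declared);
    assert(sniffEncoding(cast(const(ubyte)[]) "<root/>") == XmlEncoding.utf8);
}
```

A ubyte front end could call something like this once and then branch to the char, wchar or transcoding path accordingly.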


 The other is that perhaps you could be able to use zero-based slices, 
 i.e. s[0 .. i] as opposed to arbitrary slices s[i .. j]. A zero-based 
 slice can be supported better than an arbitrary one.

Yeah, well, it doesn't really work so well. You have to parse the input 
before slicing. XML doesn't tell you in advance how many characters or 
code units a string will take.

What you need is more like this:

	// Example XML Parser

	bool isAtEndOfAttributeContent(dchar c) {
		if (c == '"') return true;
		if (c == '<') throw new ParseError();
		return false;
	}

	void parseXML(T)(T input) if (isInputRange!T) {
		[...]
		case '"':
			input.popFront(); // remove leading quote
			auto content = readUntil!(isAtEndOfAttributeContent)(input);
			assert(input.front == '"');
			input.popFront(); // remove trailing quote
		[...]
	}


	// Example parsing primitive

	// String version: can slice
	string readUntil(alias isAtEndPredicate)(ref string input) {
		string savedInput = input; // remember where the content starts
		while (!input.empty && !isAtEndPredicate(input.front)) {
			input.popFront();
		}
		return savedInput[0..$-input.length];
	}

	// Generic input range version: can't slice, must copy
	immutable(ElementType!T)[] readUntil(alias isAtEndPredicate, T)(ref T input)
			if (isInputRange!T && !is(T == string)) {
		immutable(ElementType!T)[] copy; // should use appender here
		while (!input.empty) {
			auto frontChar = input.front;
			if (isAtEndPredicate(frontChar))
				break;
			copy ~= frontChar;
			input.popFront();
		}
		return copy;
	}

It's easy to appreciate the difference in performance between the 
string version and the generic version of readUntil just by looking at 
the code.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

May 04 2010

"Robert Jacques" <sandford jhu.edu> writes:

On Tue, 04 May 2010 21:55:53 -0400, Michel Fortin  
<michel.fortin michelf.com> wrote:
 	// String version: can slice
 	string readUntil(alias isAtEndPredicate)(ref string input) {
 		string savedInput = input; // remember where the content starts
 		while (!input.empty && !isAtEndPredicate(input.front)) {
 			input.popFront();
 		}
 		return savedInput[0..$-input.length];
 	}

What about using forward ranges and take?

	// Forward range version: no copy, returns a lazy view of the prefix
	Take!T readUntil(alias isAtEndPredicate, T)(ref T input) if (isForwardRange!T) {
		auto savedInput = input.save; // independent copy of the start position
		size_t n = 0;
		while (!input.empty && !isAtEndPredicate(input.front)) {
			input.popFront();
			n++;
		}
		return take(savedInput, n);
	}
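
A self-contained sketch of the forward-range idea, with a usage example. It assumes the loop is meant to stop as soon as the predicate reports the end, and it uses `auto` as the return type. The lambdas passed as predicates are placeholders for illustration.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.range : take;
import std.range.primitives;

/// Consumes `input` up to (not including) the first element for which
/// `atEnd` is true, and returns a lazy view of what was consumed.
auto readUntil(alias atEnd, R)(ref R input)
if (isForwardRange!R)
{
    auto start = input.save; // independent copy of the starting position
    size_t n = 0;
    while (!input.empty && !atEnd(input.front))
    {
        input.popFront();
        ++n;
    }
    return take(start, n);   // no characters are copied here
}

unittest
{
    // A string is a forward range of code points.
    string s = `value" rest`;
    auto content = readUntil!(c => c == '"')(s);
    assert(content.equal("value"));
    assert(s == `" rest`);

    // A lazy, non-array forward range works the same way.
    auto r = "a-b-c|tail".filter!(c => c != '-');
    auto head = readUntil!(c => c == '|')(r);
    assert(head.equal("abc"));
    assert(r.front == '|');
}
```

The trade-off is that the result is a lazy `Take` wrapper rather than a `string`, so a caller that needs contiguous storage still has to copy at that point.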

May 04 2010
