digitalmars.D - std.xml: Why is it so slow? Is there anything else wrong with it?


dsimcha <dsimcha yahoo.com> writes:

There seems to be a consensus around here that Phobos needs a good XML 
module, and that std.xml doesn't cut it, at least partly due to 
performance issues.  I have no clue how to write a good XML module from 
scratch.  It seems like no one else is taking up the project either.
This leads me to two questions:

1.  Has anyone ever sat down and tried to figure out **why** std.xml is 
so slow?  Seriously, if no one's bothered to profile it or read the code
carefully, then for all we know there might be some low hanging fruit 
and it might be an afternoon of optimization away from being reasonably 
fast.  Basically every experience I've ever had suggests that, if a 
piece of code has not already been profiled and heavily optimized, at 
least a 5-fold speedup can almost always be obtained just by optimizing 
the low-hanging fruit.  (For example, see my recent pull request for the 
D garbage collector.  BTW, if excessive allocations are a contributing 
factor, then fixing the GC should help with XML, too.)

If the answer is no, this hasn't been done, please post some canned 
benchmarks and maybe I'll take a crack at it.

2.  What other major defects/design flaws, if any, does std.xml have?

In other words, how are we really so sure that we need to start from 
scratch?

Mar 12 2011

Daniel Gibson <metalcaedes gmail.com> writes:

On 13.03.2011 05:34, dsimcha wrote:
> There seems to be a consensus around here that Phobos needs a good XML
> module, and that std.xml doesn't cut it, at least partly due to
> performance issues. I have no clue how to write a good XML module from
> scratch. It seems like no one else is taking up the project either. This
> leads me to two questions:

Isn't Tomek Sowiński working on it?

> 1. Has anyone ever sat down and tried to figure out **why** std.xml is
> so slow? Seriously, if no one's bothered to profile it or read the code
> carefully, then for all we know there might be some low hanging fruit
> and it might be an afternoon of optimization away from being reasonably
> fast. Basically every experience I've ever had suggests that, if a piece
> of code has not already been profiled and heavily optimized, at least a
> 5-fold speedup can almost always be obtained just by optimizing the
> low-hanging fruit. (For example, see my recent pull request for the D
> garbage collector. BTW, if excessive allocations are a contributing
> factor, then fixing the GC should help with XML, too.)
>
> If the answer is no, this hasn't been done, please post some canned
> benchmarks and maybe I'll take a crack at it.
>
> 2. What other major defects/design flaws, if any, does std.xml have?
>
> In other words, how are we really so sure that we need to start from
> scratch?

(These questions should probably be discussed nevertheless.)

Cheers,
- Daniel

Mar 12 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 12 March 2011 20:39:31 Daniel Gibson wrote:
> On 13.03.2011 05:34, dsimcha wrote:
> > There seems to be a consensus around here that Phobos needs a good XML
> > module, and that std.xml doesn't cut it, at least partly due to
> > performance issues. I have no clue how to write a good XML module from
> > scratch. It seems like no one else is taking up the project either. This
> > leads me to two questions:
>
> Isn't Tomek Sowiński working on it?

Yes.

> > 1. Has anyone ever sat down and tried to figure out **why** std.xml is
> > so slow? Seriously, if no one's bothered to profile it or read the code
> > carefully, then for all we know there might be some low hanging fruit
> > and it might be an afternoon of optimization away from being reasonably
> > fast. Basically every experience I've ever had suggests that, if a piece
> > of code has not already been profiled and heavily optimized, at least a
> > 5-fold speedup can almost always be obtained just by optimizing the
> > low-hanging fruit. (For example, see my recent pull request for the D
> > garbage collector. BTW, if excessive allocations are a contributing
> > factor, then fixing the GC should help with XML, too.)
> >
> > If the answer is no, this hasn't been done, please post some canned
> > benchmarks and maybe I'll take a crack at it.
> >
> > 2. What other major defects/design flaws, if any, does std.xml have?
> >
> > In other words, how are we really so sure that we need to start from
> > scratch?


As I understand it, one of the main issues is that std.xml is delegate-based.
I don't know how well it does with slicing and avoiding copying strings, but
one of the biggest advantages that D has is its array slicing. Taking full
advantage of that and avoiding string copying is one of the best ways - if
not _the_ best - to make std.xml lightning fast.
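
To make the slicing point concrete, here is a minimal sketch of what "avoiding
string copying" can look like. The helper `innerText` is hypothetical and is
not part of std.xml; it only assumes a single, well-formed element as input.

```d
import std.stdio : writeln;

/// Returns the text between the first '>' and the following '<' as a slice
/// of `src`. Hypothetical helper for illustration; not part of std.xml.
string innerText(string src)
{
    size_t start = 0;
    while (start < src.length && src[start] != '>') ++start;
    if (start == src.length) return null; // no tag found
    ++start; // skip past '>'
    size_t end = start;
    while (end < src.length && src[end] != '<') ++end;
    return src[start .. end]; // a view into src; no characters are copied
}

void main()
{
    string doc = "<title>Phobos</title>";
    string text = innerText(doc);
    assert(text == "Phobos");
    // The slice points into the original buffer rather than a fresh copy.
    assert(text.ptr == doc.ptr + 7);
    writeln(text);
}
```

The returned string is just a pointer and a length into `doc`, so extracting
it costs no allocation regardless of how long the element's text is.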

In any case, there was a discussion about std.xml recently, and the consensus
was that we should just throw it out rather than leave it there and have
people complain about how bad Phobos' XML module is.

As Daniel pointed out, Tomek Sowiński is currently working on a new std.xml.
I don't know how far along he is or when he expects it to be done, but
supposedly he's working on it and sometime reasonably soon we should have a
new std.xml to review.

We are definitely _not_ going to be working on improving the current std.xml
though. I think that the only reason that it's still there is that Andrei
didn't get around to throwing it out before the last release (or at least
deprecating it). That's definitely what he wants to do, and the consensus was
in favor of that decision.

- Jonathan M Davis

Mar 12 2011

Bekenn <leaveme alone.com> writes:

Do we want to take a look at libxml, or are there legal issues with that?

Mar 12 2011

Russel Winder <russel russel.org.uk> writes:

On Sat, 2011-03-12 at 23:34 -0500, dsimcha wrote:
> There seems to be a consensus around here that Phobos needs a good XML
> module, and that std.xml doesn't cut it, at least partly due to
> performance issues.  I have no clue how to write a good XML module from
> scratch.  It seems like no one else is taking up the project either.

I just worry that creating a whole self-standing library is a waste of
time when wrapping libxml2 and libxslt gets a fast XML subsystem for
free.  This is the direction Python has gone. cf.  the lxml package to
replace ElementTree.  The elephant in the room is of course W3C DOM.
Everyone believes they have to have an implementation, but no-one then
uses it.

> This leads me to two questions:
>
> 1.  Has anyone ever sat down and tried to figure out **why** std.xml is
> so slow?  Seriously, if no one's bothered to profile it or read the code
> carefully, then for all we know there might be some low hanging fruit
> and it might be an afternoon of optimization away from being reasonably
> fast.  Basically every experience I've ever had suggests that, if a
> piece of code has not already been profiled and heavily optimized, at
> least a 5-fold speedup can almost always be obtained just by optimizing
> the low-hanging fruit.  (For example, see my recent pull request for the
> D garbage collector.  BTW, if excessive allocations are a contributing
> factor, then fixing the GC should help with XML, too.)
>
> If the answer is no, this hasn't been done, please post some canned
> benchmarks and maybe I'll take a crack at it.
>
> 2.  What other major defects/design flaws, if any, does std.xml have?
>
> In other words, how are we really so sure that we need to start from
> scratch?

Excellent question.  Especially given the existence of libxml2 and
libxslt.

--
Russel.

Mar 13 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday 13 March 2011 01:11:05 Russel Winder wrote:
> On Sat, 2011-03-12 at 23:34 -0500, dsimcha wrote:
> > There seems to be a consensus around here that Phobos needs a good XML
> > module, and that std.xml doesn't cut it, at least partly due to
> > performance issues.  I have no clue how to write a good XML module from
> > scratch.  It seems like no one else is taking up the project either.
>
> I just worry that creating a whole self-standing library is a waste of
> time when wrapping libxml2 and libxslt gets a fast XML subsystem for
> free.  This is the direction Python has gone. cf.  the lxml package to
> replace ElementTree.  The elephant in the room is of course W3C DOM.
> Everyone believes they have to have an implementation, but no-one then
> uses it.
>
> > This leads me to two questions:
> >
> > 1.  Has anyone ever sat down and tried to figure out **why** std.xml is
> > so slow?  Seriously, if no one's bothered to profile it or read the code
> > carefully, then for all we know there might be some low hanging fruit
> > and it might be an afternoon of optimization away from being reasonably
> > fast.  Basically every experience I've ever had suggests that, if a
> > piece of code has not already been profiled and heavily optimized, at
> > least a 5-fold speedup can almost always be obtained just by optimizing
> > the low-hanging fruit.  (For example, see my recent pull request for the
> > D garbage collector.  BTW, if excessive allocations are a contributing
> > factor, then fixing the GC should help with XML, too.)
> >
> > If the answer is no, this hasn't been done, please post some canned
> > benchmarks and maybe I'll take a crack at it.
> >
> > 2.  What other major defects/design flaws, if any, does std.xml have?
> >
> > In other words, how are we really so sure that we need to start from
> > scratch?
>
> Excellent question.  Especially given the existence of libxml2 and
> libxslt.

Well, Tom is working on a new std.xml regardless, but I would fully expect a
properly implemented XML library in D to cream something like libxml. D's
slicing abilities give it a _huge_ advantage when it comes to stuff like
parsing. libxml isn't going to be able to take advantage of that. Tango's XML
parser is _extremely_ fast
( http://dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/ ),
and one of the biggest reasons for that is D's slicing abilities. Parsing is
one place where D should be able to seriously shine and is _definitely_ one of
the places where we _don't_ want to wrap a C library if we don't have to.
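
As a rough illustration of slice-based parsing, the sketch below splits markup
into tag and text tokens that all refer back to the input buffer. The function
`tokenize` is hypothetical and is not taken from Tango or any std.xml
candidate; it assumes simple input and ignores comments, CDATA, entities, and
attribute values containing '>'.

```d
import std.stdio : writeln;

/// Splits `src` into alternating tag and text tokens, each a slice of `src`.
/// Hypothetical sketch: ignores comments, CDATA, entities, and attribute
/// values containing '>'.
string[] tokenize(string src)
{
    string[] tokens;
    size_t i = 0;
    while (i < src.length)
    {
        size_t j = i;
        if (src[i] == '<')
        {
            // Scan to the end of the tag.
            while (j < src.length && src[j] != '>') ++j;
            if (j < src.length) ++j; // include the closing '>'
        }
        else
        {
            // Scan to the start of the next tag.
            while (j < src.length && src[j] != '<') ++j;
        }
        tokens ~= src[i .. j]; // stores a slice; character data is not copied
        i = j;
    }
    return tokens;
}

void main()
{
    string doc = "<a>hi</a>";
    auto tokens = tokenize(doc);
    assert(tokens == ["<a>", "hi", "</a>"]);
    // Each token's data lies inside the original buffer.
    assert(tokens[1].ptr == doc.ptr + 3);
    writeln(tokens);
}
```

Only the array holding the slices is allocated here; a range-based design
could presumably avoid even that by yielding one token at a time.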

But regardless, a new std.xml is in the works, and hopefully it'll be up for 
review within the next couple of months (I have no idea how fast Tom is making 
progress on it though).

- Jonathan M Davis

Mar 13 2011
