digitalmars.D - High performance XML parser

Tomek =?UTF-8?B?U293acWEc2tp?= (7/7) Feb 04 2011 I am now intensely accumulating information on how to go about creating ...

Steven Schveighoffer (14/33) Feb 04 2011 Here is how I would approach it (without doing any research).

Simen kjaeraas (11/23) Feb 04 2011 Question:

Steven Schveighoffer (15/42) Feb 07 2011 The goal is to avoid double-buffering data. So you are using the buffer...

Tomek =?UTF-8?B?U293acWEc2tp?= (26/42) Feb 04 2011 =20

Tomek =?UTF-8?B?U293acWEc2tp?= (27/51) Feb 04 2011 =20
Steven Schveighoffer (12/42) Feb 07 2011 That might not scale well. For instance, if you are accessing the 1500t...

Robert Jacques (5/52) Feb 07 2011 I would consider a tokenizer which can be used for SAX style parsing to ...

Michael Rynn (46/74) Feb 11 2011 XML parser

Tomek =?ISO-8859-2?B?U293afFza2k=?= (46/66) Feb 08 2011 t =20

spir (20/53) Feb 08 2011 That's very similar to what I was thinking at. What I wonder is whether,...
Steven Schveighoffer (6/40) Feb 09 2011 OK, so you mean a buffer other than the I/O buffer. This means double

Tomek =?ISO-8859-2?Q?Sowi=F1ski?= (11/15) Feb 09 2011 =20

Michel Fortin (12/33) Feb 04 2011 I agree it's important, especially when receiving XML over the network,

Tomek =?UTF-8?B?U293acWEc2tp?= (7/15) Feb 04 2011 These are valid concerns. Yet, in overwhelming majority XML documents co...

Michel Fortin (6/21) Feb 04 2011 A memory-mapped file comes from the hard drive too.
Roman Ivanov (18/31) Feb 07 2011 This reminds me of some things I was thinking about when I worked with

Denis Koroskin (26/45) Feb 04 2011 g =
Jacob Carlborg (6/11) Feb 06 2011 I don't think it's up to the parser to decide where the content comes
Kagamin (2/5) Feb 07 2011 Did you measure how much memory is wasted by markup?
Piotr Szturmaj (13/18) Feb 08 2011 Few years ago I needed to write my own parser in Delphi for my XMPP

Tomek =?UTF-8?B?U293acWEc2tp?= <just ask.me> writes:

I am now intensely accumulating information on how to go about creating a
high-performance parser as it quickly became clear that my old one won't
deliver. And if anything is clear is that memory is the key.

One way is the slicing approach mentioned on this NG, notably used by RapidXML.
I already contacted Marcin (the author) to ensure that using solutions inspired
by his lib is OK with him; it is. But I don't think I'll go this way. One
reason is, surprisingly, performance. RapidXML cannot start parsing until the
entire document is loaded and ready as a random-access string. Then it's
blazingly fast but the time for I/O has already elapsed. Besides, as Marcin
himself said, we need a 100% W3C-compliant implementation and RapidXML isn't
one.

I think a much more fertile approach is to operate on a forward range, perhaps
assuming bufferized input. That way I can start parsing as soon as the first
buffer gets filled. Not to mention that the end result will use much less
memory. Plenty of the XML data stream is indents, spaces, and markup -- there's
no reason to copy all this into memory.

To sum up, I belive memory and overlapping I/O latencies with parsing effort
are pivotal.

Please comment on this.

-- 
Tomek

Feb 04 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 04 Feb 2011 16:02:39 -0500, Tomek Sowiński <just ask.me> wrote:

 I am now intensely accumulating information on how to go about creating  
 a high-performance parser as it quickly became clear that my old one  
 won't deliver. And if anything is clear is that memory is the key.

 One way is the slicing approach mentioned on this NG, notably used by  
 RapidXML. I already contacted Marcin (the author) to ensure that using  
 solutions inspired by his lib is OK with him; it is. But I don't think  
 I'll go this way. One reason is, surprisingly, performance. RapidXML  
 cannot start parsing until the entire document is loaded and ready as a  
 random-access string. Then it's blazingly fast but the time for I/O has  
 already elapsed. Besides, as Marcin himself said, we need a 100%  
 W3C-compliant implementation and RapidXML isn't one.

 I think a much more fertile approach is to operate on a forward range,  
 perhaps assuming bufferized input. That way I can start parsing as soon  
 as the first buffer gets filled. Not to mention that the end result will  
 use much less memory. Plenty of the XML data stream is indents, spaces,  
 and markup -- there's no reason to copy all this into memory.

 To sum up, I belive memory and overlapping I/O latencies with parsing  
 effort are pivotal.

 Please comment on this.

Here is how I would approach it (without doing any research).

First, we need a buffered I/O system where you can easily access and  
manipulate the buffer.  I have proposed one a few months ago in this NG.

Second, I'd implement the XML lib as a range where "front()" gives you an  
XMLNode.  If the XMLNode is an element, it will have eager access to the  
element tag, and lazy access to the attributes and the sub-nodes.  Each  
XMLNode will provide a forward range for the child nodes.

Thus you can "skip" whole elements in the stream by popFront'ing a range,  
and dive deeper via accessing the nodes of the range.

I'm unsure how well this will work, or if you can accomplish all of it  
without reallocation (in particular, you may need to store the element  
information, maybe via a specialized member function?).

-Steve

Feb 04 2011

"Simen kjaeraas" <simen.kjaras gmail.com> writes:

Steven Schveighoffer <schveiguy yahoo.com> wrote:

 Here is how I would approach it (without doing any research).

 First, we need a buffered I/O system where you can easily access and  
 manipulate the buffer.  I have proposed one a few months ago in this NG.

 Second, I'd implement the XML lib as a range where "front()" gives you  
 an XMLNode.  If the XMLNode is an element, it will have eager access to  
 the element tag, and lazy access to the attributes and the sub-nodes.   
 Each XMLNode will provide a forward range for the child nodes.

 Thus you can "skip" whole elements in the stream by popFront'ing a  
 range, and dive deeper via accessing the nodes of the range.

 I'm unsure how well this will work, or if you can accomplish all of it  
 without reallocation (in particular, you may need to store the element  
 information, maybe via a specialized member function?).

Question:

For the lazily computed attributes and subnodes, will accessing one element
cause all elements to be computed? Same goes for getting the number of
elements.

Also, can this be efficiently combined with mmapping? What I sorta imagine
is a kind of lazy slice: It determines whether it ends within this page,  
and
if not, does not progress past that page until asked to do so.

-- 
Simen

Feb 04 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 04 Feb 2011 17:03:08 -0500, Simen kjaeraas  
<simen.kjaras gmail.com> wrote:

 Steven Schveighoffer <schveiguy yahoo.com> wrote:

 Here is how I would approach it (without doing any research).

 First, we need a buffered I/O system where you can easily access and  
 manipulate the buffer.  I have proposed one a few months ago in this NG.

 Second, I'd implement the XML lib as a range where "front()" gives you  
 an XMLNode.  If the XMLNode is an element, it will have eager access to  
 the element tag, and lazy access to the attributes and the sub-nodes.   
 Each XMLNode will provide a forward range for the child nodes.

 Thus you can "skip" whole elements in the stream by popFront'ing a  
 range, and dive deeper via accessing the nodes of the range.

 I'm unsure how well this will work, or if you can accomplish all of it  
 without reallocation (in particular, you may need to store the element  
 information, maybe via a specialized member function?).

 Question:

 For the lazily computed attributes and subnodes, will accessing one  
 element
 cause all elements to be computed? Same goes for getting the number of
 elements.

The goal is to avoid double-buffering data.  So you are using the buffer  
of the input stream to contain all data.  So, advancing to the 'next'  
element/node/attribute makes the previous element/node/attribute invalid  
(i.e. the buffer is reused).

The trick is to make it seem like the node is fully there without actually  
reading the stream until you need it (hence the lazy part), because  
reading the entire node means reading the entire file (in the case of the  
root element).

 Also, can this be efficiently combined with mmapping? What I sorta  
 imagine
 is a kind of lazy slice: It determines whether it ends within this page,  
 and
 if not, does not progress past that page until asked to do so.

mmaping would make things more accessible, but the common denominator is  
not mmap.  If it's supported as a special case, then maybe it can offer  
some interesting features, but something like mmap can't be done for say a  
network stream.

-Steve

Feb 07 2011

Tomek =?UTF-8?B?U293acWEc2tp?= <just ask.me> writes:

Steven Schveighoffer napisa=C5=82:

 Here is how I would approach it (without doing any research).
=20
 First, we need a buffered I/O system where you can easily access and =20
 manipulate the buffer.  I have proposed one a few months ago in this NG.
=20
 Second, I'd implement the XML lib as a range where "front()" gives you an=

 =20
 XMLNode.  If the XMLNode is an element, it will have eager access to the =

=20
 element tag, and lazy access to the attributes and the sub-nodes.  Each =

=20
 XMLNode will provide a forward range for the child nodes.
=20
 Thus you can "skip" whole elements in the stream by popFront'ing a range,=

 =20
 and dive deeper via accessing the nodes of the range.
=20
 I'm unsure how well this will work, or if you can accomplish all of it =20
 without reallocation (in particular, you may need to store the element =20
 information, maybe via a specialized member function?).

Heh, yesterday when I couldn't sleep I was sketching the design. I converge=
d to a pretty much same concept, so your comment is reassuring :).

The design I'm thinking is that the node iterator will own a buffer. One co=
nsequence is that the fields of the current node will point to the buffer a=
kin to foreach(line; File.byLine), so in order to lift the input the user w=
ill have to dup (or process the node in-place). As new nodes will be overwr=
itten on the same piece of memory, an important trait of the design emerges=
: cache intensity. Because of XML namespaces I think it is necessary for th=
e buffer to contain the current node plus all its parents. Namespaces are t=
he technical reason but having access to the path all the way to the root n=
ode is of value, regardless. This suggests mark-release memory management. =
The buffer will have to be long enough to fit the deepest tag sequence: the=
oretically infinite, not that much in practice. Like I said, the buffer wil=
l be owned by the iterator so probably deterministic deallocation is possib=
le when the processing is done.

One drawback is that you won't know you're dealing with a well-formed DOM u=
ntil the closing tag comes. If it doesn't, it'll of course throw, but the m=
alformed DOM may already have been digested. So providing some rollback pos=
sibility is up to the user.

--=20
Tomek

Feb 04 2011

Tomek =?UTF-8?B?U293acWEc2tp?= <just ask.me> writes:

 Steven Schveighoffer napisa=C5=82:
=20
 Here is how I would approach it (without doing any research).
=20
 First, we need a buffered I/O system where you can easily access and =20
 manipulate the buffer.  I have proposed one a few months ago in this NG.
=20
 Second, I'd implement the XML lib as a range where "front()" gives you =


an =20
 XMLNode.  If the XMLNode is an element, it will have eager access to th=


e =20
 element tag, and lazy access to the attributes and the sub-nodes.  Each=


 =20
 XMLNode will provide a forward range for the child nodes.
=20
 Thus you can "skip" whole elements in the stream by popFront'ing a rang=


e, =20
 and dive deeper via accessing the nodes of the range.
=20
 I'm unsure how well this will work, or if you can accomplish all of it =


=20
 without reallocation (in particular, you may need to store the element =


=20
 information, maybe via a specialized member function?).

=20
 Heh, yesterday when I couldn't sleep I was sketching the design. I conver=

ged to a pretty much same concept, so your comment is reassuring :).
=20
 The design I'm thinking is that the node iterator will own a buffer. One =

consequence is that the fields of the current node will point to the buffer=
 akin to foreach(line; File.byLine), so in order to lift the input the user=
 will have to dup (or process the node in-place). As new nodes will be over=
written on the same piece of memory, an important trait of the design emerg=
es: cache intensity. Because of XML namespaces I think it is necessary for =
the buffer to contain the current node plus all its parents. Namespaces are=
 the technical reason but having access to the path all the way to the root=
 node is of value, regardless. This suggests mark-release memory management=
. The buffer will have to be long enough to fit the deepest tag sequence: t=
heoretically infinite, not that much in practice. Like I said, the buffer w=
ill be owned by the iterator so probably deterministic deallocation is poss=
ible when the processing is done.
=20
 One drawback is that you won't know you're dealing with a well-formed DOM=

 until the closing tag comes. If it doesn't, it'll of course throw, but the=
 malformed DOM may already have been digested. So providing some rollback p=
ossibility is up to the user.
=20
Oh, and the direction of iteration (deeper/farther) will of course be contr=
ollable in fashion you presented.

--=20
Tomek

Feb 04 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just ask.me> wrote:

 Steven Schveighoffer napisał:

 Here is how I would approach it (without doing any research).

 First, we need a buffered I/O system where you can easily access and
 manipulate the buffer.  I have proposed one a few months ago in this NG.

 Second, I'd implement the XML lib as a range where "front()" gives you  
 an
 XMLNode.  If the XMLNode is an element, it will have eager access to the
 element tag, and lazy access to the attributes and the sub-nodes.  Each
 XMLNode will provide a forward range for the child nodes.

 Thus you can "skip" whole elements in the stream by popFront'ing a  
 range,
 and dive deeper via accessing the nodes of the range.

 I'm unsure how well this will work, or if you can accomplish all of it
 without reallocation (in particular, you may need to store the element
 information, maybe via a specialized member function?).

 Heh, yesterday when I couldn't sleep I was sketching the design. I  
 converged to a pretty much same concept, so your comment is reassuring  
 :).

 The design I'm thinking is that the node iterator will own a buffer. One  
 consequence is that the fields of the current node will point to the  
 buffer akin to foreach(line; File.byLine), so in order to lift the input  
 the user will have to dup (or process the node in-place). As new nodes  
 will be overwritten on the same piece of memory, an important trait of  
 the design emerges: cache intensity. Because of XML namespaces I think  
 it is necessary for the buffer to contain the current node plus all its  
 parents.

That might not scale well.  For instance, if you are accessing the 1500th  
child element of a parent, doesn't that mean that the buffer must contain  
the full text for the previous 1499 elements in order to also contain the  
parent?

Maybe I'm misunderstanding what you mean.

I would start out with a non-compliant parser, but one that allocates  
nothing beyond the I/O buffer, one that simply parses lazily and can be  
used as well as a SAX parser.  Then see how much extra allocations we need  
to get it to be compliant.  Then, one can choose the compliancy level  
based on what performance penalties one is willing to incur.

-Steve

Feb 07 2011

"Robert Jacques" <sandford jhu.edu> writes:

On Mon, 07 Feb 2011 07:40:30 -0500, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just ask.me> wrote:

 Steven Schveighoffer napisał:

 Here is how I would approach it (without doing any research).

 First, we need a buffered I/O system where you can easily access and
 manipulate the buffer.  I have proposed one a few months ago in this  
 NG.

 Second, I'd implement the XML lib as a range where "front()" gives you  
 an
 XMLNode.  If the XMLNode is an element, it will have eager access to  
 the
 element tag, and lazy access to the attributes and the sub-nodes.  Each
 XMLNode will provide a forward range for the child nodes.

 Thus you can "skip" whole elements in the stream by popFront'ing a  
 range,
 and dive deeper via accessing the nodes of the range.

 I'm unsure how well this will work, or if you can accomplish all of it
 without reallocation (in particular, you may need to store the element
 information, maybe via a specialized member function?).

 Heh, yesterday when I couldn't sleep I was sketching the design. I  
 converged to a pretty much same concept, so your comment is reassuring  
 :).

 The design I'm thinking is that the node iterator will own a buffer.  
 One consequence is that the fields of the current node will point to  
 the buffer akin to foreach(line; File.byLine), so in order to lift the  
 input the user will have to dup (or process the node in-place). As new  
 nodes will be overwritten on the same piece of memory, an important  
 trait of the design emerges: cache intensity. Because of XML namespaces  
 I think it is necessary for the buffer to contain the current node plus  
 all its parents.

 That might not scale well.  For instance, if you are accessing the  
 1500th child element of a parent, doesn't that mean that the buffer must  
 contain the full text for the previous 1499 elements in order to also  
 contain the parent?

 Maybe I'm misunderstanding what you mean.

 I would start out with a non-compliant parser, but one that allocates  
 nothing beyond the I/O buffer, one that simply parses lazily and can be  
 used as well as a SAX parser.  Then see how much extra allocations we  
 need to get it to be compliant.  Then, one can choose the compliancy  
 level based on what performance penalties one is willing to incur.

 -Steve

I would consider a tokenizer which can be used for SAX style parsing to be  
a key feature of std.xml. I know it was considered very important when I  
was gathering requirements for my std.JSON re-write.

Feb 07 2011

Michael Rynn <michaelrynn optusnet.com.au> writes:

On Mon, 07 Feb 2011 10:37:46 -0500, Robert Jacques wrote:

 On Mon, 07 Feb 2011 07:40:30 -0500, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:
 
 On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just ask.me> wrote:

 Steven Schveighoffer napisał:

 Here is how I would approach it (without doing any research).

 First, we need a buffered I/O system where you can easily access and
 manipulate the buffer.  I have proposed one a few months ago in this
 NG.

 Second, I'd implement the XML lib as a range where "front()" gives
 you an
 XMLNode.  If the XMLNode is an element, it will have eager access to
 the
 element tag, and lazy access to the attributes and the sub-nodes. 
 Each XMLNode will provide a forward range for the child nodes.

 Thus you can "skip" whole elements in the stream by popFront'ing a
 range,
 and dive deeper via accessing the nodes of the range.




 
 I would consider a tokenizer which can be used for SAX style parsing to
 be a key feature of std.xml. I know it was considered very important
 when I was gathering requirements for my std.JSON re-write.

XML parser
/dsource/xmlp/trunk/std

I have experimented with various means to balance efficiency and 
flexibility with XML parsing.

The core parsing uses an ItemReturn struct.
	This returns transient pointers to reused buffers, so that there 
is reduced memory buffer reallocation for just churning through the XML 
document.

	
struct ItemReturn {
	ItemType		type = ItemType.RET_NULL;
	char[]			scratch;
	char[][]		names;
	char[][]		values;
}

The central parse method fills the ItemReturn with transient tag names, 
and pointers to names and values, somewhat like a SAX parser.

To measuring performance components,  take the throughput of making a XML 
document using a linked DOM tree structure as 100%, with validation, 
attribute normalisation.

This implementation, buffers for file or string sources, I get this 
breakdown of processing. This is done on books.xml.   Other examples of 
documents, with different structure, entities, DTD, schema, namespaces, 
will differ.

Input overhead.
Throughput of examining each single Unicode character in the document, as 
a dchar value.   About 12-15% of time. So there is not a relatively great 
cost in the input buffering.

Tag, attribute and content throughput.
Parsing and filling the ItemReturn struct for each parse method call, 
called for every identifyable XML element token in the document sequence, 
about 60% of time, without doing anything with the result. This also 
includes Input Overhead.  No DOM structure is assumed, and their is no 
recursion.  The general sort of work done is keeping track of state, and 
assembling and returning the various types of tokens.

To actually build a full DOM, without doing much in the way of 
validation, and attribute normalisation, is up to about 86% of total time.
This includes converting the returned transient buffered values of tags, 
attributes names, values, and content into string, and the creation and 
linking of DOM nodes.   This involves some recursive method calls that 
mirror the XML document structure. 
Some time and memory seems to be saved by aliasing tag and attribute 
names using an AA. This takes about 85% of the full job.
 

Additional validation and attribute normalisation takes more time.

Feb 11 2011

Tomek =?ISO-8859-2?B?U293afFza2k=?= <just ask.me> writes:

Steven Schveighoffer napisa=B3:

 The design I'm thinking is that the node iterator will own a buffer. On=


e =20
 consequence is that the fields of the current node will point to the =20
 buffer akin to foreach(line; File.byLine), so in order to lift the inpu=


t =20
 the user will have to dup (or process the node in-place). As new nodes =


=20
 will be overwritten on the same piece of memory, an important trait of =


=20
 the design emerges: cache intensity. Because of XML namespaces I think =


=20
 it is necessary for the buffer to contain the current node plus all its=


 =20
 parents. =20

=20
 That might not scale well.  For instance, if you are accessing the 1500th=

 =20
 child element of a parent, doesn't that mean that the buffer must contain=

 =20
 the full text for the previous 1499 elements in order to also contain the=

 =20
 parent?
=20
 Maybe I'm misunderstanding what you mean.

Let's talk on an example:

<a name=3D"value">
	<b>
		Some Text 1
		<c2>      <!-- HERE -->
		Some text 2
		</c2>
		Some Text 3
	</b>
</a>

The buffer of the iterator positioned HERE would be:

[Node a | Node b | Node c2]

Node c2 and all its parents are available for inspection. Node a's attribut=
e is stored in the buffer, but not b's "Some text 1" as it is c2's sibling;=
 "Some text 1" was available in the previous iteration, now it's overwritte=
n by c2. To get to "Some text 2" let's advance the iterator in depth to get:

[Node a | Node b | Node c2 | Text node "Some text 2"]

Advancing it once more we get to:

[Node a | Node b | Text node "Some text 3"]

So "Some text 3" is written where c2 and the text node 2 used to be.

The element type of the range would always be the child, parents available =
through pointers:

foreach (node; xmlRange) {
	doStuff(node);
	if (Node* parent =3D node.parent)
		doOtherStuff(parent);
}

Having no access to siblings is quite limiting but the iterator can form an=
 efficient (zero-allocation) basis on which more convenient schemes are bui=
lt upon. It's still just brain-storming, though. I fear there's something t=
hat'll make the whole idea crash & burn.

 I would start out with a non-compliant parser, but one that allocates =20
 nothing beyond the I/O buffer, one that simply parses lazily and can be =

=20
 used as well as a SAX parser.  Then see how much extra allocations we nee=

d =20
 to get it to be compliant.  Then, one can choose the compliancy level =20
 based on what performance penalties one is willing to incur.

Yeah, 100% compliance is a long way.

--=20
Tomek

Feb 08 2011

spir <denis.spir gmail.com> writes:

On 02/09/2011 01:16 AM, Tomek Sowiński wrote:
 Steven Schveighoffer napisał:

 The design I'm thinking is that the node iterator will own a buffer. One
 consequence is that the fields of the current node will point to the
 buffer akin to foreach(line; File.byLine), so in order to lift the input
 the user will have to dup (or process the node in-place). As new nodes
 will be overwritten on the same piece of memory, an important trait of
 the design emerges: cache intensity. Because of XML namespaces I think
 it is necessary for the buffer to contain the current node plus all its
 parents.

 That might not scale well.  For instance, if you are accessing the 1500th
 child element of a parent, doesn't that mean that the buffer must contain
 the full text for the previous 1499 elements in order to also contain the
 parent?

 Maybe I'm misunderstanding what you mean.

 Let's talk on an example:

 <a name="value">
 	<b>
 		Some Text 1
 		<c2>       <!-- HERE -->
 		Some text 2
 		</c2>
 		Some Text 3
 	</b>
 </a>

 The buffer of the iterator positioned HERE would be:

 [Node a | Node b | Node c2]

 Node c2 and all its parents are available for inspection. Node a's attribute
is stored in the buffer, but not b's "Some text 1" as it is c2's sibling; "Some
text 1" was available in the previous iteration, now it's overwritten by c2. To
get to "Some text 2" let's advance the iterator in depth to get:

 [Node a | Node b | Node c2 | Text node "Some text 2"]

 Advancing it once more we get to:

 [Node a | Node b | Text node "Some text 3"]

 So "Some text 3" is written where c2 and the text node 2 used to be.

That's very similar to what I was thinking at. What I wonder is whether, in 
your buffer representations above, 'Node x' represents an instanciated node, or 
collected data necessary to later instanciate --once the current part of the 
source is validated (proved correct). What i mean is, the whole <a> node will 
be validated only when </a> is matched, so that if your parsing process fails 
on the way (and/or it was just following a wrong parsing path), then all nodes 
instanciated along tha way are just to throw away, aren't they (unless some 
memoisation may be useful).
Possibly all what say here is just stupid, depending on the parsing algo and 
nature of the grammar, and also how costly Node creation is. In my case, all of 
this seems relevant. Thus, I'm thinking at just collecting data along the way 
(rather easy & far cheaper than node construction), and (recursively) 
instanciate only once a section is validated. (needs to be tried concretely 
--maybe there are issues I'm not yet aware of)

Denis
-- 
_________________
vita es estrany
spir.wikidot.com

Feb 08 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 08 Feb 2011 19:16:37 -0500, Tomek Sowiński <just ask.me> wrote:

 Steven Schveighoffer napisał:

 The design I'm thinking is that the node iterator will own a buffer.  

 One
 consequence is that the fields of the current node will point to the
 buffer akin to foreach(line; File.byLine), so in order to lift the  

 input
 the user will have to dup (or process the node in-place). As new nodes
 will be overwritten on the same piece of memory, an important trait of
 the design emerges: cache intensity. Because of XML namespaces I think
 it is necessary for the buffer to contain the current node plus all  

 its
 parents.

 That might not scale well.  For instance, if you are accessing the  
 1500th
 child element of a parent, doesn't that mean that the buffer must  
 contain
 the full text for the previous 1499 elements in order to also contain  
 the
 parent?

 Maybe I'm misunderstanding what you mean.

 Let's talk on an example:

 <a name="value">
 	<b>
 		Some Text 1
 		<c2>      <!-- HERE -->
 		Some text 2
 		</c2>
 		Some Text 3
 	</b>
 </a>

 The buffer of the iterator positioned HERE would be:

 [Node a | Node b | Node c2]

OK, so you mean a buffer other than the I/O buffer.  This means double  
buffering data.  I was thinking of a solution that allows simply using the  
I/O buffer for parsing.  I think this is one of the keys to Tango's xml  
performance.

-Steve

Feb 09 2011

Tomek =?ISO-8859-2?Q?Sowi=F1ski?= <just ask.me> writes:

Steven Schveighoffer napisa=B3:

 OK, so you mean a buffer other than the I/O buffer.  This means double =20
 buffering data.  I was thinking of a solution that allows simply using th=

e =20
 I/O buffer for parsing.  I think this is one of the keys to Tango's xml =

=20
 performance.

I'd be glad to hear what's your idea. I think they are convergent. In mine,=
 the I/O could be asked to dump data to the iterator's buffer at a given po=
sition (right to previous nodes), then the iterator forms a node out of raw=
 data. Some moving would be done but all within the cached buffer so should=
 be quick. I guess it's as far as I can predict performance in a newsgroup =
post. ;-) Gotta write some code and whip out the stopwatch, then we'll see.

--=20
Tomek

Feb 09 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-02-04 16:02:39 -0500, Tomek Sowiński <just ask.me> said:

 I am now intensely accumulating information on how to go about creating 
 a high-performance parser as it quickly became clear that my old one 
 won't deliver. And if anything is clear is that memory is the key.
 
 One way is the slicing approach mentioned on this NG, notably used by 
 RapidXML. I already contacted Marcin (the author) to ensure that using 
 solutions inspired by his lib is OK with him; it is. But I don't think 
 I'll go this way. One reason is, surprisingly, performance. RapidXML 
 cannot start parsing until the entire document is loaded and ready as a 
 random-access string. Then it's blazingly fast but the time for I/O has 
 already elapsed. Besides, as Marcin himself said, we need a 100% 
 W3C-compliant implementation and RapidXML isn't one.
 
 I think a much more fertile approach is to operate on a forward range, 
 perhaps assuming bufferized input. That way I can start parsing as soon 
 as the first buffer gets filled. Not to mention that the end result 
 will use much less memory. Plenty of the XML data stream is indents, 
 spaces, and markup -- there's no reason to copy all this into memory.
 
 To sum up, I belive memory and overlapping I/O latencies with parsing 
 effort are pivotal.

I agree it's important, especially when receiving XML over the network, 
but I also think it's important to be able to be able to support 
slicing. Imagine all the memory you could save by just making slices of 
a memory-mapped file.

The difficulty is to support both models: the input range model which 
requires copying the strings and the slicing model where you're just 
taking slices of a string.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Feb 04 2011

Tomek =?UTF-8?B?U293acWEc2tp?= <just ask.me> writes:

Michel Fortin napisa=C5=82:

 I agree it's important, especially when receiving XML over the network,=20
 but I also think it's important to be able to be able to support=20
 slicing. Imagine all the memory you could save by just making slices of=20
 a memory-mapped file.
=20
 The difficulty is to support both models: the input range model which=20
 requires copying the strings and the slicing model where you're just=20
 taking slices of a string.

These are valid concerns. Yet, in overwhelming majority XML documents come =
from hard-drive and network -- these are the places we need to drill. I fea=
r that trying to cover every remote use case will render the library incomp=
rehensible.

--=20
Tomek

Feb 04 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-02-04 16:47:26 -0500, Tomek Sowiński <just ask.me> said:

 Michel Fortin napisał:
 
 I agree it's important, especially when receiving XML over the network,
 but I also think it's important to be able to be able to support
 slicing. Imagine all the memory you could save by just making slices of
 a memory-mapped file.
 
 The difficulty is to support both models: the input range model which
 requires copying the strings and the slicing model where you're just
 taking slices of a string.

 
 These are valid concerns. Yet, in overwhelming majority XML documents come
 from hard-drive and network -- these are the places we need to drill. I fea
 r that trying to cover every remote use case will render the library incomp
 rehensible.

A memory-mapped file comes from the hard drive too.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Feb 04 2011

Roman Ivanov <isroman.DEL ETE.km.ru> writes:

On 2/4/2011 4:47 PM, Tomek Sowiński wrote:
 Michel Fortin napisał:
 
 I agree it's important, especially when receiving XML over the network, 
 but I also think it's important to be able to be able to support 
 slicing. Imagine all the memory you could save by just making slices of 
 a memory-mapped file.

 The difficulty is to support both models: the input range model which 
 requires copying the strings and the slicing model where you're just 
 taking slices of a string.

 
 These are valid concerns. Yet, in overwhelming majority XML documents come
from hard-drive and network -- these are the places we need to drill. I fear
that trying to cover every remote use case will render the library
incomprehensible.
 

This reminds me of some things I was thinking about when I worked with
some XML-heavy apps in Java and experimented with writing parsers for my
own simple markup languages.

If you have the entire XML string loaded in memory, the most
time-consuming part of parsing it is probably going to be allocation of
node objects. So it makes sense to do a quick scan of the char array,
and generate just the root node, which would lazily allocate sub-nodes
upon access.

I can see several different implementation of a high-performance parser,
depending on the typical use case. Do you want to work efficiently with
lots of small files or one huge file? Deeply nested or mostly flat?
Coming from memory or a stream of characters?

Problem is, with lazy parsing XML nodes would need to be able to call
upon the parser that created them.

Perhaps it would be possible to specify some kind of generic XML node
interface and allow people to use/generate different implementations
depending on what they need?

Feb 07 2011

"Denis Koroskin" <2korden gmail.com> writes:

On Sat, 05 Feb 2011 00:02:39 +0300, Tomek Sowi=C5=84ski <just ask.me> wr=
ote:

 I am now intensely accumulating information on how to go about creatin=

g  =

 a high-performance parser as it quickly became clear that my old one  =

 won't deliver. And if anything is clear is that memory is the key.

 One way is the slicing approach mentioned on this NG, notably used by =

 =

 RapidXML. I already contacted Marcin (the author) to ensure that using=

  =

 solutions inspired by his lib is OK with him; it is. But I don't think=

  =

 I'll go this way. One reason is, surprisingly, performance. RapidXML  =

 cannot start parsing until the entire document is loaded and ready as =

a  =

 random-access string. Then it's blazingly fast but the time for I/O ha=

s  =

 already elapsed. Besides, as Marcin himself said, we need a 100%  =

 W3C-compliant implementation and RapidXML isn't one.

 I think a much more fertile approach is to operate on a forward range,=

  =

 perhaps assuming bufferized input. That way I can start parsing as soo=

n  =

 as the first buffer gets filled. Not to mention that the end result wi=

ll  =

 use much less memory. Plenty of the XML data stream is indents, spaces=

,  =

 and markup -- there's no reason to copy all this into memory.

 To sum up, I belive memory and overlapping I/O latencies with parsing =

 =

 effort are pivotal.

 Please comment on this.

I don't have much experience with XML, but as far as I can tell DOM pars=
er  =

pretty much needs to store entire file in memory. You can also load and =
 =

parse files asynchronously.
By the contrary, SAX parsers don't require having entire file in memory,=
  =

but that's completely different approach to XML parsing.


I'd also recommend you to take a look at pugixml, which is being develop=
ed  =

and supported by my co-worker since 2006. It is free (MIT license), smal=
l,  =

lightweight, fast, clean and has very good documentation.

Feb 04 2011

Jacob Carlborg <doob me.com> writes:

On 2011-02-04 22:02, Tomek Sowiński wrote:
 I am now intensely accumulating information on how to go about creating a
high-performance parser as it quickly became clear that my old one won't
deliver. And if anything is clear is that memory is the key.

 One way is the slicing approach mentioned on this NG, notably used by
RapidXML. I already contacted Marcin (the author) to ensure that using
solutions inspired by his lib is OK with him; it is. But I don't think I'll go
this way. One reason is, surprisingly, performance. RapidXML cannot start
parsing until the entire document is loaded and ready as a random-access
string. Then it's blazingly fast but the time for I/O has already elapsed.
Besides, as Marcin himself said, we need a 100% W3C-compliant implementation
and RapidXML isn't one.

 I think a much more fertile approach is to operate on a forward range, perhaps
assuming bufferized input. That way I can start parsing as soon as the first
buffer gets filled. Not to mention that the end result will use much less
memory. Plenty of the XML data stream is indents, spaces, and markup -- there's
no reason to copy all this into memory.

 To sum up, I belive memory and overlapping I/O latencies with parsing effort
are pivotal.

 Please comment on this.

I don't think it's up to the parser to decide where the content comes 
from. It should be able to handle the whole content of an XML file in 
memory.

-- 
/Jacob Carlborg

Feb 06 2011

Kagamin <spam here.lot> writes:

Tomek Sowiński Wrote:

 One way is the slicing approach mentioned on this NG, notably used by
RapidXML. I already contacted Marcin (the author) to ensure that using
solutions inspired by his lib is OK with him; it is. But I don't think I'll go
this way. One reason is, surprisingly, performance. RapidXML cannot start
parsing until the entire document is loaded and ready as a random-access
string. Then it's blazingly fast but the time for I/O has already elapsed.
Besides, as Marcin himself said, we need a 100% W3C-compliant implementation
and RapidXML isn't one.
 
 I think a much more fertile approach is to operate on a forward range, perhaps
assuming bufferized input. That way I can start parsing as soon as the first
buffer gets filled. Not to mention that the end result will use much less
memory. Plenty of the XML data stream is indents, spaces, and markup -- there's
no reason to copy all this into memory.

Did you measure how much memory is wasted by markup?

Feb 07 2011

Piotr Szturmaj <bncrbme jadamspam.pl> writes:

Tomek Sowiński wrote:
 I am now intensely accumulating information on how to go about creating a
high-performance parser as it quickly became clear that my old one won't
deliver. And if anything is clear is that memory is the key.

 One way is the slicing approach mentioned on this NG, notably used by
RapidXML. I already contacted Marcin (the author) to ensure that using
solutions inspired by his lib is OK with him; it is. But I don't think I'll go
this way. One reason is, surprisingly, performance. RapidXML cannot start
parsing until the entire document is loaded and ready as a random-access
string. Then it's blazingly fast but the time for I/O has already elapsed.
Besides, as Marcin himself said, we need a 100% W3C-compliant implementation
and RapidXML isn't one.

 I think a much more fertile approach is to operate on a forward range, perhaps
assuming bufferized input. That way I can start parsing as soon as the first
buffer gets filled. Not to mention that the end result will use much less
memory. Plenty of the XML data stream is indents, spaces, and markup -- there's
no reason to copy all this into memory.

 To sum up, I belive memory and overlapping I/O latencies with parsing effort
are pivotal.

 Please comment on this.

Few years ago I needed to write my own parser in Delphi for my XMPP 
(Jabber) client and server. In XMPP you get socket-streamed XML document 
with xml elements as protocol messages ("XMPP stanzas").
The problem I had was no Delphi parser had hybrid support of SAX/DOM, 
i.e. I wanted to parse xml nodes like SAX, but when I received whole 
message then I wanted to return it as XML Element (like in DOM). This 
way I could easily process incoming messages - now it's accomplishable 
with pull parsers.

I think std needs both SAX/pull and DOM parsers. For DOM, if whole 
document is in memory, maybe this approach could be advantageous:

http://en.wikipedia.org/wiki/VTD-XML
http://vtd-xml.sourceforge.net/

Feb 08 2011

D Programming

C/C++ Programming

Other

digitalmars.D - High performance XML parser