
digitalmars.D - High performance XML parser

reply Tomek Sowiński <just ask.me> writes:
I am now intensely accumulating information on how to go about creating a
high-performance parser as it quickly became clear that my old one won't
deliver. And if anything is clear, it is that memory is the key.

One way is the slicing approach mentioned on this NG, notably used by RapidXML.
I already contacted Marcin (the author) to ensure that using solutions inspired
by his lib is OK with him; it is. But I don't think I'll go this way. One
reason is, surprisingly, performance. RapidXML cannot start parsing until the
entire document is loaded and ready as a random-access string. Then it's
blazingly fast but the time for I/O has already elapsed. Besides, as Marcin
himself said, we need a 100% W3C-compliant implementation and RapidXML isn't
one.

I think a much more fertile approach is to operate on a forward range, perhaps
assuming bufferized input. That way I can start parsing as soon as the first
buffer gets filled. Not to mention that the end result will use much less
memory. Plenty of the XML data stream is indents, spaces, and markup -- there's
no reason to copy all this into memory.

To sum up, I believe memory and overlapping I/O latencies with parsing effort
are pivotal.

Please comment on this.
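The overlap argued for above -- start emitting parsed pieces while later buffers are still in flight -- can be sketched roughly as follows. This is an illustrative Python toy, not D, and `feed_chunks` with its two token kinds is a hypothetical simplification (no attributes, entities, or error handling); the point is only that a piece is yielded as soon as the chunk completing it arrives.

```python
# Chunk-fed scanner: emits complete "markup"/"text" pieces incrementally,
# carrying any piece that straddles a buffer boundary over to the next chunk.
def feed_chunks(chunks):
    pending = ""
    for chunk in chunks:
        pending += chunk
        while True:
            if pending.startswith("<"):
                end = pending.find(">")
                if end == -1:
                    break  # tag continues in a later chunk
                yield ("markup", pending[:end + 1])
                pending = pending[end + 1:]
            else:
                start = pending.find("<")
                if start == -1:
                    break  # text may continue in a later chunk
                if start > 0:
                    yield ("text", pending[:start])
                pending = pending[start:]
    if pending:
        yield ("text", pending)
```

Note that the `</` split across the two buffers below is handled by the carry-over, which is exactly the case a slicing parser over a fully loaded string never has to think about.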

-- 
Tomek
Feb 04 2011
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 04 Feb 2011 16:02:39 -0500, Tomek Sowiński <just ask.me> wrote:

 I am now intensely accumulating information on how to go about creating  
 a high-performance parser as it quickly became clear that my old one  
 won't deliver. And if anything is clear, it is that memory is the key.

 One way is the slicing approach mentioned on this NG, notably used by  
 RapidXML. I already contacted Marcin (the author) to ensure that using  
 solutions inspired by his lib is OK with him; it is. But I don't think  
 I'll go this way. One reason is, surprisingly, performance. RapidXML  
 cannot start parsing until the entire document is loaded and ready as a  
 random-access string. Then it's blazingly fast but the time for I/O has  
 already elapsed. Besides, as Marcin himself said, we need a 100%  
 W3C-compliant implementation and RapidXML isn't one.

 I think a much more fertile approach is to operate on a forward range,  
 perhaps assuming bufferized input. That way I can start parsing as soon  
 as the first buffer gets filled. Not to mention that the end result will  
 use much less memory. Plenty of the XML data stream is indents, spaces,  
 and markup -- there's no reason to copy all this into memory.

 To sum up, I believe memory and overlapping I/O latencies with parsing
 effort are pivotal.

 Please comment on this.

Here is how I would approach it (without doing any research).

First, we need a buffered I/O system where you can easily access and manipulate the buffer. I have proposed one a few months ago in this NG.

Second, I'd implement the XML lib as a range where "front()" gives you an XMLNode. If the XMLNode is an element, it will have eager access to the element tag, and lazy access to the attributes and the sub-nodes. Each XMLNode will provide a forward range for the child nodes.

Thus you can "skip" whole elements in the stream by popFront'ing a range, and dive deeper via accessing the nodes of the range.

I'm unsure how well this will work, or if you can accomplish all of it without reallocation (in particular, you may need to store the element information, maybe via a specialized member function?).

-Steve
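The skip/dive interface Steve proposes can be modelled in a few lines. Python here stands in for D ranges, and every name (`NodeRange`, the event tuples) is a hypothetical sketch, not an actual API: front() shows the current node event, pop_front() skips the node's entire subtree, children() dives one level deeper.

```python
# Toy model of the proposed range: events stand in for the buffered stream.
class NodeRange:
    def __init__(self, events, pos=0):
        self.events, self.pos = events, pos

    def empty(self):
        return self.pos >= len(self.events) or self.events[self.pos][0] == "end"

    def front(self):
        return self.events[self.pos]       # ("start", tag) or ("text", s)

    def pop_front(self):                   # skip the whole element
        kind = self.events[self.pos][0]
        self.pos += 1
        if kind == "start":
            depth = 1
            while depth:
                k = self.events[self.pos][0]
                depth += {"start": 1, "end": -1}.get(k, 0)
                self.pos += 1

    def children(self):                    # dive into the current element
        assert self.events[self.pos][0] == "start"
        return NodeRange(self.events, self.pos + 1)
```

In the real design pop_front would advance the underlying input buffer rather than an index, which is where the laziness pays off: skipping `<b>...</b>` never materialises its contents.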
Feb 04 2011
parent spir <denis.spir gmail.com> writes:
On 02/09/2011 01:16 AM, Tomek Sowiński wrote:
 Steven Schveighoffer napisał:

 The design I'm thinking is that the node iterator will own a buffer. One
 consequence is that the fields of the current node will point to the
 buffer akin to foreach(line; File.byLine), so in order to lift the input
 the user will have to dup (or process the node in-place). As new nodes
 will be overwritten on the same piece of memory, an important trait of
 the design emerges: cache intensity. Because of XML namespaces I think
 it is necessary for the buffer to contain the current node plus all its
 parents.

That might not scale well. For instance, if you are accessing the 1500th child element of a parent, doesn't that mean that the buffer must contain the full text for the previous 1499 elements in order to also contain the parent? Maybe I'm misunderstanding what you mean.

Let's talk on an example:

<a name="value">
  <b>
    Some Text 1
    <c2> <!-- HERE -->
      Some text 2
    </c2>
    Some Text 3
  </b>
</a>

The buffer of the iterator positioned HERE would be:

[Node a | Node b | Node c2]

Node c2 and all its parents are available for inspection. Node a's attribute is stored in the buffer, but not b's "Some Text 1" as it is c2's sibling; "Some Text 1" was available in the previous iteration, now it's overwritten by c2.

To get to "Some text 2" let's advance the iterator in depth to get:

[Node a | Node b | Node c2 | Text node "Some text 2"]

Advancing it once more we get to:

[Node a | Node b | Text node "Some text 3"]

So "Some text 3" is written where c2 and the text node 2 used to be.
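The mark-release discipline in that example boils down to a stack: ancestors stay put, and a new sibling reuses the slot its predecessor released. A minimal sketch (illustrative Python, hypothetical names):

```python
# Ancestor buffer: enter() on reaching a node, leave() when moving past it.
# Siblings overwrite each other's slot; parents remain for namespace lookup.
class AncestorBuffer:
    def __init__(self):
        self.buf = []
    def enter(self, node):
        self.buf.append(node)
    def leave(self):
        self.buf.pop()
    def snapshot(self):
        return list(self.buf)

b = AncestorBuffer()
b.enter("Node a"); b.enter("Node b"); b.enter('Text "Some Text 1"')
b.leave()                      # move past the text node...
b.enter("Node c2")             # ...c2 reuses the same slot
state_at_here = b.snapshot()   # [Node a | Node b | Node c2]
```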

That's very similar to what I was thinking of. What I wonder is whether, in your buffer representations above, 'Node x' represents an instantiated node, or collected data necessary to instantiate later -- once the current part of the source is validated (proved correct).

What I mean is, the whole <a> node will be validated only when </a> is matched, so that if your parsing process fails on the way (and/or it was just following a wrong parsing path), then all nodes instantiated along the way are just to throw away, aren't they (unless some memoisation may be useful)?

Possibly all I say here is just stupid, depending on the parsing algo and the nature of the grammar, and also how costly Node creation is. In my case, all of this seems relevant. Thus, I'm thinking of just collecting data along the way (rather easy & far cheaper than node construction), and (recursively) instantiating only once a section is validated. (This needs to be tried concretely -- maybe there are issues I'm not yet aware of.)

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Feb 08 2011
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-02-04 16:02:39 -0500, Tomek Sowiński <just ask.me> said:

 I am now intensely accumulating information on how to go about creating 
 a high-performance parser as it quickly became clear that my old one 
 won't deliver. And if anything is clear, it is that memory is the key.
 
 One way is the slicing approach mentioned on this NG, notably used by 
 RapidXML. I already contacted Marcin (the author) to ensure that using 
 solutions inspired by his lib is OK with him; it is. But I don't think 
 I'll go this way. One reason is, surprisingly, performance. RapidXML 
 cannot start parsing until the entire document is loaded and ready as a 
 random-access string. Then it's blazingly fast but the time for I/O has 
 already elapsed. Besides, as Marcin himself said, we need a 100% 
 W3C-compliant implementation and RapidXML isn't one.
 
 I think a much more fertile approach is to operate on a forward range, 
 perhaps assuming bufferized input. That way I can start parsing as soon 
 as the first buffer gets filled. Not to mention that the end result 
 will use much less memory. Plenty of the XML data stream is indents, 
 spaces, and markup -- there's no reason to copy all this into memory.
 
 To sum up, I believe memory and overlapping I/O latencies with parsing 
 effort are pivotal.

I agree it's important, especially when receiving XML over the network, but I also think it's important to be able to support slicing. Imagine all the memory you could save by just making slices of a memory-mapped file.

The difficulty is to support both models: the input range model which requires copying the strings and the slicing model where you're just taking slices of a string.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
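The two models Michel contrasts can be shown side by side. This is an illustrative Python sketch with made-up names (`text_slice`, `text_copy`); the slicing model hands out views into the source, the input-range model lifts the bytes out before the buffer is reused.

```python
doc = b'<a name="value">hello</a>'

def text_slice(src):      # slicing model: a zero-copy view tied to src
    i, j = src.index(b">") + 1, src.rindex(b"<")
    return memoryview(src)[i:j]

def text_copy(src):       # input-range model: copy survives buffer reuse
    i, j = src.index(b">") + 1, src.rindex(b"<")
    return bytes(src[i:j])

assert bytes(text_slice(doc)) == text_copy(doc) == b"hello"
```

With a memory-mapped file `src` costs nothing to keep around, so the slice wins; with a transient network buffer only the copy is safe -- which is exactly the design tension.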
Feb 04 2011
next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-02-04 16:47:26 -0500, Tomek Sowiński <just ask.me> said:

 Michel Fortin napisał:
 
 I agree it's important, especially when receiving XML over the network,
 but I also think it's important to be able to support
 slicing. Imagine all the memory you could save by just making slices of
 a memory-mapped file.
 
 The difficulty is to support both models: the input range model which
 requires copying the strings and the slicing model where you're just
 taking slices of a string.

These are valid concerns. Yet, in the overwhelming majority XML documents come from the hard drive and network -- these are the places we need to drill. I fear that trying to cover every remote use case will render the library incomprehensible.

A memory-mapped file comes from the hard drive too.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Feb 04 2011
prev sibling parent Roman Ivanov <isroman.DEL ETE.km.ru> writes:
On 2/4/2011 4:47 PM, Tomek Sowiński wrote:
 Michel Fortin napisał:
 
 I agree it's important, especially when receiving XML over the network, 
 but I also think it's important to be able to support 
 slicing. Imagine all the memory you could save by just making slices of 
 a memory-mapped file.

 The difficulty is to support both models: the input range model which 
 requires copying the strings and the slicing model where you're just 
 taking slices of a string.

These are valid concerns. Yet, in the overwhelming majority XML documents come from the hard drive and network -- these are the places we need to drill. I fear that trying to cover every remote use case will render the library incomprehensible.

This reminds me of some things I was thinking about when I worked with some XML-heavy apps in Java and experimented with writing parsers for my own simple markup languages.

If you have the entire XML string loaded in memory, the most time-consuming part of parsing it is probably going to be allocation of node objects. So it makes sense to do a quick scan of the char array, and generate just the root node, which would lazily allocate sub-nodes upon access.

I can see several different implementations of a high-performance parser, depending on the typical use case. Do you want to work efficiently with lots of small files or one huge file? Deeply nested or mostly flat? Coming from memory or a stream of characters?

Problem is, with lazy parsing XML nodes would need to be able to call upon the parser that created them. Perhaps it would be possible to specify some kind of generic XML node interface and allow people to use/generate different implementations depending on what they need?
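The lazy-root idea sketched under heavy simplification (illustrative Python, hypothetical names, toy XML subset: no attributes, text, or self-closing tags): nothing is allocated up front, and child nodes are materialised the first time `children` is touched.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")

def split_top_level(text):
    """Split element content into top-level child element strings."""
    parts, depth, start = [], 0, 0
    for m in TAG.finditer(text):
        if m.group(1) == "":             # opening tag
            if depth == 0:
                start = m.start()
            depth += 1
        else:                            # closing tag
            depth -= 1
            if depth == 0:
                parts.append(text[start:m.end()])
    return parts

class LazyNode:
    def __init__(self, outer):
        m = TAG.match(outer)
        self.tag = m.group(2)
        self.inner = outer[m.end():outer.rindex("<")]
        self._children = None            # nothing parsed yet

    @property
    def children(self):                  # allocate sub-nodes on first access
        if self._children is None:
            self._children = [LazyNode(s) for s in split_top_level(self.inner)]
        return self._children
```

The node-must-reach-the-parser problem Roman raises shows up here too: each LazyNode keeps its slice of the text, which is its private handle back into the "parser".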
Feb 07 2011
prev sibling next sibling parent Tomek Sowiński <just ask.me> writes:
Michel Fortin napisał:

 I agree it's important, especially when receiving XML over the network,
 but I also think it's important to be able to support
 slicing. Imagine all the memory you could save by just making slices of
 a memory-mapped file.

 The difficulty is to support both models: the input range model which
 requires copying the strings and the slicing model where you're just
 taking slices of a string.

These are valid concerns. Yet, in the overwhelming majority XML documents come from the hard drive and network -- these are the places we need to drill. I fear that trying to cover every remote use case will render the library incomprehensible.

-- 
Tomek
Feb 04 2011
prev sibling next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Steven Schveighoffer <schveiguy yahoo.com> wrote:

 Here is how I would approach it (without doing any research).

 First, we need a buffered I/O system where you can easily access and  
 manipulate the buffer.  I have proposed one a few months ago in this NG.

 Second, I'd implement the XML lib as a range where "front()" gives you  
 an XMLNode.  If the XMLNode is an element, it will have eager access to  
 the element tag, and lazy access to the attributes and the sub-nodes.   
 Each XMLNode will provide a forward range for the child nodes.

 Thus you can "skip" whole elements in the stream by popFront'ing a  
 range, and dive deeper via accessing the nodes of the range.

 I'm unsure how well this will work, or if you can accomplish all of it  
 without reallocation (in particular, you may need to store the element  
 information, maybe via a specialized member function?).

Question: For the lazily computed attributes and subnodes, will accessing one element cause all elements to be computed? Same goes for getting the number of elements.

Also, can this be efficiently combined with mmapping? What I sorta imagine is a kind of lazy slice: It determines whether it ends within this page, and if not, does not progress past that page until asked to do so.

-- 
Simen
Feb 04 2011
prev sibling next sibling parent Tomek Sowiński <just ask.me> writes:
Steven Schveighoffer napisał:

 Here is how I would approach it (without doing any research).

 First, we need a buffered I/O system where you can easily access and
 manipulate the buffer.  I have proposed one a few months ago in this NG.

 Second, I'd implement the XML lib as a range where "front()" gives you an
 XMLNode.  If the XMLNode is an element, it will have eager access to the
 element tag, and lazy access to the attributes and the sub-nodes.  Each
 XMLNode will provide a forward range for the child nodes.

 Thus you can "skip" whole elements in the stream by popFront'ing a range,
 and dive deeper via accessing the nodes of the range.

 I'm unsure how well this will work, or if you can accomplish all of it
 without reallocation (in particular, you may need to store the element
 information, maybe via a specialized member function?).

Heh, yesterday when I couldn't sleep I was sketching the design. I converged to pretty much the same concept, so your comment is reassuring :).

The design I'm thinking of is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes will be overwritten on the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents. Namespaces are the technical reason but having access to the path all the way to the root node is of value, regardless. This suggests mark-release memory management. The buffer will have to be long enough to fit the deepest tag sequence: theoretically infinite, not that much in practice. Like I said, the buffer will be owned by the iterator so probably deterministic deallocation is possible when the processing is done.

One drawback is that you won't know you're dealing with a well-formed DOM until the closing tag comes. If it doesn't, it'll of course throw, but the malformed DOM may already have been digested. So providing some rollback possibility is up to the user.

-- 
Tomek
Feb 04 2011
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Sat, 05 Feb 2011 00:02:39 +0300, Tomek Sowiński <just ask.me> wrote:

 I am now intensely accumulating information on how to go about creating
 a high-performance parser as it quickly became clear that my old one
 won't deliver. And if anything is clear, it is that memory is the key.

 One way is the slicing approach mentioned on this NG, notably used by
 RapidXML. I already contacted Marcin (the author) to ensure that using
 solutions inspired by his lib is OK with him; it is. But I don't think
 I'll go this way. One reason is, surprisingly, performance. RapidXML
 cannot start parsing until the entire document is loaded and ready as a
 random-access string. Then it's blazingly fast but the time for I/O has
 already elapsed. Besides, as Marcin himself said, we need a 100%
 W3C-compliant implementation and RapidXML isn't one.

 I think a much more fertile approach is to operate on a forward range,
 perhaps assuming bufferized input. That way I can start parsing as soon
 as the first buffer gets filled. Not to mention that the end result will
 use much less memory. Plenty of the XML data stream is indents, spaces,
 and markup -- there's no reason to copy all this into memory.

 To sum up, I believe memory and overlapping I/O latencies with parsing
 effort are pivotal.

 Please comment on this.

I don't have much experience with XML, but as far as I can tell a DOM parser pretty much needs to store the entire file in memory. You can also load and parse files asynchronously.

On the contrary, SAX parsers don't require having the entire file in memory, but that's a completely different approach to XML parsing.

I'd also recommend you take a look at pugixml, which has been developed and supported by my co-worker since 2006. It is free (MIT license), small, lightweight, fast, clean and has very good documentation.
Feb 04 2011
prev sibling next sibling parent Tomek Sowiński <just ask.me> writes:
 Steven Schveighoffer napisał:

 Here is how I would approach it (without doing any research).

 First, we need a buffered I/O system where you can easily access and
 manipulate the buffer.  I have proposed one a few months ago in this NG.

 Second, I'd implement the XML lib as a range where "front()" gives you an
 XMLNode.  If the XMLNode is an element, it will have eager access to the
 element tag, and lazy access to the attributes and the sub-nodes.  Each
 XMLNode will provide a forward range for the child nodes.

 Thus you can "skip" whole elements in the stream by popFront'ing a range,
 and dive deeper via accessing the nodes of the range.

 I'm unsure how well this will work, or if you can accomplish all of it
 without reallocation (in particular, you may need to store the element
 information, maybe via a specialized member function?).

 Heh, yesterday when I couldn't sleep I was sketching the design. I converged to pretty much the same concept, so your comment is reassuring :).

 The design I'm thinking of is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes will be overwritten on the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents. Namespaces are the technical reason but having access to the path all the way to the root node is of value, regardless. This suggests mark-release memory management. The buffer will have to be long enough to fit the deepest tag sequence: theoretically infinite, not that much in practice. Like I said, the buffer will be owned by the iterator so probably deterministic deallocation is possible when the processing is done.

 One drawback is that you won't know you're dealing with a well-formed DOM until the closing tag comes. If it doesn't, it'll of course throw, but the malformed DOM may already have been digested. So providing some rollback possibility is up to the user.

Oh, and the direction of iteration (deeper/farther) will of course be controllable in the fashion you presented.

-- 
Tomek
Feb 04 2011
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2011-02-04 22:02, Tomek Sowiński wrote:
 I am now intensely accumulating information on how to go about creating a
high-performance parser as it quickly became clear that my old one won't
deliver. And if anything is clear, it is that memory is the key.

 One way is the slicing approach mentioned on this NG, notably used by
RapidXML. I already contacted Marcin (the author) to ensure that using
solutions inspired by his lib is OK with him; it is. But I don't think I'll go
this way. One reason is, surprisingly, performance. RapidXML cannot start
parsing until the entire document is loaded and ready as a random-access
string. Then it's blazingly fast but the time for I/O has already elapsed.
Besides, as Marcin himself said, we need a 100% W3C-compliant implementation
and RapidXML isn't one.

 I think a much more fertile approach is to operate on a forward range, perhaps
assuming bufferized input. That way I can start parsing as soon as the first
buffer gets filled. Not to mention that the end result will use much less
memory. Plenty of the XML data stream is indents, spaces, and markup -- there's
no reason to copy all this into memory.

 To sum up, I believe memory and overlapping I/O latencies with parsing effort
are pivotal.

 Please comment on this.

I don't think it's up to the parser to decide where the content comes from. It should be able to handle the whole content of an XML file in memory.

-- 
/Jacob Carlborg
Feb 06 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 04 Feb 2011 17:03:08 -0500, Simen kjaeraas  
<simen.kjaras gmail.com> wrote:

 Steven Schveighoffer <schveiguy yahoo.com> wrote:

 Here is how I would approach it (without doing any research).

 First, we need a buffered I/O system where you can easily access and  
 manipulate the buffer.  I have proposed one a few months ago in this NG.

 Second, I'd implement the XML lib as a range where "front()" gives you  
 an XMLNode.  If the XMLNode is an element, it will have eager access to  
 the element tag, and lazy access to the attributes and the sub-nodes.   
 Each XMLNode will provide a forward range for the child nodes.

 Thus you can "skip" whole elements in the stream by popFront'ing a  
 range, and dive deeper via accessing the nodes of the range.

 I'm unsure how well this will work, or if you can accomplish all of it  
 without reallocation (in particular, you may need to store the element  
 information, maybe via a specialized member function?).

Question: For the lazily computed attributes and subnodes, will accessing one element cause all elements to be computed? Same goes for getting the number of elements.

The goal is to avoid double-buffering data. So you are using the buffer of the input stream to contain all data. So, advancing to the 'next' element/node/attribute makes the previous element/node/attribute invalid (i.e. the buffer is reused).

The trick is to make it seem like the node is fully there without actually reading the stream until you need it (hence the lazy part), because reading the entire node means reading the entire file (in the case of the root element).
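The invalidation semantics Steve describes -- the same behaviour as File.byLine -- can be demonstrated directly. Illustrative Python; `records` is a made-up stand-in for the buffered input range.

```python
# One reused buffer: each yielded view is only valid until the next record
# is read. Keep a view across iterations and it silently shows new bytes;
# dup (copy) if you need to keep the data.
def records(chunks, size=8):
    buf = bytearray(size)
    for c in chunks:
        n = len(c)
        buf[:n] = c
        yield memoryview(buf)[:n]        # view into the shared buffer

views = []
copies = []
for v in records([b"first", b"secnd"]):
    views.append(v)                      # dangling: buffer gets reused
    copies.append(bytes(v))              # dup: safe to keep
```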
 Also, can this be efficiently combined with mmapping? What I sorta  
 imagine
 is a kind of lazy slice: It determines whether it ends within this page,  
 and
 if not, does not progress past that page until asked to do so.

mmapping would make things more accessible, but the common denominator is not mmap. If it's supported as a special case, then maybe it can offer some interesting features, but something like mmap can't be done for, say, a network stream.

-Steve
Feb 07 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just ask.me> wrote:

 Steven Schveighoffer napisał:

 Here is how I would approach it (without doing any research).

 First, we need a buffered I/O system where you can easily access and
 manipulate the buffer.  I have proposed one a few months ago in this NG.

 Second, I'd implement the XML lib as a range where "front()" gives you  
 an
 XMLNode.  If the XMLNode is an element, it will have eager access to the
 element tag, and lazy access to the attributes and the sub-nodes.  Each
 XMLNode will provide a forward range for the child nodes.

 Thus you can "skip" whole elements in the stream by popFront'ing a  
 range,
 and dive deeper via accessing the nodes of the range.

 I'm unsure how well this will work, or if you can accomplish all of it
 without reallocation (in particular, you may need to store the element
 information, maybe via a specialized member function?).

Heh, yesterday when I couldn't sleep I was sketching the design. I converged to pretty much the same concept, so your comment is reassuring :).

The design I'm thinking of is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes will be overwritten on the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents.

That might not scale well. For instance, if you are accessing the 1500th child element of a parent, doesn't that mean that the buffer must contain the full text for the previous 1499 elements in order to also contain the parent? Maybe I'm misunderstanding what you mean.

I would start out with a non-compliant parser, but one that allocates nothing beyond the I/O buffer, one that simply parses lazily and can be used as well as a SAX parser. Then see how many extra allocations we need to get it to be compliant. Then, one can choose the compliancy level based on what performance penalties one is willing to incur.

-Steve
Feb 07 2011
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 07 Feb 2011 07:40:30 -0500, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just ask.me> wrote:

 Steven Schveighoffer napisał:

 Here is how I would approach it (without doing any research).

 First, we need a buffered I/O system where you can easily access and
 manipulate the buffer.  I have proposed one a few months ago in this  
 NG.

 Second, I'd implement the XML lib as a range where "front()" gives you  
 an
 XMLNode.  If the XMLNode is an element, it will have eager access to  
 the
 element tag, and lazy access to the attributes and the sub-nodes.  Each
 XMLNode will provide a forward range for the child nodes.

 Thus you can "skip" whole elements in the stream by popFront'ing a  
 range,
 and dive deeper via accessing the nodes of the range.

 I'm unsure how well this will work, or if you can accomplish all of it
 without reallocation (in particular, you may need to store the element
 information, maybe via a specialized member function?).

Heh, yesterday when I couldn't sleep I was sketching the design. I converged to pretty much the same concept, so your comment is reassuring :).

The design I'm thinking of is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes will be overwritten on the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents.

That might not scale well. For instance, if you are accessing the 1500th child element of a parent, doesn't that mean that the buffer must contain the full text for the previous 1499 elements in order to also contain the parent? Maybe I'm misunderstanding what you mean.

I would start out with a non-compliant parser, but one that allocates nothing beyond the I/O buffer, one that simply parses lazily and can be used as well as a SAX parser. Then see how many extra allocations we need to get it to be compliant. Then, one can choose the compliancy level based on what performance penalties one is willing to incur.

-Steve

I would consider a tokenizer which can be used for SAX style parsing to be a key feature of std.xml. I know it was considered very important when I was gathering requirements for my std.JSON re-write.
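A pull tokenizer of the kind Robert means might look like this in outline. Purely an illustrative Python sketch with a hypothetical `tokens` function (no attributes, comments, CDATA, or entities); callers pull (kind, value) events one at a time, which is enough to build SAX, DOM, or anything in between on top.

```python
def tokens(xml):
    """Yield ('start', tag) / ('end', tag) / ('text', s) events on demand."""
    i = 0
    while i < len(xml):
        if xml[i] == "<":
            j = xml.index(">", i)
            body = xml[i + 1:j]
            yield (("end", body[1:]) if body.startswith("/") else ("start", body))
            i = j + 1
        else:
            j = xml.find("<", i)
            j = len(xml) if j == -1 else j
            if xml[i:j].strip():                 # skip indents and spaces
                yield ("text", xml[i:j].strip())
            i = j
```

Because it is a generator, nothing past the last pulled event is ever scanned -- the laziness the thread keeps coming back to.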
Feb 07 2011
prev sibling next sibling parent Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Tomek Sowiński wrote:
 I am now intensely accumulating information on how to go about creating a
high-performance parser as it quickly became clear that my old one won't
deliver. And if anything is clear, it is that memory is the key.

 One way is the slicing approach mentioned on this NG, notably used by
RapidXML. I already contacted Marcin (the author) to ensure that using
solutions inspired by his lib is OK with him; it is. But I don't think I'll go
this way. One reason is, surprisingly, performance. RapidXML cannot start
parsing until the entire document is loaded and ready as a random-access
string. Then it's blazingly fast but the time for I/O has already elapsed.
Besides, as Marcin himself said, we need a 100% W3C-compliant implementation
and RapidXML isn't one.

 I think a much more fertile approach is to operate on a forward range, perhaps
assuming bufferized input. That way I can start parsing as soon as the first
buffer gets filled. Not to mention that the end result will use much less
memory. Plenty of the XML data stream is indents, spaces, and markup -- there's
no reason to copy all this into memory.

 To sum up, I believe memory and overlapping I/O latencies with parsing effort
are pivotal.

 Please comment on this.

A few years ago I needed to write my own parser in Delphi for my XMPP (Jabber) client and server. In XMPP you get a socket-streamed XML document with xml elements as protocol messages ("XMPP stanzas"). The problem I had was that no Delphi parser had hybrid SAX/DOM support, i.e. I wanted to parse xml nodes like SAX, but when I received a whole message I wanted to return it as an XML Element (like in DOM). This way I could easily process incoming messages -- now it's accomplishable with pull parsers.

I think std needs both SAX/pull and DOM parsers.

For DOM, if the whole document is in memory, maybe this approach could be advantageous:
http://en.wikipedia.org/wiki/VTD-XML
http://vtd-xml.sourceforge.net/
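The hybrid Piotr wanted -- stream events like SAX, but inflate one complete element ("stanza") into a small DOM tree when its end tag arrives -- can be sketched on top of any event source. Illustrative Python; `read_element` and the dict-based node shape are hypothetical.

```python
def read_element(events, tag=None):
    """Consume one complete element from an iterator of
    ('start', tag) / ('end', tag) / ('text', s) events and
    return it as a small DOM-like tree."""
    if tag is None:
        kind, tag = next(events)
        assert kind == "start"
    node = {"tag": tag, "children": [], "text": ""}
    for kind, value in events:
        if kind == "start":
            node["children"].append(read_element(events, value))
        elif kind == "text":
            node["text"] += value
        else:                        # ('end', ...): subtree complete
            return node
```

Over an XMPP-style socket one would call read_element once per incoming stanza, leaving the rest of the (endless) document stream untouched.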
Feb 08 2011
prev sibling next sibling parent Tomek Sowiński <just ask.me> writes:
Steven Schveighoffer napisał:

 The design I'm thinking is that the node iterator will own a buffer. One
 consequence is that the fields of the current node will point to the
 buffer akin to foreach(line; File.byLine), so in order to lift the input
 the user will have to dup (or process the node in-place). As new nodes
 will be overwritten on the same piece of memory, an important trait of
 the design emerges: cache intensity. Because of XML namespaces I think
 it is necessary for the buffer to contain the current node plus all its
 parents.

 That might not scale well. For instance, if you are accessing the 1500th
 child element of a parent, doesn't that mean that the buffer must contain
 the full text for the previous 1499 elements in order to also contain the
 parent?

 Maybe I'm misunderstanding what you mean.

Let's talk on an example:

<a name="value">
  <b>
    Some Text 1
    <c2> <!-- HERE -->
      Some text 2
    </c2>
    Some Text 3
  </b>
</a>

The buffer of the iterator positioned HERE would be:

[Node a | Node b | Node c2]

Node c2 and all its parents are available for inspection. Node a's attribute is stored in the buffer, but not b's "Some Text 1" as it is c2's sibling; "Some Text 1" was available in the previous iteration, now it's overwritten by c2.

To get to "Some text 2" let's advance the iterator in depth to get:

[Node a | Node b | Node c2 | Text node "Some text 2"]

Advancing it once more we get to:

[Node a | Node b | Text node "Some text 3"]

So "Some text 3" is written where c2 and the text node 2 used to be.

The element type of the range would always be the child, parents available through pointers:

foreach (node; xmlRange) {
    doStuff(node);
    if (Node* parent = node.parent)
        doOtherStuff(parent);
}

Having no access to siblings is quite limiting but the iterator can form an efficient (zero-allocation) basis on which more convenient schemes are built. It's still just brain-storming, though. I fear there's something that'll make the whole idea crash & burn.
 I would start out with a non-compliant parser, but one that allocates
 nothing beyond the I/O buffer, one that simply parses lazily and can be
 used as well as a SAX parser.  Then see how many extra allocations we need
 to get it to be compliant.  Then, one can choose the compliancy level
 based on what performance penalties one is willing to incur.

Yeah, 100% compliance is a long way.

-- 
Tomek
Feb 08 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 08 Feb 2011 19:16:37 -0500, Tomek Sowiński <just ask.me> wrote:

 Steven Schveighoffer napisał:

 The design I'm thinking is that the node iterator will own a buffer. One
 consequence is that the fields of the current node will point to the
 buffer akin to foreach(line; File.byLine), so in order to lift the input
 the user will have to dup (or process the node in-place). As new nodes
 will be overwritten on the same piece of memory, an important trait of
 the design emerges: cache intensity. Because of XML namespaces I think
 it is necessary for the buffer to contain the current node plus all its
 parents.

 That might not scale well. For instance, if you are accessing the 1500th
 child element of a parent, doesn't that mean that the buffer must contain
 the full text for the previous 1499 elements in order to also contain the
 parent?

 Maybe I'm misunderstanding what you mean.

 Let's talk on an example:

 <a name="value">
     <b>
         Some Text 1
         <c2> <!-- HERE -->
             Some text 2
         </c2>
         Some Text 3
     </b>
 </a>

 The buffer of the iterator positioned HERE would be:

 [Node a | Node b | Node c2]

OK, so you mean a buffer other than the I/O buffer.  This means double
buffering data.  I was thinking of a solution that allows simply using
the I/O buffer for parsing.  I think this is one of the keys to Tango's
xml performance.

-Steve
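A minimal sketch of what parsing straight out of the I/O buffer looks like (readTagName is a made-up helper, not anyone's actual API): the parser hands out slices of the buffer itself, so nothing is allocated while scanning.

```d
// Zero-copy tag extraction: the returned name is a slice of the I/O buffer,
// so no allocation happens during parsing.
const(char)[] readTagName(const(char)[] buf, ref size_t pos)
{
    assert(buf[pos] == '<');
    auto start = ++pos;
    while (pos < buf.length && buf[pos] != '>' && buf[pos] != ' ')
        ++pos;
    return buf[start .. pos];  // slice, not a copy
}

void main()
{
    const(char)[] ioBuffer = `<a name="value"><b>text</b></a>`;
    size_t pos = 0;
    auto tag = readTagName(ioBuffer, pos);
    assert(tag == "a");
    // The slice aliases the buffer: refilling the buffer would invalidate
    // it, which is why a user keeping data past one iteration must .dup it.
    assert(tag.ptr is ioBuffer.ptr + 1);
}
```

The catch, as discussed above, is exactly buffer lifetime: once the I/O layer refills the buffer, every outstanding slice is garbage.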
Feb 09 2011
prev sibling next sibling parent Tomek =?ISO-8859-2?Q?Sowi=F1ski?= <just ask.me> writes:
Steven Schveighoffer napisał:

 OK, so you mean a buffer other than the I/O buffer.  This means double
 buffering data.  I was thinking of a solution that allows simply using the
 I/O buffer for parsing.  I think this is one of the keys to Tango's xml
 performance.

I'd be glad to hear what's your idea. I think they are convergent. In
mine, the I/O could be asked to dump data to the iterator's buffer at a
given position (right after previous nodes), then the iterator forms a
node out of raw data. Some moving would be done but all within the cached
buffer so should be quick. I guess it's as far as I can predict
performance in a newsgroup post. ;-) Gotta write some code and whip out
the stopwatch, then we'll see.

-- 
Tomek
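A tiny sketch of that dump-after-previous-nodes scheme (NodeBuffer, fillFrom and rewindTo are invented names, not code from either library): the I/O layer appends raw input right after the bytes still owned by the current node chain, and a finished sibling is reclaimed by rewinding the write position.

```d
// All names here are invented for the sketch; the point is that sibling
// nodes reuse the same storage inside one cached buffer.
struct NodeBuffer
{
    char[] data;
    size_t used;   // bytes occupied by the current node chain

    // Hook the I/O layer would call to dump raw input after previous nodes.
    void fillFrom(const(char)[] chunk)
    {
        if (used + chunk.length > data.length)
            data.length = (used + chunk.length) * 2;   // rare resize
        data[used .. used + chunk.length] = chunk[];
        used += chunk.length;
    }

    // Popping a finished sibling rewinds `used`; the next node overwrites it.
    void rewindTo(size_t mark) { used = mark; }
}

void main()
{
    auto nb = NodeBuffer(new char[8], 0);
    nb.fillFrom("<a>");
    auto mark = nb.used;   // the <b> subtree starts here
    nb.fillFrom("<b>");
    nb.rewindTo(mark);     // <b> fully processed, reclaim its bytes...
    nb.fillFrom("<c2>");   // ...so its sibling <c2> lands in the same spot
    assert(nb.data[0 .. nb.used] == "<a><c2>");
}
```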
Feb 09 2011
prev sibling parent Michael Rynn <michaelrynn optusnet.com.au> writes:
On Mon, 07 Feb 2011 10:37:46 -0500, Robert Jacques wrote:

 On Mon, 07 Feb 2011 07:40:30 -0500, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:
 
 On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just ask.me> wrote:

 Steven Schveighoffer napisał:

 Here is how I would approach it (without doing any research).

 First, we need a buffered I/O system where you can easily access and
 manipulate the buffer.  I have proposed one a few months ago in this
 NG.

 Second, I'd implement the XML lib as a range where "front()" gives you an
 XMLNode.  If the XMLNode is an element, it will have eager access to the
 element tag, and lazy access to the attributes and the sub-nodes.
 Each XMLNode will provide a forward range for the child nodes.

 Thus you can "skip" whole elements in the stream by popFront'ing a range,
 and dive deeper via accessing the nodes of the range.



 
 I would consider a tokenizer which can be used for SAX style parsing to
 be a key feature of std.xml. I know it was considered very important
 when I was gathering requirements for my std.JSON re-write.

XML parser: /dsource/xmlp/trunk/std

I have experimented with various means to balance efficiency and
flexibility with XML parsing.

The core parsing uses an ItemReturn struct. This returns transient
pointers to reused buffers, so that there is reduced memory buffer
reallocation for just churning through the XML document.

struct ItemReturn
{
    ItemType type = ItemType.RET_NULL;
    char[] scratch;
    char[][] names;
    char[][] values;
}

The central parse method fills the ItemReturn with transient tag names,
and pointers to names and values, somewhat like a SAX parser.

To measure performance components, take the throughput of making an XML
document using a linked DOM tree structure as 100%, with validation and
attribute normalisation. With this implementation, using buffers for file
or string sources, I get the following breakdown of processing. This is
done on books.xml. Other examples of documents, with different structure,
entities, DTD, schema, namespaces, will differ.

Input overhead: the throughput of examining each single Unicode character
in the document, as a dchar value, is about 12-15% of time. So there is
not a relatively great cost in the input buffering.

Tag, attribute and content throughput: parsing and filling the ItemReturn
struct for each parse method call, called for every identifiable XML
element token in the document sequence, is about 60% of time, without
doing anything with the result. This also includes the input overhead. No
DOM structure is assumed, and there is no recursion. The general sort of
work done is keeping track of state, and assembling and returning the
various types of tokens.

To actually build a full DOM, without doing much in the way of validation
and attribute normalisation, takes up to about 86% of total time. This
includes converting the returned transient buffered values of tags,
attribute names, values, and content into strings, and the creation and
linking of DOM nodes. This involves some recursive method calls that
mirror the XML document structure.
Some time and memory seem to be saved by aliasing tag and attribute names
using an AA. This takes about 85% of the full job. Additional validation
and attribute normalisation take more time.
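The name-aliasing idea can be sketched as string interning through an AA (the function and pool names are invented; only the idea comes from the post): each distinct tag/attribute name is copied out of the transient parse buffer once, and every later occurrence is aliased to that single string.

```d
// Intern a transient name: one .idup per distinct name, shared thereafter.
string intern(const(char)[] transientName, ref string[string] pool)
{
    // The cast is for AA lookup only; the key never escapes as mutable.
    if (auto p = cast(string) transientName in pool)
        return *p;
    auto s = transientName.idup;   // one allocation per distinct name
    pool[s] = s;
    return s;
}

void main()
{
    string[string] pool;
    char[] buf = "book".dup;   // transient, as if sliced from a parse buffer
    auto a = intern(buf, pool);
    auto b = intern(buf, pool);
    assert(a is b);            // second lookup reuses the first copy
    assert(pool.length == 1);
}
```

For documents with many repeated element names (the common case), this trades an AA lookup per name for a large cut in duplicate string allocations.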
Feb 11 2011