digitalmars.D.announce - simple sax-style xml parser

ketmar <ketmar ketmar.no-ip.org> writes:
i wrote a simple sax-style xml parser[1][2] for my own needs, and 
decided to share it. it has two interfaces: the `xmparse()` function, 
which simply calls callbacks without any validation or encoding 
conversion, and the `SaxyEx` class, which does some validation, 
converts content to utf-8 (from anything std.encoding supports), 
and calls callbacks when the given path is triggered.
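
For readers unfamiliar with the sax style, here is a toy sketch of the idea: the scanner walks the input once and reports each piece through a callback instead of building a tree. Everything in it (`Handlers`, `scan`, `traceEvents`) is hypothetical and written for illustration; it is not the API of saxy.d, and it handles no attributes, comments, CDATA, entities or validation.

```d
import std.string : indexOf;

// Hypothetical callback set; the names are illustrative, not saxy's API.
struct Handlers
{
    void delegate(const(char)[] name) onOpen;
    void delegate(const(char)[] name) onClose;
    void delegate(const(char)[] text) onText;
}

// Toy scanner: splits the input into tags and text and reports each piece.
void scan(const(char)[] xml, Handlers h)
{
    while (xml.length)
    {
        if (xml[0] == '<')
        {
            immutable end = xml.indexOf('>');
            if (end < 0) break; // unterminated tag: stop scanning
            auto tag = xml[1 .. end];
            if (tag.length && tag[0] == '/') h.onClose(tag[1 .. $]);
            else h.onOpen(tag);
            xml = xml[end + 1 .. $];
        }
        else
        {
            auto end = xml.indexOf('<');
            if (end < 0) end = xml.length; // trailing text without a tag after it
            h.onText(xml[0 .. end]);
            xml = xml[end .. $];
        }
    }
}

// Records the events as strings, copying every slice it receives.
string[] traceEvents(const(char)[] xml)
{
    string[] events;
    Handlers h;
    h.onOpen  = (const(char)[] n) { events ~= ("open:" ~ n).idup; };
    h.onClose = (const(char)[] n) { events ~= ("close:" ~ n).idup; };
    h.onText  = (const(char)[] t) { events ~= ("text:" ~ t).idup; };
    scan(xml, h);
    return events;
}
```

With this sketch, `traceEvents("<a>hi</a>")` yields `["open:a", "text:hi", "close:a"]`.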

it can parse any `char` input range, or std.stdio.File. parsing 
files is probably slightly faster than parsing ranges.

internally it extensively reuses the memory buffers it allocates, 
so it should not put much pressure on the GC.

you are expected to copy any data you want to keep inside the 
callbacks (don't just slice it, .dup it!), because the buffers are 
reused.
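
A minimal sketch of why a plain slice is not enough when the producer reuses its buffer. `feedTwice` and `collect` are made-up names that only mimic the buffer reuse described above; they are not taken from saxy.d. The example uses `.idup` to get a `string`; `.dup` works the same way when a mutable copy is wanted.

```d
// Hypothetical producer that reuses one buffer between callback calls.
void feedTwice(scope void delegate(const(char)[] text) onText)
{
    char[] buf = new char[](5);
    buf[] = "first";
    onText(buf);      // the callback sees "first"
    buf[] = "other";  // the same memory is overwritten for the next event
    onText(buf);      // the callback sees "other"
}

struct Collected
{
    const(char)[][] sliced; // slices into the producer's buffer
    string[] copied;        // independent copies
}

Collected collect()
{
    Collected r;
    feedTwice((const(char)[] text) {
        r.sliced ~= text;       // only remembers where the data was
        r.copied ~= text.idup;  // remembers the data itself
    });
    return r;
}
```

After `collect()`, `copied` holds `["first", "other"]`, while both entries of `sliced` read `"other"`, because they point at the same overwritten buffer.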

so far i'm using it to parse fb2 files, and it parses an 8.5 
megabyte utf-8 file (and creates internal reader structures, 
including splitting text into words and some other housekeeping) in 
one second on my i3 (with dmd -O, even without -inline and 
-release).

it is not really documented, but i think it is "intuitive". there 
are also some comments in source code; please, read those! ;-)

p.s. it decodes standard xml entities (&# and &#x probably work 
right only in utf-8 files, though), and understands CDATA and 
comments.
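
A rough sketch of what decoding the standard entities and numeric references can look like when the output is utf-8. `decodeEntity` is a hypothetical helper written for this example, not code from saxy.d; it takes the text between `&` and `;` and returns `null` for anything it does not recognise. Since a numeric reference expands to utf-8 bytes here, the result only fits cleanly into content that is itself utf-8, which may be the limitation mentioned above.

```d
import std.conv : to;
import std.utf : encode;

// Hypothetical helper: `name` is the text between '&' and ';'.
string decodeEntity(const(char)[] name)
{
    // the five entities predefined by XML
    switch (name)
    {
        case "lt":   return "<";
        case "gt":   return ">";
        case "amp":  return "&";
        case "apos": return "'";
        case "quot": return "\"";
        default: break;
    }

    // numeric references: "#65" is decimal, "#x41" is hexadecimal
    if (name.length < 2 || name[0] != '#') return null;
    immutable bool isHex = name[1] == 'x';
    const(char)[] digits = name[(isHex ? 2 : 1) .. $];
    if (digits.length == 0) return null;
    immutable uint radix = isHex ? 16 : 10;

    try
    {
        immutable uint code = to!uint(digits, radix);
        char[4] buf;
        // encode throws for values that are not valid code points
        immutable len = encode(buf, cast(dchar) code);
        return buf[0 .. len].idup;
    }
    catch (Exception)
    {
        return null; // not a number, out of range, or not a valid code point
    }
}
```

For example, `decodeEntity("amp")` gives `"&"`, and both `decodeEntity("#65")` and `decodeEntity("#x41")` give `"A"`.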


enjoy, and happy hacking!


[1] http://repo.or.cz/iv.d.git/blob_plain/HEAD:/saxy.d
[2] http://repo.or.cz/iv.d.git/tree/HEAD:/saxytests
Jul 19 2016
Chris <wendlec tcd.ie> writes:
On Wednesday, 20 July 2016 at 01:49:37 UTC, ketmar wrote:
 [...]
Thanks. I might actually use it. I need an XML parser and wrote a very basic and incomplete one for my needs.
Jul 29 2016
ketmar <ketmar ketmar.no-ip.org> writes:
On Friday, 29 July 2016 at 14:47:08 UTC, Chris wrote:
 Thanks. I might actually use it. I need an XML parser and wrote 
 a very basic and incomplete one for my needs.
great. don't forget to get the latest versions from those links. and feel free to report any bugs here, i'll try to fix them asap. ;-)
Jul 29 2016