www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - On the subject of an XML parser

reply solidstate1991 <laszloszeremi outlook.com> writes:
Since the XML parsing library was removed from Phobos, I'm 
thinking about either getting dlang-community/experimental.xml 
into a usable state, or write a completely new parser.

First I'd want some community input, and would like to hear from 
the users of lodo1995's library. Depending on some circumstances, 
I'll be losing my job next month, so I'll have some extra time on 
my hands (no money will be a tough thing), and even without that 
I'll try to pull it off somehow.
Aug 22 2022
next sibling parent Adam D Ruppe <destructionator gmail.com> writes:
On Monday, 22 August 2022 at 11:48:47 UTC, solidstate1991 wrote:
 First I'd want some community input
You might want to look into my dom.d too, it isn't suitable for all xml but it does a decent job on much of it.
Aug 22 2022
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Mon, Aug 22, 2022 at 11:48:47AM +0000, solidstate1991 via Digitalmars-d
wrote:
 Since the XML parsing library was removed from Phobos, I'm thinking
 about either getting dlang-community/experimental.xml into a usable
 state, or write a completely new parser.
 
 First I'd want some community input, and would like to hear from the
 users of lodo1995's library. Depending on some circumstances, I'll be
 losing my job next month, so I'll have some extra time on my hands (no
 money will be a tough thing), and even without that I'll try to pull
 it off somehow.
Why not use Jonathan's dxml? T -- "You know, maybe we don't *need* enemies." "Yeah, best friends are about all I can take." -- Calvin & Hobbes
Aug 22 2022
next sibling parent solidstate1991 <laszloszeremi outlook.com> writes:
On Monday, 22 August 2022 at 14:51:34 UTC, H. S. Teoh wrote:
 Why not use Jonathan's dxml?


 T
I might have a lot of free time in the near future, so I could write something for Phobos.
Aug 22 2022
prev sibling parent reply Chris Piker <chris hoopjump.com> writes:
On Monday, 22 August 2022 at 14:51:34 UTC, H. S. Teoh wrote:

 Why not use Jonathan's dxml?
I build a project off of dxml and found it to be quite nice, but I forgot to read the fine print (aka the template specialization). After a week of development when everything was working well, I tried to use it for parsing stdin and that's when the compiler let me know than an InputRange wasn't sufficient. Totally my fault, no knock against the author. So depending on the use case, dxml works quite well. For my own purposes I'll need to find/create a ForwardRange adapter for stdin or refactor my code to use another library.
Aug 22 2022
next sibling parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Mon, Aug 22, 2022 at 10:51:44PM +0000, Chris Piker via Digitalmars-d wrote:
 On Monday, 22 August 2022 at 14:51:34 UTC, H. S. Teoh wrote:
 
 Why not use Jonathan's dxml?
I build a project off of dxml and found it to be quite nice, but I forgot to read the fine print (aka the template specialization). After a week of development when everything was working well, I tried to use it for parsing stdin and that's when the compiler let me know than an InputRange wasn't sufficient. Totally my fault, no knock against the author. So depending on the use case, dxml works quite well. For my own purposes I'll need to find/create a ForwardRange adapter for stdin or refactor my code to use another library.
Do you need to parse xml on-the-fly, or would it work to just slurp the entire stdin into a buffer and then parse that? In the former case, you could probably just accumulate incoming stdin lines into a buffer and parse that (though you'll probably need a wrapper, otherwise dxml may terminate prematurely at the end of the current line). In the latter case it should be a simple matter of using std.algorithm.copy to read stdin into a buffer, which can then be parsed with dxml. T -- There's light at the end of the tunnel. It's the oncoming train.
Aug 22 2022
parent Chris Piker <chris hoopjump.com> writes:
On Monday, 22 August 2022 at 23:30:58 UTC, H. S. Teoh wrote:
 Do you need to parse xml on-the-fly, or would it work to just 
 slurp the entire stdin into a buffer and then parse that?
Thanks for the suggestion, though the purpose of the lib is to support stream based processing of very long time series datasets (> 2 TB occurs occasionally). Due to data volume, we typically work with binary formats, but there is a supported XML representation and I'd prefer to apply the same mentality when processing it so as not to break user expectations.
Sep 13 2022
prev sibling parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 8/22/22 15:51, Chris Piker wrote:

 So depending on the use case, dxml works quite well.  For my own
 purposes I'll need to find/create a ForwardRange adapter for stdin
The 'cached' range adaptor I mentioned on these forums a couple of times and in my DConf 2022 lightning talk converts any InputRange to a ForwardRange. (It does this by evaluating the elements once; so it would be valuable with generators as well; and in fact, a generator use case was why I wrote it.) (Aside: It actually makes a RandomAccessRange because it supports opIndex as well but it does not honor O(1): It will grab 'n' elements if you say myRange[n] and if those elements are not in the cache yet.) Currently it has an assumed performance issue because it uses a regular D slice, and the way it uses the slice incurs an allocation cost per element. There are different ways of dealing with that issue but I haven't finished that yet. Ali
Aug 24 2022
parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 8/24/22 08:16, Ali Çehreli wrote:
 On 8/22/22 15:51, Chris Piker wrote:

  > So depending on the use case, dxml works quite well.  For my own
  > purposes I'll need to find/create a ForwardRange adapter for stdin

 The 'cached' range adaptor I mentioned on these forums a couple of times
 and in my DConf 2022 lightning talk converts any InputRange to a
 ForwardRange. (It does this by evaluating the elements once; so it would
 be valuable with generators as well; and in fact, a generator use case
 was why I wrote it.)
It is now available: https://code.dlang.org/packages/alid
 (Aside: It actually makes a RandomAccessRange because it supports
 opIndex as well but it does not honor O(1): It will grab 'n' elements if
 you say myRange[n] and if those elements are not in the cache yet.)
I realized that it is still O(1) because the seemingly unnecessarily grabbed elements would still count as "amortized" because they are readily available at O(1) for consumption of both this range and all its .save'd ranges.
 Currently it has an assumed performance issue because it uses a regular
 D slice, and the way it uses the slice incurs an allocation cost per
 element. There are different ways of dealing with that issue but I
 haven't finished that yet.
I solved that by writing the `alid.circularblocks` module. Ali
Sep 12 2022
next sibling parent reply Chris Piker <chris hoopjump.com> writes:
On Monday, 12 September 2022 at 09:29:11 UTC, Ali Çehreli wrote:

 It is now available:

   https://code.dlang.org/packages/alid

 (Aside: It actually makes a RandomAccessRange because it
supports
 opIndex as well but it does not honor O(1): It will grab 'n'
elements if
 you say myRange[n] and if those elements are not in the cache
yet.) I realized that it is still O(1) because the seemingly unnecessarily grabbed elements would still count as "amortized" because they are readily available at O(1) for consumption of both this range and all its .save'd ranges.
Wow pretty slick, thanks! I know everyone wants the D community to be larger, but there are some advantages to a tight group. Heck, I just got help on a ground support project from my favorite computer textbook author. Outstanding! As soon as I get back around to working on that project again I'll try out alid. Neck deep in another sprint right now which depends on dpq2. Best,
Sep 13 2022
parent =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 9/13/22 19:00, Chris Piker wrote:

 As soon as I get back around to working on that project again I'll try
 out alid.
Please don't give up but give feedback if it doesn't fit your use case. It desperately needs to be tested in the wild. :) Ali
Oct 02 2022
prev sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Monday, 12 September 2022 at 09:29:11 UTC, Ali Çehreli wrote:
 On 8/24/22 08:16, Ali Çehreli wrote:
   https://code.dlang.org/packages/alid

 (Aside: It actually makes a RandomAccessRange because it
supports
 opIndex as well but it does not honor O(1): It will grab 'n'
elements if
 you say myRange[n] and if those elements are not in the cache
yet.) I realized that it is still O(1) because the seemingly unnecessarily grabbed elements would still count as "amortized" because they are readily available at O(1) for consumption of both this range and all its .save'd ranges.
I believe it's actually **O(log(n))** amortized because `CircularBlocks.addExistingBlock_` will reallocate `blocks` and move the contents over to the new address, an **O(n)** operation, for every **O(log(n))** accesses to `ElementCache`. (This extra **O(log(n))** factor is typical for [Appender](https://dlang.org/phobos/std_array.html#Appender)-like systems.)
Oct 03 2022
next sibling parent tsbockman <thomas.bockman gmail.com> writes:
On Monday, 3 October 2022 at 07:28:46 UTC, tsbockman wrote:
 On Monday, 12 September 2022 at 09:29:11 UTC, Ali Çehreli wrote:
 I realized that it is still O(1) because the seemingly 
 unnecessarily grabbed elements would still count as 
 "amortized" because they are readily available at O(1) for 
 consumption of both this range and all its .save'd ranges.
I believe it's actually **O(log(n))** amortized because `CircularBlocks.addExistingBlock_` will reallocate `blocks` and move the contents over to the new address, an **O(n)** operation, for every **O(log(n))** accesses to `ElementCache`. (This extra **O(log(n))** factor is typical for [Appender](https://dlang.org/phobos/std_array.html#Appender)-like systems.)
Nevermind - I forgot that the **n** when moving the old contents over is actually a different variable each time, whose average value is **O(n / log(n))**, which makes the whole thing reduce to amortized **O(1)** time per element, as you claimed.
Oct 03 2022
prev sibling parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 10/3/22 00:28, tsbockman wrote:

 `CircularBlocks.addExistingBlock_` will reallocate `blocks` and move the
 contents over to the new address,
CircularBlocks does not move elements. It just allocates and adds a new block to its "array of slices" queue. Algorithmic complexity does get complicated :) if the block size is too small compared to live elements. Then there would be too many small blocks to shuffle around e.g. to the back of the queue to be reused. Ali
Oct 03 2022
parent reply tsbockman <thomas.bockman gmail.com> writes:
On Monday, 3 October 2022 at 14:46:13 UTC, Ali Çehreli wrote:
 CircularBlocks does not move elements. It just allocates and 
 adds a new block to its "array of slices" queue.
Repeatedly appending `~=` to a dynamic array, as `CircularBlocks` does [here](https://github.com/acehreli/alid/blob/main/circularblocks/alid/cir ularblocks.d#L422), will reallocate and move the elements over to the new memory whenever the new `.length` would exceed `.capacity`: ```D void addExistingBlock_(ubyte[] buffer) { import std.array : back; blocks ~= ReusableBlock!T(buffer); capacity_ += blocks.back.capacity; } ```
Oct 03 2022
parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 10/3/22 09:12, tsbockman wrote:
 On Monday, 3 October 2022 at 14:46:13 UTC, Ali Çehreli wrote:
 CircularBlocks does not move elements. It just allocates and adds a
 new block to its "array of slices" queue.
Repeatedly appending `~=` to a dynamic array, as `CircularBlocks` does
[here](https://github.com/acehreli/alid/blob/main/circularblocks/alid/cir ularblocks.d#L422), will reallocate and move the elements over to the new memory whenever the new `.length` would exceed `.capacity`:
 ```D
      void addExistingBlock_(ubyte[] buffer)
      {
          import std.array : back;

          blocks ~= ReusableBlock!T(buffer);
          capacity_ += blocks.back.capacity;
      }
 ```
Yes but blocks_ does not carry elements but blocks to place elements on. CircularBlocks uses a circular buffer of blocks of elements. As elements are popped from the head, blocks that become empty are moved to the end of blocks_ array to be reused. (Here, we have O(M); if M (number of blocks) is small compared to N (number of live elements as in a sliding window), then we have just a few blocks to shuffle with std.algorithm.bringToFront.) addExistingBlock_() is called only when existing blocks are all in use. The typical use case is a sliding window of cached elements. Empty blocks at the frot are moved to the end and they get reused for new elements later. If the window width is constant, there will either be no memory allocation (if the user provided initial buffers), or some during the preamble as the number of buffers stabilize. There may be occasional new block allocations if the window enlarges depending on the application, but this high water mark helps reduce block allocations in the future. I appreciate your looking into this and I would be very happy to know about all performance issues. If my analysis is wrong above, please give me data for me to work with. :) Ali
Oct 03 2022
parent tsbockman <thomas.bockman gmail.com> writes:
On Monday, 3 October 2022 at 17:10:29 UTC, Ali Çehreli wrote:
 Yes but blocks_ does not carry elements but blocks to place 
 elements on.
I meant "elements" in the general sense of "items in an array". The blocks are stored in an array, and hence are "elements".
 Here, we have O(M); if M (number of blocks) is small compared 
 to N (number of live elements as in a sliding window), then we 
 have just a few blocks to shuffle with 
 std.algorithm.bringToFront.)

 addExistingBlock_() is called only when existing blocks are all 
 in use.

 The typical use case is a sliding window of cached elements. 
 Empty blocks at the frot are moved to the end and they get 
 reused for new elements later.
I don't think this changes the big O run time analysis. It sounds like, at least in the worst case, **M** is proportional to **N**, meaning that for big O purposes **M** is **N**. Regardless, it would still be **O(1)** amortized time like you said before. To change that **M** would have to be greater than **N**, which never happens.
Oct 03 2022
prev sibling next sibling parent JN <666total wp.pl> writes:
On Monday, 22 August 2022 at 11:48:47 UTC, solidstate1991 wrote:
 Since the XML parsing library was removed from Phobos, I'm 
 thinking about either getting dlang-community/experimental.xml 
 into a usable state, or write a completely new parser.

 First I'd want some community input, and would like to hear 
 from the users of lodo1995's library. Depending on some 
 circumstances, I'll be losing my job next month, so I'll have 
 some extra time on my hands (no money will be a tough thing), 
 and even without that I'll try to pull it off somehow.
I never really understood why we have to make a new library instead of just fixing std.xml. I found std.xml to be the easiest to use out of all D libraries, but my understanding was it's not up to par performance wise.
 I might have a lot of free time in the near future, so I could 
 write something for Phobos.
Honestly, that would probably end up as std.experimental.xml2. I think it'd be very hard to get a new library into Phobos. As soon as you open yourself to comments, everyone will have their own idea of what a perfect XML library would look like. And after that, there will be a struggle whether it should use exceptions or not, GC or no GC, or maybe even betterC.
Aug 22 2022
prev sibling next sibling parent reply Dejan Lekic <dejan.lekic gmail.com> writes:
On Monday, 22 August 2022 at 11:48:47 UTC, solidstate1991 wrote:
 First I'd want some community input, and would like to hear 
 from the users of lodo1995's library. Depending on some 
 circumstances, I'll be losing my job next month, so I'll have 
 some extra time on my hands (no money will be a tough thing), 
 and even without that I'll try to pull it off somehow.
You made my day as I need a good XML library that is as good as Python's xml (https://docs.python.org/3/library/xml.html) package. You are absolutely right, https://github.com/dlang-community/experimental.xml is a good starting point. See what is missing, as well as what can be improved. It is a pity that package did not get finished... I gave up using D for any XML processing long ago but perhaps your library will be good enough for some of my future XML processing tasks.
Aug 24 2022
parent reply solidstate1991 <laszloszeremi outlook.com> writes:
On Wednesday, 24 August 2022 at 17:15:42 UTC, Dejan Lekic wrote:
 You made my day as I need a good XML library that is as good as 
 Python's xml (https://docs.python.org/3/library/xml.html) 
 package.

 You are absolutely right, 
 https://github.com/dlang-community/experimental.xml is a good 
 starting point. See what is missing, as well as what can be 
 improved. It is a pity that package did not get finished... I 
 gave up using D for any XML processing long ago but perhaps 
 your library will be good enough for some of my future XML 
 processing tasks.
I took a look at experimental.xml. According to its tests, it's biggest issue is that it accepts malformed documents. I'll attempt to reverse-engineer the code, then add the necessary checks to reject the malformed documents. Since it has multiple options for allocators (stdx-allocator), it'll be a bit of a challenge, but at worst I can strip that function and replace it with GC only.
Aug 25 2022
parent solidstate1991 <laszloszeremi outlook.com> writes:
On Thursday, 25 August 2022 at 19:41:19 UTC, solidstate1991 wrote:
 I took a look at experimental.xml. According to its tests, it's 
 biggest issue is that it accepts malformed documents. I'll 
 attempt to reverse-engineer the code, then add the necessary 
 checks to reject the malformed documents. Since it has multiple 
 options for allocators (stdx-allocator), it'll be a bit of a 
 challenge, but at worst I can strip that function and replace 
 it with GC only.
So work have begun here: https://github.com/ZILtoid1991/experimental.xml Things I've done so far: * Stripped the allocators and the custom error handling functions. Not much people are using allocators anyways, it just complicates the project, and GC is otherwise the best option for anything that builds a complex tree structure. With that gone, I can just use exceptions for error handling, which can be toggled with a flag: turning it off will enable parsing badly formed XML documents, and even SGML in theory. * Simplifying a lot of things in general, with array slicing and appending. * Enabled character escaping, which led me into the DTD hellhole. * Enabled checking for bad characters in names and texts. * Started working on the processing of XML declarations (important for setting version and checking for correct encoding), and the DTD. I know that the removal of the allocators might doom my project from the inclusion in the Phobos library, but even then I can just release it as a regular dub library. Soon I'll be renaming it to newXML or something similar, while keeping the credits to its previous authors.
Sep 01 2022
prev sibling parent James Blachly <james.blachly gmail.com> writes:
On 8/22/22 7:48 AM, solidstate1991 wrote:
 Since the XML parsing library was removed from Phobos, I'm thinking 
 about either getting dlang-community/experimental.xml into a usable 
 state, or write a completely new parser.
 
 First I'd want some community input, and would like to hear from the 
 users of lodo1995's library. Depending on some circumstances, I'll be 
 losing my job next month, so I'll have some extra time on my hands (no 
 money will be a tough thing), and even without that I'll try to pull it 
 off somehow.
Would be nice to have XSD support; many (most?) XML libraries I've looked at _across all languages_ only support DTD but not XSD schema def.
Oct 02 2022