digitalmars.D - On the subject of an XML parser

solidstate1991 (8/8) Aug 22 2022 Since the XML parsing library was removed from Phobos, I'm

Adam D Ruppe (3/4) Aug 22 2022 You might want to look into my dom.d too, it isn't suitable for
H. S. Teoh (5/14) Aug 22 2022 Why not use Jonathan's dxml?

solidstate1991 (3/5) Aug 22 2022 I might have a lot of free time in the near future, so I could
Chris Piker (10/11) Aug 22 2022 I build a project off of dxml and found it to be quite nice, but

H. S. Teoh (13/27) Aug 22 2022 Do you need to parse xml on-the-fly, or would it work to just slurp the

Chris Piker (7/9) Sep 13 2022 Thanks for the suggestion, though the purpose of the lib is to

=?UTF-8?Q?Ali_=c3=87ehreli?= (14/16) Aug 24 2022 The 'cached' range adaptor I mentioned on these forums a couple of times...

=?UTF-8?Q?Ali_=c3=87ehreli?= (9/24) Sep 12 2022 It is now available:

Chris Piker (9/21) Sep 13 2022 Wow pretty slick, thanks! I know everyone wants the D community

=?UTF-8?Q?Ali_=c3=87ehreli?= (4/6) Oct 02 2022 Please don't give up but give feedback if it doesn't fit your use case.

tsbockman (8/20) Oct 03 2022 I believe it's actually **O(log(n))** amortized because

tsbockman (5/16) Oct 03 2022 Nevermind - I forgot that the **n** when moving the old contents
=?UTF-8?Q?Ali_=c3=87ehreli?= (7/9) Oct 03 2022 CircularBlocks does not move elements. It just allocates and adds a new

tsbockman (12/14) Oct 03 2022 Repeatedly appending `~=` to a dynamic array, as `CircularBlocks`

=?UTF-8?Q?Ali_=c3=87ehreli?= (24/37) Oct 03 2022 [here](https://github.com/acehreli/alid/blob/main/circularblocks/alid/ci...

tsbockman (9/20) Oct 03 2022 I meant "elements" in the general sense of "items in an array".

JN (11/21) Aug 22 2022 I never really understood why we have to make a new library
Dejan Lekic (10/15) Aug 24 2022 You made my day as I need a good XML library that is as good as

solidstate1991 (8/18) Aug 25 2022 I took a look at experimental.xml. According to its tests, it's

solidstate1991 (23/30) Sep 01 2022 So work have begun here:

James Blachly (3/12) Oct 02 2022 Would be nice to have XSD support; many (most?) XML libraries I've

solidstate1991 <laszloszeremi outlook.com> writes:

Since the XML parsing library was removed from Phobos, I'm 
thinking about either getting dlang-community/experimental.xml 
into a usable state, or write a completely new parser.

First I'd want some community input, and would like to hear from 
the users of lodo1995's library. Depending on some circumstances, 
I'll be losing my job next month, so I'll have some extra time on 
my hands (no money will be a tough thing), and even without that 
I'll try to pull it off somehow.

Aug 22 2022

Adam D Ruppe <destructionator gmail.com> writes:

On Monday, 22 August 2022 at 11:48:47 UTC, solidstate1991 wrote:
 First I'd want some community input

You might want to look into my dom.d too, it isn't suitable for 
all xml but it does a decent job on much of it.

Aug 22 2022

"H. S. Teoh" <hsteoh qfbox.info> writes:

On Mon, Aug 22, 2022 at 11:48:47AM +0000, solidstate1991 via Digitalmars-d
wrote:
 Since the XML parsing library was removed from Phobos, I'm thinking
 about either getting dlang-community/experimental.xml into a usable
 state, or write a completely new parser.
 
 First I'd want some community input, and would like to hear from the
 users of lodo1995's library. Depending on some circumstances, I'll be
 losing my job next month, so I'll have some extra time on my hands (no
 money will be a tough thing), and even without that I'll try to pull
 it off somehow.

Why not use Jonathan's dxml?


T

-- 
"You know, maybe we don't *need* enemies." "Yeah, best friends are about all I
can take." -- Calvin & Hobbes

Aug 22 2022

solidstate1991 <laszloszeremi outlook.com> writes:

On Monday, 22 August 2022 at 14:51:34 UTC, H. S. Teoh wrote:
 Why not use Jonathan's dxml?


 T

I might have a lot of free time in the near future, so I could 
write something for Phobos.

Aug 22 2022

Chris Piker <chris hoopjump.com> writes:

On Monday, 22 August 2022 at 14:51:34 UTC, H. S. Teoh wrote:

 Why not use Jonathan's dxml?

I build a project off of dxml and found it to be quite nice, but 
I forgot to read the fine print (aka the template 
specialization).  After a week of development when everything was 
working well, I tried to use it for parsing stdin and that's when 
the compiler let me know than an InputRange wasn't sufficient.  
Totally my fault, no knock against the author.

So depending on the use case, dxml works quite well.  For my own 
purposes I'll need to find/create a ForwardRange adapter for 
stdin or refactor my code to use another library.

Aug 22 2022

"H. S. Teoh" <hsteoh qfbox.info> writes:

On Mon, Aug 22, 2022 at 10:51:44PM +0000, Chris Piker via Digitalmars-d wrote:
 On Monday, 22 August 2022 at 14:51:34 UTC, H. S. Teoh wrote:
 
 Why not use Jonathan's dxml?

 
 I build a project off of dxml and found it to be quite nice, but I
 forgot to read the fine print (aka the template specialization).
 After a week of development when everything was working well, I tried
 to use it for parsing stdin and that's when the compiler let me know
 than an InputRange wasn't sufficient.  Totally my fault, no knock
 against the author.
 
 So depending on the use case, dxml works quite well.  For my own
 purposes I'll need to find/create a ForwardRange adapter for stdin or
 refactor my code to use another library.

Do you need to parse xml on-the-fly, or would it work to just slurp the
entire stdin into a buffer and then parse that?

In the former case, you could probably just accumulate incoming stdin
lines into a buffer and parse that (though you'll probably need a
wrapper, otherwise dxml may terminate prematurely at the end of the
current line).

In the latter case it should be a simple matter of using
std.algorithm.copy to read stdin into a buffer, which can then be parsed
with dxml.


T

-- 
There's light at the end of the tunnel. It's the oncoming train.

Aug 22 2022

Chris Piker <chris hoopjump.com> writes:

On Monday, 22 August 2022 at 23:30:58 UTC, H. S. Teoh wrote:
 Do you need to parse xml on-the-fly, or would it work to just 
 slurp the entire stdin into a buffer and then parse that?

Thanks for the suggestion, though the purpose of the lib is to 
support stream based processing of very long time series datasets 
(> 2 TB occurs occasionally).  Due to data volume, we typically 
work with binary formats, but there is a supported XML 
representation and I'd prefer to apply the same mentality when 
processing it so as not to break user expectations.

Sep 13 2022

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 8/22/22 15:51, Chris Piker wrote:

 So depending on the use case, dxml works quite well.  For my own
 purposes I'll need to find/create a ForwardRange adapter for stdin

The 'cached' range adaptor I mentioned on these forums a couple of times 
and in my DConf 2022 lightning talk converts any InputRange to a 
ForwardRange. (It does this by evaluating the elements once; so it would 
be valuable with generators as well; and in fact, a generator use case 
was why I wrote it.)

(Aside: It actually makes a RandomAccessRange because it supports 
opIndex as well but it does not honor O(1): It will grab 'n' elements if 
you say myRange[n] and if those elements are not in the cache yet.)

Currently it has an assumed performance issue because it uses a regular 
D slice, and the way it uses the slice incurs an allocation cost per 
element. There are different ways of dealing with that issue but I 
haven't finished that yet.

Ali

Aug 24 2022

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 8/24/22 08:16, Ali Çehreli wrote:
 On 8/22/22 15:51, Chris Piker wrote:

  > So depending on the use case, dxml works quite well.  For my own
  > purposes I'll need to find/create a ForwardRange adapter for stdin

 The 'cached' range adaptor I mentioned on these forums a couple of times
 and in my DConf 2022 lightning talk converts any InputRange to a
 ForwardRange. (It does this by evaluating the elements once; so it would
 be valuable with generators as well; and in fact, a generator use case
 was why I wrote it.)

It is now available:

   https://code.dlang.org/packages/alid

 (Aside: It actually makes a RandomAccessRange because it supports
 opIndex as well but it does not honor O(1): It will grab 'n' elements if
 you say myRange[n] and if those elements are not in the cache yet.)

I realized that it is still O(1) because the seemingly unnecessarily 
grabbed elements would still count as "amortized" because they are 
readily available at O(1) for consumption of both this range and all its 
.save'd ranges.

 Currently it has an assumed performance issue because it uses a regular
 D slice, and the way it uses the slice incurs an allocation cost per
 element. There are different ways of dealing with that issue but I
 haven't finished that yet.

I solved that by writing the `alid.circularblocks` module.

Ali

Sep 12 2022

Chris Piker <chris hoopjump.com> writes:

On Monday, 12 September 2022 at 09:29:11 UTC, Ali Çehreli wrote:

 It is now available:

   https://code.dlang.org/packages/alid

 (Aside: It actually makes a RandomAccessRange because it

 supports
 opIndex as well but it does not honor O(1): It will grab 'n'

 elements if
 you say myRange[n] and if those elements are not in the cache

 yet.)

 I realized that it is still O(1) because the seemingly 
 unnecessarily grabbed elements would still count as "amortized" 
 because they are readily available at O(1) for consumption of 
 both this range and all its .save'd ranges.

Wow pretty slick, thanks!  I know everyone wants the D community 
to be larger, but there are some advantages to a tight group.  
Heck, I just got help on a ground support project from my 
favorite computer textbook author.  Outstanding!

As soon as I get back around to working on that project again 
I'll try out alid.  Neck deep in another sprint right now which 
depends on dpq2.

Best,

Sep 13 2022

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 9/13/22 19:00, Chris Piker wrote:

 As soon as I get back around to working on that project again I'll try
 out alid.

Please don't give up but give feedback if it doesn't fit your use case. 
It desperately needs to be tested in the wild. :)

Ali

Oct 02 2022

tsbockman <thomas.bockman gmail.com> writes:

On Monday, 12 September 2022 at 09:29:11 UTC, Ali Çehreli wrote:
 On 8/24/22 08:16, Ali Çehreli wrote:
   https://code.dlang.org/packages/alid

 (Aside: It actually makes a RandomAccessRange because it

 supports
 opIndex as well but it does not honor O(1): It will grab 'n'

 elements if
 you say myRange[n] and if those elements are not in the cache

 yet.)

 I realized that it is still O(1) because the seemingly 
 unnecessarily grabbed elements would still count as "amortized" 
 because they are readily available at O(1) for consumption of 
 both this range and all its .save'd ranges.

I believe it's actually **O(log(n))** amortized because 
`CircularBlocks.addExistingBlock_` will reallocate `blocks` and 
move the contents over to the new address, an **O(n)** operation, 
for every **O(log(n))** accesses to `ElementCache`.

(This extra **O(log(n))** factor is typical for 
[Appender](https://dlang.org/phobos/std_array.html#Appender)-like 
systems.)

Oct 03 2022

tsbockman <thomas.bockman gmail.com> writes:

On Monday, 3 October 2022 at 07:28:46 UTC, tsbockman wrote:
 On Monday, 12 September 2022 at 09:29:11 UTC, Ali Çehreli wrote:
 I realized that it is still O(1) because the seemingly 
 unnecessarily grabbed elements would still count as 
 "amortized" because they are readily available at O(1) for 
 consumption of both this range and all its .save'd ranges.

 I believe it's actually **O(log(n))** amortized because 
 `CircularBlocks.addExistingBlock_` will reallocate `blocks` and 
 move the contents over to the new address, an **O(n)** 
 operation, for every **O(log(n))** accesses to `ElementCache`.

 (This extra **O(log(n))** factor is typical for 
 [Appender](https://dlang.org/phobos/std_array.html#Appender)-like systems.)

Nevermind - I forgot that the **n** when moving the old contents 
over is actually a different variable each time, whose average 
value is **O(n / log(n))**, which makes the whole thing reduce to 
amortized **O(1)** time per element, as you claimed.

Oct 03 2022

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 10/3/22 00:28, tsbockman wrote:

 `CircularBlocks.addExistingBlock_` will reallocate `blocks` and move the
 contents over to the new address,

CircularBlocks does not move elements. It just allocates and adds a new 
block to its "array of slices" queue.

Algorithmic complexity does get complicated :) if the block size is too 
small compared to live elements. Then there would be too many small 
blocks to shuffle around e.g. to the back of the queue to be reused.

Ali

Oct 03 2022

tsbockman <thomas.bockman gmail.com> writes:

On Monday, 3 October 2022 at 14:46:13 UTC, Ali Çehreli wrote:
 CircularBlocks does not move elements. It just allocates and 
 adds a new block to its "array of slices" queue.

Repeatedly appending `~=` to a dynamic array, as `CircularBlocks` 
does 
[here](https://github.com/acehreli/alid/blob/main/circularblocks/alid/cir
ularblocks.d#L422), will reallocate and move the elements over to the new
memory whenever the new `.length` would exceed `.capacity`:
```D
     void addExistingBlock_(ubyte[] buffer)
     {
         import std.array : back;

         blocks ~= ReusableBlock!T(buffer);
         capacity_ += blocks.back.capacity;
     }
```

Oct 03 2022

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 10/3/22 09:12, tsbockman wrote:
 On Monday, 3 October 2022 at 14:46:13 UTC, Ali Çehreli wrote:
 CircularBlocks does not move elements. It just allocates and adds a
 new block to its "array of slices" queue.

 Repeatedly appending `~=` to a dynamic array, as `CircularBlocks` does
 

[here](https://github.com/acehreli/alid/blob/main/circularblocks/alid/cir
ularblocks.d#L422), 
will reallocate and move the elements over to the new memory whenever 
the new `.length` would exceed `.capacity`:
 ```D
      void addExistingBlock_(ubyte[] buffer)
      {
          import std.array : back;

          blocks ~= ReusableBlock!T(buffer);
          capacity_ += blocks.back.capacity;
      }
 ```

Yes but blocks_ does not carry elements but blocks to place elements on.

CircularBlocks uses a circular buffer of blocks of elements. As elements 
are popped from the head, blocks that become empty are moved to the end 
of blocks_ array to be reused. (Here, we have O(M); if M (number of 
blocks) is small compared to N (number of live elements as in a sliding 
window), then we have just a few blocks to shuffle with 
std.algorithm.bringToFront.)

addExistingBlock_() is called only when existing blocks are all in use.

The typical use case is a sliding window of cached elements. Empty 
blocks at the frot are moved to the end and they get reused for new 
elements later.

If the window width is constant, there will either be no memory 
allocation (if the user provided initial buffers), or some during the 
preamble as the number of buffers stabilize. There may be occasional new 
block allocations if the window enlarges depending on the application, 
but this high water mark helps reduce block allocations in the future.

I appreciate your looking into this and I would be very happy to know 
about all performance issues. If my analysis is wrong above, please give 
me data for me to work with. :)

Ali

Oct 03 2022

tsbockman <thomas.bockman gmail.com> writes:

On Monday, 3 October 2022 at 17:10:29 UTC, Ali Çehreli wrote:
 Yes but blocks_ does not carry elements but blocks to place 
 elements on.

I meant "elements" in the general sense of "items in an array". 
The blocks are stored in an array, and hence are "elements".

 Here, we have O(M); if M (number of blocks) is small compared 
 to N (number of live elements as in a sliding window), then we 
 have just a few blocks to shuffle with 
 std.algorithm.bringToFront.)

 addExistingBlock_() is called only when existing blocks are all 
 in use.

 The typical use case is a sliding window of cached elements. 
 Empty blocks at the frot are moved to the end and they get 
 reused for new elements later.

I don't think this changes the big O run time analysis. It sounds 
like, at least in the worst case, **M** is proportional to **N**, 
meaning that for big O purposes **M** is **N**.

Regardless, it would still be **O(1)** amortized time like you 
said before. To change that **M** would have to be greater than 
**N**, which never happens.

Oct 03 2022

JN <666total wp.pl> writes:

On Monday, 22 August 2022 at 11:48:47 UTC, solidstate1991 wrote:
 Since the XML parsing library was removed from Phobos, I'm 
 thinking about either getting dlang-community/experimental.xml 
 into a usable state, or write a completely new parser.

 First I'd want some community input, and would like to hear 
 from the users of lodo1995's library. Depending on some 
 circumstances, I'll be losing my job next month, so I'll have 
 some extra time on my hands (no money will be a tough thing), 
 and even without that I'll try to pull it off somehow.

I never really understood why we have to make a new library 
instead of just fixing std.xml. I found std.xml to be the easiest 
to use out of all D libraries, but my understanding was it's not 
up to par performance wise.

 I might have a lot of free time in the near future, so I could 
 write something for Phobos.

Honestly, that would probably end up as std.experimental.xml2. I 
think it'd be very hard to get a new library into Phobos. As soon 
as you open yourself to comments, everyone will have their own 
idea of what a perfect XML library would look like. And after 
that, there will be a struggle whether it should use exceptions 
or not, GC or no GC, or maybe even betterC.

Aug 22 2022

Dejan Lekic <dejan.lekic gmail.com> writes:

On Monday, 22 August 2022 at 11:48:47 UTC, solidstate1991 wrote:
 First I'd want some community input, and would like to hear 
 from the users of lodo1995's library. Depending on some 
 circumstances, I'll be losing my job next month, so I'll have 
 some extra time on my hands (no money will be a tough thing), 
 and even without that I'll try to pull it off somehow.

You made my day as I need a good XML library that is as good as 
Python's xml (https://docs.python.org/3/library/xml.html) package.

You are absolutely right, 
https://github.com/dlang-community/experimental.xml is a good 
starting point. See what is missing, as well as what can be 
improved. It is a pity that package did not get finished... I 
gave up using D for any XML processing long ago but perhaps your 
library will be good enough for some of my future XML processing 
tasks.

Aug 24 2022

solidstate1991 <laszloszeremi outlook.com> writes:

On Wednesday, 24 August 2022 at 17:15:42 UTC, Dejan Lekic wrote:
 You made my day as I need a good XML library that is as good as 
 Python's xml (https://docs.python.org/3/library/xml.html) 
 package.

 You are absolutely right, 
 https://github.com/dlang-community/experimental.xml is a good 
 starting point. See what is missing, as well as what can be 
 improved. It is a pity that package did not get finished... I 
 gave up using D for any XML processing long ago but perhaps 
 your library will be good enough for some of my future XML 
 processing tasks.

I took a look at experimental.xml. According to its tests, it's 
biggest issue is that it accepts malformed documents. I'll 
attempt to reverse-engineer the code, then add the necessary 
checks to reject the malformed documents. Since it has multiple 
options for allocators (stdx-allocator), it'll be a bit of a 
challenge, but at worst I can strip that function and replace it 
with GC only.

Aug 25 2022

solidstate1991 <laszloszeremi outlook.com> writes:

On Thursday, 25 August 2022 at 19:41:19 UTC, solidstate1991 wrote:
 I took a look at experimental.xml. According to its tests, it's 
 biggest issue is that it accepts malformed documents. I'll 
 attempt to reverse-engineer the code, then add the necessary 
 checks to reject the malformed documents. Since it has multiple 
 options for allocators (stdx-allocator), it'll be a bit of a 
 challenge, but at worst I can strip that function and replace 
 it with GC only.

So work have begun here: 
https://github.com/ZILtoid1991/experimental.xml

Things I've done so far:
  * Stripped the allocators and the custom error handling 
functions. Not much people are using allocators anyways, it just 
complicates the project, and GC is otherwise the best option for 
anything that builds a complex tree structure. With that gone, I 
can just use exceptions for error handling, which can be toggled 
with a flag: turning it off will enable parsing badly formed XML 
documents, and even SGML in theory.
  * Simplifying a lot of things in general, with array slicing and 
appending.
  * Enabled character escaping, which led me into the DTD hellhole.
  * Enabled checking for bad characters in names and texts.
  * Started working on the processing of XML declarations 
(important for setting version and checking for correct 
encoding), and the DTD.

I know that the removal of the allocators might doom my project 
from the inclusion in the Phobos library, but even then I can 
just release it as a regular dub library. Soon I'll be renaming 
it to newXML or something similar, while keeping the credits to 
its previous authors.

Sep 01 2022

James Blachly <james.blachly gmail.com> writes:

On 8/22/22 7:48 AM, solidstate1991 wrote:
 Since the XML parsing library was removed from Phobos, I'm thinking 
 about either getting dlang-community/experimental.xml into a usable 
 state, or write a completely new parser.
 
 First I'd want some community input, and would like to hear from the 
 users of lodo1995's library. Depending on some circumstances, I'll be 
 losing my job next month, so I'll have some extra time on my hands (no 
 money will be a tough thing), and even without that I'll try to pull it 
 off somehow.

Would be nice to have XSD support; many (most?) XML libraries I've 
looked at _across all languages_ only support DTD but not XSD schema def.

Oct 02 2022

D Programming

C/C++ Programming

Other

digitalmars.D - On the subject of an XML parser