digitalmars.D - Replacing std.xml

w0rp (14/14) Aug 29 2013 Hello everybody. I've been wondering, what are the current plans

Jonathan M Davis (32/48) Aug 29 2013 Someone needs to step forward, write it, and get it through the review

Jacob Carlborg (5/8) Aug 29 2013 Won't that have the same problem as we talked about in of the threads

Jonathan M Davis (13/20) Aug 29 2013 Possibly, but then all you have to do is make it so that it treats strin...

Jacob Carlborg (10/14) Aug 29 2013 What! I hardly believe that. That might be the case for HTML but I don't...

Chris (3/18) Aug 29 2013 And while we're at it, what about YAML? It's a subset of JSON

Jacob Carlborg (5/7) Aug 29 2013 YAML is a super set of JSON, not the other way around. But yes, I would

Chris (4/10) Aug 29 2013 Yes of course, you are right. I found this on the internet. Seems

Kiith-Sa (8/21) Aug 29 2013 It's not really abandoned, I keep updating it with compatibility

Jonathan M Davis (14/30) Aug 29 2013 Well, as I said, I couldn't remember exactly what the XML standard said ...

Brad Anderson (9/32) Aug 29 2013 You just specify the encoding in the root element.
Jacob Carlborg (5/10) Aug 29 2013 I don't understand. If use a range of dchar and call "front" and

Jonathan M Davis (9/19) Aug 29 2013 Any decent parser is going to special-case strings (especially if it's u...

Michel Fortin (18/30) Aug 29 2013 The XML standard says that an XML parser MUST support UTF-8 and UTF-16,

H. S. Teoh (37/49) Aug 29 2013 Take a look here:

Jacob Carlborg (6/8) Aug 29 2013 Actually, does the encoding really matters (as long as it's compatible
Brad Anderson (5/57) Aug 29 2013 It doesn't look like adding the rest of the ISO-8859 encodings
deadalnix (4/26) Aug 31 2013 As this is not the first time I see it used as a reliable source,

Sean Kelly (11/23) Aug 29 2013 wstring,

Brad Anderson (3/10) Aug 29 2013 This makes me wonder what kind of optimizations a hypothetical

H. S. Teoh (8/28) Aug 29 2013 Right, that's why I said the core of std.xml should handle everything as

Joakim (5/12) Aug 29 2013 I think it's great that there's no std.xml, as it implies that

maarten van damme (4/14) Aug 29 2013 and imagine someone forced to use xml who reads this answer from the
Chris (4/7) Aug 29 2013 No way around XML. A must have, as has been said in this thread.

H. S. Teoh (26/34) Aug 29 2013 While I do agree that in the current state of affairs, XML support is a

Chris (10/50) Aug 29 2013 I am moving away from XML too. Wanted to use it for a private
Jason den Dulk (7/10) Aug 29 2013 The main disavantage of JSON vs XML is lack of validation.

Joakim (10/18) Aug 29 2013 We already have a std.json in Phobos for years now. I think it'd

w0rp (22/41) Aug 29 2013 JSON is better than XML in every way I can think of. Easier to

Michel Fortin (21/37) Aug 29 2013 I wrote something like that a while ago.

Jonathan M Davis (7/46) Aug 29 2013 Cool. I started looking at implementing something like that a while back...
ilya-stromberg (2/19) Sep 03 2013 Can you push it to the github, please?

Michel Fortin (9/34) Sep 03 2013 Good idea.

Johannes Pfau (8/23) Aug 29 2013 I most points here also apply to std.xml:

Robert Schadek (6/10) Aug 29 2013 I think, this even extends to access to all semi- and structured-data.

Jacob Carlborg (6/11) Aug 29 2013 So you want serialization :). Which we currently are reviewing.

Robert Schadek (4/7) Aug 29 2013 well, sort of, but also with partial serialization (think sql update),

Brad Anderson (10/26) Aug 29 2013 That's a really great point. All of these modules that can't

Brad Anderson (2/11) Aug 29 2013 (or maybe just improve Variant)

Tobias Pankrath (4/18) Aug 29 2013 There is http://dsource.org/projects/xmlp, which at some point

ilya-stromberg (8/11) Aug 31 2013 Also, we have Tango Xml:

Jacob Carlborg (5/10) Aug 31 2013 Unfortunately the Tango XML package will never end up in Phobos due to

ilya-stromberg (3/5) Sep 01 2013 Yes, but we can always learn source code and put attention to the

Jonathan M Davis (6/12) Sep 01 2013 Not really. Looking at the source code effectively taints you. By doing ...

Michel Fortin (17/27) Aug 31 2013 Someone should benchmark it against the XML implementation I made. It

Jacob Carlborg (5/8) Aug 31 2013 I guess "Pull" is the key here. That it is the client's responsibility
qznc (7/31) Sep 02 2013 Recursion means you use the call stack instead of stack object on

Michel Fortin (6/20) Sep 02 2013 Good point about caring for pathological cases.

Richard Webb (5/8) Sep 02 2013 Has anyone done any benchmarks recently to see if that is still the case...

Andrei Alexandrescu (5/18) Aug 29 2013 I don't know much about XML, but I noticed there are a few popular
Jonathan M Davis (7/11) Aug 29 2013 That works especially well with how Michel and I were thinking it should...
Walter Bright (8/20) Aug 31 2013 The Tango implementation of XML has been very well received. I haven't l...
Lionello Lunesu (10/23) Sep 02 2013 Having been the lead programmer on the Microsoft XML team for three

Peter Williams (13/38) Sep 02 2013 For whoever ends up doing std.xml's replacement, it would be good if
Andrei Alexandrescu (3/11) Sep 03 2013 This is great info, thanks.

"w0rp" <devw0rp gmail.com> writes:

Hello everybody. I've been wondering, what are the current plans 
to replace std.xml? I'd like to help with the effort to get a 
final XML library in phobos. So, I have a few questions.

First, and most importantly, what do we expect out of a D XML 
library? I'd really like to have a discussion of the form, "Here 
is exactly the interface the structs/classes need to implement, 
go forth and implement." The general idea in my mind is 
"something SAX-like, with something a little DOM-like." I'm aware 
that std.xml has some issues supporting different encodings, so 
obviously that's included.

Second, is there an existing library that has gotten close to 
meeting whatever we need for the first point? If so, how far away 
is it from being able to meet all of the requirements and become 
the standard library version?

Aug 29 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, August 29, 2013 09:25:35 w0rp wrote:
 Hello everybody. I've been wondering, what are the current plans
 to replace std.xml? I'd like to help with the effort to get a
 final XML library in phobos. So, I have a few questions.

Someone needs to step forward, write it, and get it through the review 
process. A while back, someone was working on a possible new version of 
std.xml, but they disappeared. No one has stepped up since. I'd love to do it 
if I had time, but I don't. There are probably several others around here in 
the same boat, but until someone who has the time and skill does do it, we 
won't have a new std.xml.

 First, and most importantly, what do we expect out of a D XML
 library? I'd really like to have a discussion of the form, "Here
 is exactly the interface the structs/classes need to implement,
 go forth and implement." 

Except that that's really the task of the person creating the new std.xml. 
Generally what happens is that the person writing the module comes up with an 
API and then presents it rather than asking others to come up with ideas to 
design it for them. Obviously, ideas can be discussed, but design-by-committee 
is arguably a bad idea. And it just works better to have a concrete design to 
discuss.

 The general idea in my mind is
 "something SAX-like, with something a little DOM-like."

What I personally think would be best is to have multiple parsers. First you 
have something STAX-like (or maybe even lower level - I don't recall exactly 
what STAX gives you at the moment) that basically tokenizes the XML and 
returns a range of that. Then SAX and DOM parsers can be built on top of that. 
That way, you get the fastest parser possible as well as higher level, more 
functional parsers.

But two of the biggest points of the design are that it's going to have to be 
range-based, and it's going to need to be able to take full advantage of 
slices (when used with any strings or random-access ranges) in order to avoid 
copying any of the data. That's the key design point which will allow a D 
parser to be extremely fast in comparison to parsers in most other languages.
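
As a rough illustration of the layering described above (a low-level pull tokenizer, with a callback-style layer built on top of it), here is a minimal sketch. It is written in Python rather than D, and the names `Token`, `tokenize` and `sax_style` are invented for this example; they do not come from any proposed std.xml. Tokens carry offsets into the input instead of copied strings, which stands in for D slices. Comments, CDATA sections and `>` inside attribute values are deliberately not handled.

```python
from typing import Callable, Iterator, NamedTuple


class Token(NamedTuple):
    kind: str   # "tag" or "text"
    start: int  # offset of the first character of the token
    end: int    # offset one past the last character


def tokenize(doc: str) -> Iterator[Token]:
    """Lazily yield tokens as offsets into doc; nothing is copied here."""
    pos = 0
    while pos < len(doc):
        if doc[pos] == "<":
            close = doc.find(">", pos)
            if close == -1:
                raise ValueError("unterminated tag at offset %d" % pos)
            yield Token("tag", pos, close + 1)
            pos = close + 1
        else:
            nxt = doc.find("<", pos)
            if nxt == -1:
                nxt = len(doc)
            yield Token("text", pos, nxt)
            pos = nxt


def sax_style(doc: str,
              on_tag: Callable[[str], None],
              on_text: Callable[[str], None]) -> None:
    """A callback (SAX-like) layer built on top of the pull tokenizer."""
    for tok in tokenize(doc):
        handler = on_tag if tok.kind == "tag" else on_text
        # The substring is only materialised here, at the outermost layer.
        handler(doc[tok.start:tok.end])
```

In D the token could hold a slice of the original string directly, so even the final `doc[tok.start:tok.end]` step would not copy anything.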

 I'm aware
 that std.xml has some issues supporting different encodings, so
 obviously that's included.

Personally, I would have just said use ranges of dchar and be done with it 
without worrying about character encodings at all, but I don't remember what 
all the XML standard does with encodings.

 Second, is there an existing library that has gotten close to
 meeting whatever we need for the first point? If so, how far away
 is it from being able to meet all of the requirements and become
 the standard library version?

There are several D XML libraries floating around, but no one has taken the 
time to get any of them prepared for the Phobos review queue, and I suspect 
that very few of them are range-based like the Phobos XML solution needs to 
be, but I don't know.

- Jonathan M Davis

Aug 29 2013

Jacob Carlborg <doob me.com> writes:

On 2013-08-29 09:47, Jonathan M Davis wrote:

 Personally, I would have just said use ranges of dchar and be done with it
 without worrying about character encodings at all, but I don't remember what
 all the XML standard does with encodings.

Won't that have the same problem as we talked about in one of the threads 
about a D lexer? That is, doing unnecessary en/decoding.

-- 
/Jacob Carlborg

Aug 29 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, August 29, 2013 11:08:18 Jacob Carlborg wrote:
 On 2013-08-29 09:47, Jonathan M Davis wrote:
 Personally, I would have just said use ranges of dchar and be done with it
 without worrying about character encodings at all, but I don't remember
 what all the XML standard does with encodings.

 
 Won't that have the same problem as we talked about in one of the threads
 about a D lexer? That is, doing unnecessary en/decoding.

Possibly, but then all you have to do is make it so that it treats strings as 
ranges of code units (and possibly support ranges of char and wchar), and you 
can avoid the unnecessary decoding. But aside from possibly supporting ranges of 
char or wchar, that would be completely internal to the parser, and the caller 
wouldn't care. An alternative would be to specifically support ranges of ubyte 
instead of strings, though given that XML is usually treated as a string, that 
would arguably be a bit odd. Regardless, as far as strings go, it's easy 
enough to avoid decoding in the implementation. IIRC, everything in XML is 
ASCII anyway, with stuff like HTML codes to indicate Unicode characters. And if 
that's the case, avoiding unnecessary decoding is trivial when operating on 
strings.

- Jonathan M Davis

Aug 29 2013

Jacob Carlborg <doob me.com> writes:

On 2013-08-29 11:23, Jonathan M Davis wrote:

 IIRC, everything in XML is
 ASCII anyway, with stuff like HTML codes to indicate Unicode characters. And if
 that's the case, avoiding unnecessary decoding is trivial when operating on
 strings.

What! I hardly believe that. That might be the case for HTML but I don't 
think it is for XML. There are many file formats that are based on XML. 
I don't think all those use HTML codes.

This is what W3 Schools says:

"XML documents can contain non ASCII characters, like Norwegian æ ø å , 
or French ê è é.

To avoid errors, specify the XML encoding, or save XML files as Unicode.".

-- 
/Jacob Carlborg

Aug 29 2013

"Chris" <wendlec tcd.ie> writes:

On Thursday, 29 August 2013 at 13:20:40 UTC, Jacob Carlborg wrote:
 On 2013-08-29 11:23, Jonathan M Davis wrote:

 IIRC, everything in XML is
 ASCII anyway, with stuff like HTML codes to indicate Unicode 
 characters. And if
 that's the case, avoiding unnecessary decoding is trivial when 
 operating on
 strings.

 What! I hardly believe that. That might be the case for HTML 
 but I don't think it is for XML. There are many file formats 
 that are based on XML. I don't think all those use HTML codes.

 This is what W3 Schools says:

 "XML documents can contain non ASCII characters, like Norwegian 
 æ ø å , or French ê è é.

 To avoid errors, specify the XML encoding, or save XML files as 
 Unicode.".

And while we're at it, what about YAML? It's a subset of JSON 
which means the new json.d module will handle it, I suppose.

Aug 29 2013

Jacob Carlborg <doob me.com> writes:

On 2013-08-29 16:07, Chris wrote:

 And while we're at it, what about YAML? It's a subset of JSON which
 means the new json.d module will handle it, I suppose.

YAML is a superset of JSON, not the other way around. But yes, I would 
like to have YAML support as well.

-- 
/Jacob Carlborg

Aug 29 2013

"Chris" <wendlec tcd.ie> writes:

On Thursday, 29 August 2013 at 19:26:21 UTC, Jacob Carlborg wrote:
 On 2013-08-29 16:07, Chris wrote:

 And while we're at it, what about YAML? It's a subset of JSON 
 which
 means the new json.d module will handle it, I suppose.

 YAML is a super set of JSON, not the other way around. But yes, 
 I would like to have YAML support as well.

Yes of course, you are right. I found this on the internet. Seems 
to be abandoned.

https://github.com/kiith-sa/D-YAML

Aug 29 2013

"Kiith-Sa" <kiithsacmp gmail.com> writes:

On Thursday, 29 August 2013 at 22:56:36 UTC, Chris wrote:
 On Thursday, 29 August 2013 at 19:26:21 UTC, Jacob Carlborg 
 wrote:
 On 2013-08-29 16:07, Chris wrote:

 And while we're at it, what about YAML? It's a subset of JSON 
 which
 means the new json.d module will handle it, I suppose.

 YAML is a super set of JSON, not the other way around. But 
 yes, I would like to have YAML support as well.

 Yes of course, you are right. I found this on the internet. 
 Seems to be abandoned.

 https://github.com/kiith-sa/D-YAML

It's not really abandoned, I keep updating it with compatibility 
fixes for new DMD releases as my other projects depend on it.

Its API does not fit into Phobos, however (not range-based), and 
it won't unless I find a few weeks/months to work on it 
exclusively, which is unlikely in the near future.

It also only supports YAML 1.1 at the moment, and recursive data 
structures are not yet supported.

Aug 29 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 29, 2013 15:20:39 Jacob Carlborg wrote:
 On 2013-08-29 11:23, Jonathan M Davis wrote:
 IIRC, everything in XML is
 ASCII anyway, with stuff like HTML codes to indicate Unicode characters.
 And if that's the case, avoiding unnecessary decoding is trivial when
 operating on strings.

 
 What! I hardly believe that. That might be the case for HTML but I don't
 think it is for XML. There are many file formats that are based on XML.
 I don't think all those use HTML codes.
 
 This is what W3 Schools says:
 
 "XML documents can contain non ASCII characters, like Norwegian æ ø å ,
 or French ê è é.
 
 To avoid errors, specify the XML encoding, or save XML files as Unicode.".

Well, as I said, I couldn't remember exactly what the XML standard said about 
encodings, but if it can contain non-ASCII characters, then my first 
inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on the 
fact that that's what we support in the language and in Phobos (as I 
understand it, std.encoding is a bit of a joke that needs to be rethought and 
replaced, but regardless, it's the only Phobos module supporting any 
non-Unicode encodings).

However, because all of the XML special symbols should be ASCII, you should 
still be able to avoid decoding characters for the most part. It's only when 
you have to actually look at the content that Unicode would potentially 
matter. So, the performance hit of decoding Unicode characters should mostly 
be able to be avoided.

- Jonathan M Davis

Aug 29 2013

"Brad Anderson" <eco gnuk.net> writes:

On Thursday, 29 August 2013 at 17:38:43 UTC, Jonathan M Davis 
wrote:
 Well, as I said, I couldn't remember exactly what the XML standard said
 about encodings, but if it can contain non-ASCII characters, then my first
 inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on
 the fact that that's what we support in the language and in Phobos (as I
 understand it, std.encoding is a bit of a joke that needs to be rethought
 and replaced, but regardless, it's the only Phobos module supporting any
 non-Unicode encodings).

 However, because all of the XML special symbols should be ASCII, you
 should still be able to avoid decoding characters for the most part. It's
 only when you have to actually look at the content that Unicode would
 potentially matter. So, the performance hit of decoding Unicode characters
 should mostly be able to be avoided.

 - Jonathan M Davis

You just specify the encoding in the XML declaration at the top of the document.

<?xml version="1.0" encoding="us-ascii"?>
<?xml version="1.0" encoding="windows-1252"?>
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" encoding="UTF-16"?>

UTF-8 is the default in the absence of a BOM or declaration saying otherwise.
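
A sketch of how a parser might pick the encoding before decoding anything, assuming the rule stated above (BOM first, then the declaration, else UTF-8). This is Python and illustrative only; `sniff_encoding` is a hypothetical helper, UTF-32 and EBCDIC detection are left out, and the declaration is only recognised when it is written in an ASCII-compatible encoding.

```python
import codecs
import re

# Byte-order marks are checked before looking at the declaration.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches the encoding pseudo-attribute of a leading <?xml ... ?> declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(data: bytes) -> str:
    """Return a lowercase encoding name for the start of an XML document."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # no BOM and no declared encoding
```

For example, `sniff_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><r/>')` returns `"iso-8859-1"`, and a document with no BOM and no declaration falls back to `"utf-8"`.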

Aug 29 2013

Jacob Carlborg <doob me.com> writes:

On 2013-08-29 19:38, Jonathan M Davis wrote:

 However, because all of the XML special symbols should be ASCII, you should
 still be able to avoid decoding characters for the most part. It's only when
 you have to actually look at the content that Unicode would potentially
 matter. So, the performance hit of decoding Unicode characters should mostly
 be able to be avoided.

I don't understand. If I use a range of dchar and call "front" and 
"popFront" won't it do decoding then?

-- 
/Jacob Carlborg

Aug 29 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 29, 2013 21:28:09 Jacob Carlborg wrote:
 On 2013-08-29 19:38, Jonathan M Davis wrote:
 However, because all of the XML special symbols should be ASCII, you
 should
 still be able to avoid decoding characters for the most part. It's only
 when you have to actually look at the content that Unicode would
 potentially matter. So, the performance hit of decoding Unicode
 characters should mostly be able to be avoided.

 
 I don't understand. If I use a range of dchar and call "front" and
 "popFront" won't it do decoding then?

Any decent parser is going to special-case strings (especially if it's using 
slicing), in which case, it won't call front unless it needs to decode. The 
only real question is whether generic char and wchar ranges should be 
supported, because then you could avoid the decoding for ranges that aren't 
strings, but strings are already covered simply by special casing. You really
can't afford to not special-case for strings for algorithms in general if
efficiency is a high priority.

- Jonathan M Davis

Aug 29 2013

Michel Fortin <michel.fortin michelf.ca> writes:

On 2013-08-29 17:38:23 +0000, "Jonathan M Davis" <jmdavisProg gmx.com> said:

 Well, as I said, I couldn't remember exactly what the XML standard said about
 encodings, but if it can contain non-ASCII characters, then my first
 inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on the
 fact that that's what we support in the language and in Phobos (as I
 understand it, std.encoding is a bit of a joke that needs to be rethought and
 replaced, but regardless, it's the only Phobos module supporting any
 non-Unicode encodings).

The XML standard says that an XML parser MUST support UTF-8 and UTF-16, 
and MAY support other encodings.

Supporting non-UTF-8 encodings is a separate problem from parsing XML, 
and proper code for that would have much broader applications. Keep in 
mind that the more encodings you support, the more bloat you add to the 
executable, so there's a tradeoff to be made. In many cases, UTF-8 is 
enough, while in many others it's not.

(My XML implementation has a function that parses the XML prolog and 
tells you the encoding so you can take the appropriate code path before 
feeding the parser. A higher level API could handle encodings 
automatically based on that.)


 However, because all of the XML special symbols should be ASCII, you should
 still be able to avoid decoding characters for the most part. It's only when
 you have to actually look at the content that Unicode would potentially
 matter. So, the performance hit of decoding Unicode characters should mostly
 be able to be avoided.

Just like my XML implementation does. (I made frontUnit/popFrontUnit 
functions I'm using when decoding code points is unnecessary.)


-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca

Aug 29 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Aug 29, 2013 at 01:38:23PM -0400, Jonathan M Davis wrote:
[...]
 Well, as I said, I couldn't remember exactly what the XML standard said about 
 encodings, but if it can contain non-ASCII characters, then my first 
 inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on the 
 fact that that's what we support in the language and in Phobos

Take a look here:

	http://www.w3schools.com/xml/xml_encoding.asp

XML files can have *any* valid encoding, including nastiness like
windows-1252 and relics like iso-8859-1. Unfortunately, I don't think we
have a way around this, since existing XML files out there probably
already have all of these encodings and more, and std.xml is gonna hafta
support 'em all. Otherwise we're gonna get irate users complaining "why
can't std.xml parse my oddly-encoded-but-standards-compliant XML file?!"


 (as I understand it, std.encoding is a bit of a joke that needs to be
 rethought and replaced, but regardless, it's the only Phobos module
 supporting any non-Unicode encodings).

No kidding! I was trying to write a program that navigates a website
automatically using std.net.curl, and I'm running into all sorts of
silly roadblocks, including std.encoding not supporting iso-8859-*
encodings.

The good news is that on Linux, there's a handy utility called 'recode',
which comes with a library called 'librecode', that supports converting
between a huge number of different encodings -- many more than probably
you or I have imagined existed -- including to/from Unicode.  I know we
don't like including external libraries in Phobos, but I honestly don't
see any justification for reinventing the wheel by writing (and
maintaining!) our own equivalent to librecode, unless licensing issues
prevent us from including librecode in Phobos, nicely wrapped in a
modern range-based D API.


 However, because all of the XML special symbols should be ASCII, you
 should still be able to avoid decoding characters for the most part.
 It's only when you have to actually look at the content that Unicode
 would potentially matter. So, the performance hit of decoding Unicode
 characters should mostly be able to be avoided.

[...]

One way is to write the core code of std.xml in such a way that it
handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit
encodings) so that it's encoding-independent. Then on top of this core,
write some convenience wrappers that cast/convert to string, wstring,
dstring. As an initial stab, we could support only UTF-8, UTF-16, UTF-32
if the user asks for string/wstring/dstring, and leave XML in other
encodings up to the user to decode manually. This way, at least the user
can get the data out of the file.

Later on, once we've gotten our act together with std.encoding, we can
hook it up to std.xml to provide autoconversion.


T

-- 
Almost all proofs have bugs, but almost all theorems are true. -- Paul Pedersen

Aug 29 2013

Jacob Carlborg <doob me.com> writes:

On 2013-08-29 20:57, H. S. Teoh wrote:

 XML files can have *any* valid encoding, including nastiness like
 windows-1252 and relics like iso-8859-1.

Actually, does the encoding really matter (as long as it's compatible 
with ASCII)? Just use a range of ubytes, the parser will only be looking 
for characters in the ASCII table anyway.
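
To illustrate the point, the sketch below (Python, with a hypothetical `tag_spans` helper) finds tag boundaries by comparing raw bytes against `<` and `>` without decoding anything. It relies on the assumption stated above: the encoding must be ASCII-compatible, which holds for UTF-8, ISO-8859-1 and windows-1252 but not for UTF-16, where each unit is 16 bits wide.

```python
def tag_spans(data: bytes):
    """Yield (start, end) byte offsets of every <...> run in data."""
    pos = 0
    while True:
        start = data.find(b"<", pos)
        if start == -1:
            return
        end = data.find(b">", start)
        if end == -1:
            return
        yield start, end + 1
        pos = end + 1


text = "<navn>blåbærsyltetøy</navn>"
found = {}
for encoding in ("utf-8", "iso-8859-1", "cp1252"):  # cp1252 is windows-1252
    raw = text.encode(encoding)
    # Slice out the tags as bytes; the element content is never decoded.
    found[encoding] = [raw[s:e] for s, e in tag_spans(raw)]
```

All three encodings give the same two tag spans, `<navn>` and `</navn>`, even though the text between them is encoded differently. This works for UTF-8 because every byte of a multi-byte sequence is 0x80 or above, so it can never be mistaken for an ASCII metacharacter.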

-- 
/Jacob Carlborg

Aug 29 2013

"Brad Anderson" <eco gnuk.net> writes:

On Thursday, 29 August 2013 at 18:58:57 UTC, H. S. Teoh wrote:
 No kidding! I was trying to write a program that navigates a website
 automatically using std.net.curl, and I'm running into all sorts of
 silly roadblocks, including std.encoding not supporting iso-8859-*
 encodings.

It doesn't look like adding the rest of the ISO-8859 encodings 
would be all that difficult if you used the existing ISO-8859-1 
(Latin1) as a base.  I don't quite understand where and how 
transcoding is done though.

 The good news is that on Linux, there's a handy utility called 'recode',
 which comes with a library called 'librecode', that supports converting
 between a huge number of different encodings -- many more than probably
 you or I have imagined existed -- including to/from Unicode.  I know we
 don't like including external libraries in Phobos, but I honestly don't
 see any justification for reinventing the wheel by writing (and
 maintaining!) our own equivalent to librecode, unless licensing issues
 prevent us from including librecode in Phobos, nicely wrapped in a
 modern range-based D API.


 However, because all of the XML special symbols should be ASCII, you
 should still be able to avoid decoding characters for the most part.
 It's only when you have to actually look at the content that Unicode
 would potentially matter. So, the performance hit of decoding Unicode
 characters should mostly be able to be avoided.

 [...]

 One way is to write the core code of std.xml in such a way that it
 handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit
 encodings) so that it's encoding-independent. Then on top of this core,
 write some convenience wrappers that cast/convert to string, wstring,
 dstring. As an initial stab, we could support only UTF-8, UTF-16, UTF-32
 if the user asks for string/wstring/dstring, and leave XML in other
 encodings up to the user to decode manually. This way, at least the user
 can get the data out of the file.

 Later on, once we've gotten our act together with std.encoding, we can
 hook it up to std.xml to provide autoconversion.


 T


 T

Aug 29 2013

"deadalnix" <deadalnix gmail.com> writes:

On Thursday, 29 August 2013 at 18:58:57 UTC, H. S. Teoh wrote:
 On Thu, Aug 29, 2013 at 01:38:23PM -0400, Jonathan M Davis wrote:
 [...]
 Well, as I said, I couldn't remember exactly what the XML standard said
 about encodings, but if it can contain non-ASCII characters, then my
 first inclination is to say that it has to be UTF-8, UTF-16, or UTF-32
 based on the fact that that's what we support in the language and in
 Phobos

 Take a look here:

 	http://www.w3schools.com/xml/xml_encoding.asp

 XML files can have *any* valid encoding, including nastiness like
 windows-1252 and relics like iso-8859-1. Unfortunately, I don't think we
 have a way around this, since existing XML files out there probably
 already have all of these encodings and more, and std.xml is gonna hafta
 support 'em all. Otherwise we're gonna get irate users complaining "why
 can't std.xml parse my oddly-encoded-but-standards-compliant XML file?!"

As this is not the first time I've seen it used as a reliable source: 
no, w3schools is full of shit. Don't use that website when looking 
for precise, high-quality information.

Aug 31 2013

Sean Kelly <sean invisibleduck.org> writes:

On Aug 29, 2013, at 11:57 AM, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:

 One way is to write the core code of std.xml in such a way that it
 handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit
 encodings) so that it's encoding-independent. Then on top of this core,
 write some convenience wrappers that cast/convert to string, wstring,
 dstring. As an initial stab, we could support only UTF-8, UTF-16, UTF-32
 if the user asks for string/wstring/dstring, and leave XML in other
 encodings up to the user to decode manually. This way, at least the user
 can get the data out of the file.

 Later on, once we've gotten our act together with std.encoding, we can
 hook it up to std.xml to provide autoconversion.

As long as autoconversion is optional.  When parsing XML or JSON or
whatever, I generally only care about specific strings, and sometimes
don't want anything decoded at all.  Having decoding done automatically
before the event fires is a huge and potentially unnecessary performance
hit.  Not doing this decoding automatically is what makes the Tango XML
parser so fast.
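
One way to keep decoding optional is to hand the caller an undecoded view of the input and decode only on request. The sketch below is Python and purely illustrative; `LazyText` is a made-up name and says nothing about how Tango's parser is actually implemented.

```python
class LazyText:
    """A span of the input buffer that is decoded only when asked for."""

    def __init__(self, buf: bytes, start: int, end: int, encoding: str):
        self._view = memoryview(buf)[start:end]  # a view, not a copy
        self._encoding = encoding
        self._decoded = None

    def raw(self) -> memoryview:
        """Undecoded bytes, suitable for byte-wise comparison."""
        return self._view

    def text(self) -> str:
        """Decode on first use and cache the result."""
        if self._decoded is None:
            self._decoded = self._view.tobytes().decode(self._encoding)
        return self._decoded


buf = "<note>blåbær</note>".encode("utf-8")
body = LazyText(buf, buf.index(b">") + 1, buf.index(b"</"), "utf-8")
# Compared as bytes: nothing has been decoded at this point.
is_match = body.raw() == "blåbær".encode("utf-8")
```

A caller that only cares about a handful of strings can filter on `raw()` and call `text()` for the few spans it actually needs.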

Aug 29 2013

"Brad Anderson" <eco gnuk.net> writes:

On Thursday, 29 August 2013 at 20:08:10 UTC, Sean Kelly wrote:
 As long as autoconversion is optional.  When parsing XML or JSON 
 or whatever, I generally only care about specific strings, and 
 sometimes don't want anything decoded at all.  Having decoding 
 done automatically before the event fires is a huge and 
 potentially unnecessary performance hit.  Not doing this 
 decoding automatically is what makes the Tango XML parser so 
 fast.

This makes me wonder what kind of optimizations a hypothetical 
ctXml could perform.

Aug 29 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Aug 29, 2013 at 12:41:16PM -0700, Sean Kelly wrote:
 On Aug 29, 2013, at 11:57 AM, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:
 
 One way is to write the core code of std.xml in such a way that it
 handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit
 encodings) so that it's encoding-independent. Then on top of this
 core, write some convenience wrappers that cast/convert to string,
 wstring, dstring. As an initial stab, we could support only UTF-8,
 UTF-16, UTF-32 if the user asks for string/wstring/dstring, and
 leave XML in other encodings up to the user to decode manually. This
 way, at least the user can get the data out of the file.
 
 Later on, once we've gotten our act together with std.encoding, we
 can hook it up to std.xml to provide autoconversion.

 
 As long as autoconversion is optional.  When parsing XML or JSON or
 whatever, I generally only care about specific strings, and sometimes
 don't want anything decoded at all.  Having decoding done
 automatically before the event fires is a huge and potentially
 unnecessary performance hit.  Not doing this decoding automatically is
 what makes the Tango XML parser so fast.

Right, that's why I said the core of std.xml should handle everything as
bytes, only specially treating the ASCII values of <, >, &, and other
metacharacters. The tagname and tag body should just be a range over
segments of the input.


T

-- 
What are you when you run out of Monet? Baroque.

Aug 29 2013

"Joakim" <joakim airpost.net> writes:

On Thursday, 29 August 2013 at 07:47:35 UTC, Jonathan M Davis 
wrote:
 There are several D XML libraries floating around, but no one has taken
 the time to get any of them prepared for the Phobos review queue, and I
 suspect that very few of them are range-based like the Phobos XML
 solution needs to be, but I don't know.

I think it's great that there's no std.xml, as it implies that 
nobody using D would use a dumb tech like XML.  Let's keep it 
that way. :)

Aug 29 2013

maarten van damme <maartenvd1994 gmail.com> writes:

and imagine someone forced to use xml who reads this answer from the
community :p

std.xml is a must, no doubt.


2013/8/29 Joakim <joakim airpost.net>

 On Thursday, 29 August 2013 at 07:47:35 UTC, Jonathan M Davis wrote:

 There are several D XML libraries floating around, but no one has taken
 the time to get any of them prepared for the Phobos review queue, and I
 suspect that very few of them are range-based like the Phobos XML
 solution needs to be, but I don't know.

 I think it's great that there's no std.xml, as it implies that nobody
 using D would use a dumb tech like XML.  Let's keep it that way. :)

Aug 29 2013

"Chris" <wendlec tcd.ie> writes:

On Thursday, 29 August 2013 at 09:24:31 UTC, Joakim wrote:
 I think it's great that there's no std.xml, as it implies that 
 nobody using D would use a dumb tech like XML.  Let's keep it 
 that way. :)

No way around XML. A must have, as has been said in this thread. 
But what would you suggest as a better alternative to XML? It 
might be worth creating modules for alternatives too (like JSON).

Aug 29 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Aug 29, 2013 at 01:14:19PM +0200, Chris wrote:
 On Thursday, 29 August 2013 at 09:24:31 UTC, Joakim wrote:
I think it's great that there's no std.xml, as it implies that
nobody using D would use a dumb tech like XML.  Let's keep it that
way. :)

 
 No way around XML. A must have, as has been said in this thread. But
 what would you suggest as a better alternative to XML. It might be
 worth creating modules for alternative too (like JSON).

While I do agree that in the current state of affairs, XML support is a
must, I also think that XML is just way overengineered, IMNSHO. It
adds too much overhead and therefore requires compression to be
efficient, and it is needlessly complex for what it does (tag
attributes, all the different cases of CDATA / non-CDATA, etc.). This
complexity makes it impractical to edit by hand, relegating it to
machine reading/writing only, which then begs the question of why a
binary format wasn't chosen instead. And don't get me started on DTDs,
which are incredibly convoluted and can't even express certain things
that one might want to express in an automatic validation system. Or
that 17-headed monster called XSLT, which, thankfully, is fading into
the obscurity of time.

JSON is a nicer, simpler alternative, though there may be limitations
with it that I don't know about. Word on the street is that many people
are abandoning XML for JSON due to lower maintenance overhead (and this
includes one of my friends, who was a hardcore XML fanatic -- I was
frankly quite surprised when he told me he was considering migrating to
JSON, since the original reason he chose XML was so that his data will
be future-proof... well, so much for *that*).

But all of this is irrelevant... it doesn't alleviate the need for a
std.xml replacement, since we have to live in the real world where XML
exists and must be supported. :)


T

-- 
Life would be easier if I had the source code. -- YHL

Aug 29 2013

"Chris" <wendlec tcd.ie> writes:

On Thursday, 29 August 2013 at 15:43:36 UTC, H. S. Teoh wrote:
 While I do agree that in the current state of affairs, XML 
 support is a
 must, I also think that XML is just way overengineered, IMNSHO. 
 It has
 adds too much overhead and therefore requires compression to be
 efficient, and it is needlessly complex for what it does (tag
 attributes, all the different cases of CDATA / non-CDATA, 
 etc.). This
 complexity makes it impractical to edit by hand, relegating it 
 to
 machine reading/writing only, which then begs the question of 
 why a
 binary format wasn't chosen instead. And don't get me started 
 on DTDs,
 which are incredibly convoluted and can't even express certain 
 things
 that one might want to express in an automatic validation 
 system. Or
 that 17-headed monster called XSLT, which, thankfully, is 
 fading into
 the obscurity of time.

 JSON is a nicer, simpler alternative, though there may be 
 limitations
 with it that I don't know about. Word on the street is that 
 many people
 are abandoning XML for JSON due to lower maintenance overhead 
 (and this
 includes one of my friends, who was a hardcore XML fanatic -- I 
 was
 frankly quite surprised when he told me he was considering 
 migrating to
 JSON, since the original reason he chose XML was so that his 
 data will
 future-proof... well, so much for *that*).

 But all of this is irrelevant... it doesn't alleviate the need 
 for a
 std.xml replacement, since we have to live in the real world 
 where XML
 exists and must be supported. :)


 T

I am moving away from XML too. Wanted to use it for a private 
project. But I soon realized the madness of it, especially when 
there are people involved who are not programmers and have no 
clue whatsoever about markup languages, data storage formats etc. 
I think JSON and YAML are good candidates for the private project 
which revolves around collecting words and phrases and archiving 
them. I don't know exactly what I will use, but XML definitely 
won't get the job.

DTD sounds too much like DDT!

Aug 29 2013

"Jason den Dulk" <public2 jasondendulk.com> writes:

On Thursday, 29 August 2013 at 15:43:36 UTC, H. S. Teoh wrote:
 JSON is a nicer, simpler alternative, though there may be 
 limitations
 with it that I don't know about.

The main disadvantage of JSON vs XML is lack of validation. 
Whenever I write code that works with JSON (or any data format), 
I have to write extra code to perform validation. If there was a 
validation addon for JSON, you could nix XML for good.
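
To show the kind of extra code meant here, a minimal sketch in Python
follows. The validate function and its spec format are invented for this
example and are not part of any JSON library: a spec is either a Python
type, a dict of required fields, or a one-element list describing every
item of an array.

```python
import json


def validate(value, spec, path="$"):
    """Return a list of error strings; an empty list means the value conforms.

    spec is a Python type, a dict of field specs (all fields required),
    or a one-element list giving the spec for every item of an array.
    """
    if isinstance(spec, dict):
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        errors = []
        for key, sub in spec.items():
            if key not in value:
                errors.append(f"{path}.{key}: missing")
            else:
                errors.extend(validate(value[key], sub, f"{path}.{key}"))
        return errors
    if isinstance(spec, list):
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        errors = []
        for i, item in enumerate(value):
            errors.extend(validate(item, spec[0], f"{path}[{i}]"))
        return errors
    # Leaf case: a plain type check (note that bool passes as int in Python).
    if not isinstance(value, spec):
        return [f"{path}: expected {spec.__name__}"]
    return []


# Hypothetical document shape: a word with a list of senses.
ENTRY_SPEC = {"word": str, "senses": [str]}
errors = validate(json.loads('{"word": "baroque", "senses": ["ornate"]}'), ENTRY_SPEC)
```

Even this toy version has to be written and maintained by hand for
every document shape, which is the overhead described above.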

Regards
Jason

Aug 29 2013

"Joakim" <joakim airpost.net> writes:

On Thursday, 29 August 2013 at 11:14:21 UTC, Chris wrote:
 On Thursday, 29 August 2013 at 09:24:31 UTC, Joakim wrote:
 I think it's great that there's no std.xml, as it implies that 
 nobody using D would use a dumb tech like XML.  Let's keep it 
 that way. :)

 No way around XML. A must have, as has been said in this 
 thread. But what would you suggest as a better alternative to 
 XML. It might be worth creating modules for alternative too 
 (like JSON).

We have had std.json in Phobos for years now.  I think it'd 
be great for Phobos to nudge users in better directions, by 
having a std.json but no std.xml.  There'll always be outside 
libraries to process XML for those who can't go without.  Perhaps 
a list of XML libraries can be added to the wiki:

http://wiki.dlang.org/Libraries_and_Frameworks

I see no use for XML, as it's a horrible solution in search of a 
problem, but for those who must use it, they can always get it 
outside Phobos.  Just a suggestion.

Aug 29 2013

"w0rp" <devw0rp gmail.com> writes:

On Thursday, 29 August 2013 at 09:24:31 UTC, Joakim wrote:
 I think it's great that there's no std.xml, as it implies that 
 nobody using D would use a dumb tech like XML.  Let's keep it 
 that way. :)

JSON is better than XML in every way I can think of. Easier to 
map to data structures in whichever language you're using, much 
smaller in size, fewer corner cases, etc. However, just saying XML 
is dumb isn't a useful policy. You need ways of parsing XML on 
hand until people stop using it.

On Thursday, 29 August 2013 at 08:15:39 UTC, Robert Schadek wrote:
 On 08/29/2013 09:51 AM, Johannes Pfau wrote:
 I most points here also apply to std.xml:
 http://wiki.dlang.org/Wish_list/std.json Those are not strict
 requirements though, I just summarized what I remembered from 
 old
 discussions.

 I think, this even extends to access to all semi- and 
 structured-data.
 Think csv, sql nosql, you name it. Something which deserves a 
 name like
 Uniform Access. I don't want to care if data is laid out 
 differently. I
 want to define my struct or class mark the members to fill a 
 pass it to
 somebodies code and don't want to care if its xml, sql or 
 whatever.

I'm really not so sure about that kind of approach. Automatic 
serialisation I think works one of two ways. Either you have 
control over the data you're pulling in, and you can change it to 
map more easily to your data structures, or you don't and you 
have to make your data structures more ugly to fit the data 
you're pulling in. I prefer just writing functions that take 
format X and give you in-memory representation Y over automatic 
serialisation stuff. I know it's boring and easy to write 
functions like that, but why can't some things just be boring and 
easy?
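
For what it's worth, here is a minimal sketch of such a boring function,
written in Python with the standard xml.etree.ElementTree module rather
than in D. The Book type and the <book> document layout are made up for
the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Book:
    title: str
    year: int
    authors: List[str]


def book_from_xml(text: str) -> Book:
    """Map one <book> element to a Book, failing loudly on bad input."""
    elem = ET.fromstring(text)
    if elem.tag != "book":
        raise ValueError("expected <book>, got <%s>" % elem.tag)
    title = elem.findtext("title")
    if title is None:
        raise ValueError("<book> is missing <title>")
    year = elem.get("year")
    if year is None:
        raise ValueError("<book> is missing the year attribute")
    return Book(
        title=title,
        year=int(year),
        authors=[a.text or "" for a in elem.findall("author")],
    )
```

The mapping is explicit, so the in-memory type does not have to mirror
the layout of the document, and each format gets its own small function.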

This looks like a really popular topic, and it's cool that there 
seem to be quite a few implementations that are close to being 
what we want. I think we're probably not far off just lining up a 
few different implementations and reviewing them all for possible 
inclusion in phobos.

Aug 29 2013

Michel Fortin <michel.fortin michelf.ca> writes:

On 2013-08-29 07:47:17 +0000, Jonathan M Davis <jmdavisProg gmx.com> said:

 On Thursday, August 29, 2013 09:25:35 w0rp wrote:
 The general idea in my mind is
 "something SAX-like, with something a little DOM-like."

 
 What I personally think would be best is to have multiple parsers. First you
 have something STAX-like (or maybe even lower level - I don't recall exactly
 what STAX gives you at the moment) that basically tokenizes the XML and
 returns a range of that. Then SAX and DOM parsers can be built on top of that.
 That way, you get the fastest parser possible as well as higher level, more
 functional parsers.
 
 But two of the biggest points of the design are that it's going to have to be
 range-based, and it's going to need to be able to take full advantage of
 slices (when used with any strings or random-access ranges) in order to avoid
 copying any of the data. That's the key design point which will allow a D
 parser to be extremely fast in comparison to parsers in most other languages.

I wrote something like that a while ago.

It only accepted arrays as input because of the lack of a "buffered 
range" concept that'd allow lookahead and efficient slicing from any 
kind of range, but that could be retrofitted in. It implements pretty 
much all of the XML spec, except for documents having an internal 
subset (which is something a little arcane). It does not deal with 
namespaces either, I feel like that should be done a layer above, but 
I'm not entirely sure.

Lower-level parser:
http://michelf.ca/docs/d/mfr/xmltok.html

Higher-level parser built on the first one:
http://michelf.ca/docs/d/mfr/xml.html

The code:
http://michelf.ca/docs/d/mfr-xml-2010-10-19.zip

That code hasn't been compiled in a while, but it used to work very 
well for me. Feel free to use it as a starting point.

-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca

Aug 29 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 29, 2013 12:14:28 Michel Fortin wrote:
 On 2013-08-29 07:47:17 +0000, Jonathan M Davis <jmdavisProg gmx.com> said:
 On Thursday, August 29, 2013 09:25:35 w0rp wrote:
 The general idea in my mind is
 "something SAX-like, with something a little DOM-like."

 
 What I personally think would be best is to have multiple parsers. First
 you have something STAX-like (or maybe even lower level - I don't recall
 exactly what STAX gives you at the moment) that basically tokenizes the
 XML and returns a range of that. Then SAX and DOM parsers can be built on
 top of that. That way, you get the fastest parser possible as well as
 higher level, more functional parsers.
 
 But two of the biggest points of the design are that it's going to have to
 be range-based, and it's going to need to be able to take full advantage
 of slices (when used with any strings or random-access ranges) in order
 to avoid copying any of the data. That's the key design point which will
 allow a D parser to be extremely fast in comparison to parsers in most
 other languages.

 I wrote something like that a while ago.
 
 It only accepted arrays as input because of the lack of a "buffered
 range" concept that'd allow lookahead and efficient slicing from any
 kind of range, but that could be retrofitted in. It implements pretty
 much all of the XML spec, except for documents having an internal
 subset (which is something a little arcane). It does not deal with
 namespaces either, I feel like that should be done a layer above, but
 I'm not entirely sure.
 
 Lower-level parser:
 http://michelf.ca/docs/d/mfr/xmltok.html
 
 Higher-level parser built on the first one:
 http://michelf.ca/docs/d/mfr/xml.html
 
 The code:
 http://michelf.ca/docs/d/mfr-xml-2010-10-19.zip
 
 That code hasn't been compiled in a while, but it used to work very
 well for me. Feel free to use as a starting point.

Cool. I started looking at implementing something like that a while back but
really didn't have time to get very far. But if we really care about efficiency,
I think that that's the basic approach that we need to take. However, the trick
as always is someone having the time to do it. Maybe one of us can take what
you did and start from there or at least use it as an example to start from.

- Jonathan M Davis

Aug 29 2013

"ilya-stromberg" <ilya-stromberg-2009 yandex.ru> writes:

On Thursday, 29 August 2013 at 16:14:28 UTC, Michel Fortin wrote:
 I wrote something like that a while ago.

 It only accepted arrays as input because of the lack of a 
 "buffered range" concept that'd allow lookahead and efficient 
 slicing from any kind of range, but that could be retrofitted 
 in. It implements pretty much all of the XML spec, except for 
 documents having an internal subset (which is something a 
 little arcane). It does not deal with namespaces either, I feel 
 like that should be done a layer above, but I'm not entirely 
 sure.

 Lower-level parser:
 http://michelf.ca/docs/d/mfr/xmltok.html

 Higher-level parser built on the first one:
 http://michelf.ca/docs/d/mfr/xml.html

 The code:
 http://michelf.ca/docs/d/mfr-xml-2010-10-19.zip

 That code hasn't been compiled in a while, but it used to work 
 very well for me. Feel free to use as a starting point.

Can you push it to GitHub, please?

Sep 03 2013

Michel Fortin <michel.fortin michelf.ca> writes:

On 2013-09-03 16:11:37 +0000, "ilya-stromberg" 
<ilya-stromberg-2009 yandex.ru> said:

 On Thursday, 29 August 2013 at 16:14:28 UTC, Michel Fortin wrote:
 
 I wrote something like that a while ago.
 
 It only accepted arrays as input because of the lack of a "buffered 
 range" concept that'd allow lookahead and efficient slicing from any 
 kind of range, but that could be retrofitted in. It implements pretty 
 much all of the XML spec, except for documents having an internal 
 subset (which is something a little arcane). It does not deal with 
 namespaces either, I feel like that should be done a layer above, but 
 I'm not entirely sure.
 
 Lower-level parser:
 http://michelf.ca/docs/d/mfr/xmltok.html
 
 Higher-level parser built on the first one:
 http://michelf.ca/docs/d/mfr/xml.html
 
 The code:
 http://michelf.ca/docs/d/mfr-xml-2010-10-19.zip
 
 That code hasn't been compiled in a while, but it used to work very 
 well for me. Feel free to use as a starting point.

 
 Can you push it to the github, please?

Good idea.

http://github.com/michelf/mfr-xml-d

Feel free to send pull requests if you want. I should be able to review them.

-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca

Sep 03 2013

Johannes Pfau <nospam example.com> writes:

Am Thu, 29 Aug 2013 09:25:35 +0200
schrieb "w0rp" <devw0rp gmail.com>:

 Hello everybody. I've been wondering, what are the current plans 
 to replace std.xml? I'd like to help with the effort to get a 
 final XML library in phobos. So, I have a few questions.
 
 First, and most importantly, what do we except out of a D XML 
 library? I'd really like to have a discussion of the form, "Here 
 is exactly the interface the structs/classes need to implement, 
 go forth and implement." The general idea in my mind is 
 "something SAX-like, with something a little DOM-like." I'm aware 
 that std.xml has some issues support different encodings, so 
 obvious that's included.

I think most points here also apply to std.xml:
http://wiki.dlang.org/Wish_list/std.json
Those are not strict requirements though, I just summarized what I
remembered from old discussions.

 Second, is there an existing library that has gotten close to 
 meeting whatever we need for the first point? If so, how far away 
 is it from being able to meet all of the requirements and become 
 the standard library version?

There's a std.xml2 in the review queue:
http://wiki.dlang.org/Review_Queue

Aug 29 2013

Robert Schadek <realburner gmx.de> writes:

On 08/29/2013 09:51 AM, Johannes Pfau wrote:
 I most points here also apply to std.xml:
 http://wiki.dlang.org/Wish_list/std.json Those are not strict
 requirements though, I just summarized what I remembered from old
 discussions.

I think this even extends to access to all semi-structured and structured data.
Think CSV, SQL, NoSQL, you name it. Something which deserves a name like
Uniform Access. I don't want to care if data is laid out differently. I
want to define my struct or class, mark the members to fill, pass it to
somebody's code and not have to care whether it's XML, SQL or whatever.

Aug 29 2013

Jacob Carlborg <doob me.com> writes:

On 2013-08-29 10:15, Robert Schadek wrote:


 I think, this even extends to access to all semi- and structured-data.
 Think csv, sql nosql, you name it. Something which deserves a name like
 Uniform Access. I don't want to care if data is laid out differently. I
 want to define my struct or class mark the members to fill a pass it to
 somebodies code and don't want to care if its xml, sql or whatever.

So you want serialization :). Which we currently are reviewing. 
Unfortunately there might be too many changes needed to get it in Phobos 
this time.

-- 
/Jacob Carlborg

Aug 29 2013

Robert Schadek <realburner gmx.de> writes:

On 08/29/2013 11:09 AM, Jacob Carlborg wrote:
 So you want serialization :). Which we currently are reviewing.
 Unfortunately there might be too many changes needed to get it in
 Phobos this time.

well, sort of, but also with partial serialization (think sql update),
more transparent interface and I want to define join results types at
compile time and and and

Aug 29 2013

"Brad Anderson" <eco gnuk.net> writes:

On Thursday, 29 August 2013 at 08:15:39 UTC, Robert Schadek wrote:
 On 08/29/2013 09:51 AM, Johannes Pfau wrote:
 I most points here also apply to std.xml:
 http://wiki.dlang.org/Wish_list/std.json Those are not strict
 requirements though, I just summarized what I remembered from 
 old
 discussions.

 I think, this even extends to access to all semi- and 
 structured-data.
 Think csv, sql nosql, you name it. Something which deserves a 
 name like
 Uniform Access. I don't want to care if data is laid out 
 differently. I
 want to define my struct or class mark the members to fill a 
 pass it to
 somebodies code and don't want to care if its xml, sql or 
 whatever.

That's a really great point.  All of these modules that can't 
know the types and structure in advance should probably all use 
the same techniques for handling the situation.  Perhaps a new 
module to unify all this stuff is in order.

I seem to recall Adam D. Ruppe's "Is this D or is this 
Javascript?" thread[1] having some nice tricks to deal with 
dynamically typed data.

1. 
http://forum.dlang.org/thread/kuxfkakrgjaofkrdvgmx forum.dlang.org

Aug 29 2013

"Brad Anderson" <eco gnuk.net> writes:

On Thursday, 29 August 2013 at 19:40:08 UTC, Brad Anderson wrote:
 That's a really great point.  All of these modules that can't 
 know the types and structure in advance should probably all use 
 the same techniques for handling the situation.  Perhaps a new 
 module to unify all this stuff is in order.

 I seem to recall Adam D. Ruppe's "Is this D or is this 
 Javascript?" thread[1] having some nice tricks to deal with 
 dynamically typed data.

 1. 
 http://forum.dlang.org/thread/kuxfkakrgjaofkrdvgmx forum.dlang.org

(or maybe just improve Variant)

Aug 29 2013

"Tobias Pankrath" <tobias pankrath.net> writes:

On Thursday, 29 August 2013 at 07:25:36 UTC, w0rp wrote:
 Hello everybody. I've been wondering, what are the current 
 plans to replace std.xml? I'd like to help with the effort to 
 get a final XML library in phobos. So, I have a few questions.

 First, and most importantly, what do we except out of a D XML 
 library? I'd really like to have a discussion of the form, 
 "Here is exactly the interface the structs/classes need to 
 implement, go forth and implement." The general idea in my mind 
 is "something SAX-like, with something a little DOM-like." I'm 
 aware that std.xml has some issues support different encodings, 
 so obvious that's included.

 Second, is there an existing library that has gotten close to 
 meeting whatever we need for the first point? If so, how far 
 away is it from being able to meet all of the requirements and 
 become the standard library version?

There is http://dsource.org/projects/xmlp, which at some point 
has been proposed for std.xml2. But that has been stalled for some time 
now.

Aug 29 2013

"ilya-stromberg" <ilya-stromberg-2009 yandex.ru> writes:

On Thursday, 29 August 2013 at 07:53:46 UTC, Tobias Pankrath 
wrote:
 There is http://dsource.org/projects/xmlp, which at some point 
 has been proposed for std.xml2. But that stalled for some time 
 now.

Also, we have Tango Xml:
https://github.com/SiegeLord/Tango-D2/tree/d2port/tango/text/xml

It's the fastest XML parser in the world, so maybe you will find 
it useful:
dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/
dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/

Aug 31 2013

Jacob Carlborg <doob me.com> writes:

On 2013-08-31 17:43, ilya-stromberg wrote:

 Also, we have Tango Xml:
 https://github.com/SiegeLord/Tango-D2/tree/d2port/tango/text/xml

 It's the fastest Xml parser in the world, so may be you can find it useful:
 dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/

 dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/

Unfortunately the Tango XML package will never end up in Phobos due to 
licensing issues.

-- 
/Jacob Carlborg

Aug 31 2013

"ilya-stromberg" <ilya-stromberg-2009 yandex.ru> writes:

On Saturday, 31 August 2013 at 18:03:10 UTC, Jacob Carlborg wrote:
 Unfortunately the Tango XML package will never end up in Phobos 
 due to licensing issues.

Yes, but we can always study the source code and pay attention to the 
design solutions.

Sep 01 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, September 01, 2013 10:02:50 ilya-stromberg wrote:
 On Saturday, 31 August 2013 at 18:03:10 UTC, Jacob Carlborg wrote:
 Unfortunately the Tango XML package will never end up in Phobos
 due to licensing issues.

 
 Yes, but we can always learn source code and put attention to the
 design solutions.

Not really. Looking at the source code effectively taints you. By doing so, you 
run the risk of being accused of copying if anything you do is similar enough. 
It's just safer to never look at source code when the license is going to make 
it so that you can't use that code.

- Jonathan M Davis

Sep 01 2013

Michel Fortin <michel.fortin michelf.ca> writes:

On 2013-08-31 15:43:00 +0000, "ilya-stromberg" 
<ilya-stromberg-2009 yandex.ru> said:

 On Thursday, 29 August 2013 at 07:53:46 UTC, Tobias Pankrath wrote:
 There is http://dsource.org/projects/xmlp, which at some point has been 
 proposed for std.xml2. But that stalled for some time now.

 
 Also, we have Tango Xml:
 https://github.com/SiegeLord/Tango-D2/tree/d2port/tango/text/xml
 
 It's the fastest Xml parser in the world, so may be you can find it useful:
 dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/
 dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/

Someone should benchmark it against the XML implementation I made. It 
has many of the same characteristics.

For instance, Tango's SaxParser is based on its PullParser. This design 
requires the use of a dynamic array to maintain a stack of opened 
elements. While not a huge performance hit, you don't need that if you 
use recursion, which you can do with my implementation. You can do that 
even though you can also use it as a pull tokenizer[^1] when needed 
(recursion is optional on a token-by-token basis).

[^1]: IMHO, PullParser isn't a really good term for something that does 
not conform to the requirements of a parser in the XML spec. Tokenizer 
is a better term.

-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca

Aug 31 2013

Jacob Carlborg <doob me.com> writes:

On 2013-08-31 20:53, Michel Fortin wrote:

 [^1]: IMHO, PullParser isn't a really good term for something that does
 not conform to the requirements of a parser in the XML spec. Tokenizer
 is a better term.

I guess "Pull" is the key here. That it is the client's responsibility 
to fetch the next token, not the other way around.

-- 
/Jacob Carlborg

Aug 31 2013

"qznc" <qznc web.de> writes:

On Saturday, 31 August 2013 at 18:53:42 UTC, Michel Fortin wrote:
 On 2013-08-31 15:43:00 +0000, "ilya-stromberg" 
 <ilya-stromberg-2009 yandex.ru> said:

 On Thursday, 29 August 2013 at 07:53:46 UTC, Tobias Pankrath 
 wrote:
 There is http://dsource.org/projects/xmlp, which at some 
 point has been proposed for std.xml2. But that stalled for 
 some time now.

 
 Also, we have Tango Xml:
 https://github.com/SiegeLord/Tango-D2/tree/d2port/tango/text/xml
 
 It's the fastest Xml parser in the world, so may be you can 
 find it useful:
 dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/
 dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/

 Someone should benchmark it against the XML implementation I 
 made. It has many of the same characteristics.

 For instance, Tango's SaxParser is based on its PullParser. 
 This design requires the use a dynamic array to maintain a 
 stack of opened elements. While not a huge performance hit, you 
 don't need that if you use recursion, which you can do with my 
 implementation. You can do that even though you can also use it 
 as a pull tokenizer[^1] when needed (recursion is optional on a 
 token-by-token basis).

Recursion means you use the call stack instead of a stack object on 
the heap.

Be careful about nesting depth. There are XML documents out 
there with thousands or more nested elements. With recursion on 
a 32-bit machine you might get a stack overflow, but a heap-stack 
could handle a million nested elements.
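
A minimal sketch of the difference, in Python rather than D and with a
toy tag grammar (no attributes, comments or self-closing tags). Both
function names are invented for the example. The first version keeps
open elements in a heap-allocated list; the second uses one call frame
per open element.

```python
import re

# Toy grammar: <name> opens an element, </name> closes it.
_TAG_RE = re.compile(r"<(/?)([^<>/]+)>")


def max_depth_iterative(xml: str) -> int:
    """Track open elements in a list on the heap, not on the call stack."""
    stack = []
    deepest = 0
    for m in _TAG_RE.finditer(xml):
        closing, name = m.groups()
        if closing:
            if not stack or stack.pop() != name:
                raise ValueError("mismatched closing tag: " + name)
        else:
            stack.append(name)
            deepest = max(deepest, len(stack))
    if stack:
        raise ValueError("unclosed tag: " + stack[-1])
    return deepest


def max_depth_recursive(xml: str) -> int:
    """Same result for well-formed input, but one call frame per open element."""
    tags = _TAG_RE.finditer(xml)

    def element(depth: int) -> int:
        deepest = depth
        for m in tags:  # all frames share one iterator over the tags
            closing, _name = m.groups()
            if closing:
                return deepest
            deepest = max(deepest, element(depth + 1))
        return deepest

    return element(0)
```

With CPython's default recursion limit, the recursive version raises
RecursionError on a document nested a few thousand levels deep, while
the iterative one is limited only by available memory. In a compiled
language the same situation shows up as a stack overflow instead.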

Sep 02 2013

Michel Fortin <michel.fortin michelf.ca> writes:

On 2013-09-02 13:34:18 +0000, "qznc" <qznc web.de> said:

 On Saturday, 31 August 2013 at 18:53:42 UTC, Michel Fortin wrote:
 For instance, Tango's SaxParser is based on its PullParser. This design 
 requires the use a dynamic array to maintain a stack of opened 
 elements. While not a huge performance hit, you don't need that if you 
 use recursion, which you can do with my implementation. You can do that 
 even though you can also use it as a pull tokenizer[^1] when needed 
 (recursion is optional on a token-by-token basis).

 
 Recursion means you use the call stack instead of stack object on the heap.
 
 Be careful about nesting deepness. There are XML documents out there 
 with thousands and more nested elements. With recursion on a 32bit 
 machine you might get a stack overflow, but a heap-stack could handle a 
 million nested elements.

Good point about caring for pathological cases.

-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca

Sep 02 2013

Richard Webb <richard.webb boldonjames.com> writes:

On 31/08/2013 16:43, ilya-stromberg wrote:
 It's the fastest Xml parser in the world, so may be you can find it useful:
 dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/

 dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/


Has anyone done any benchmarks recently to see if that is still the case?


I did some (admittedly brief) tests last year and found that xmlp was 
actually faster at building large XML docs into a DOM. There have been 
lots of changes since then, so I don't know if that is still the case.

Sep 02 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 8/29/13 12:25 AM, w0rp wrote:
 Hello everybody. I've been wondering, what are the current plans to
 replace std.xml? I'd like to help with the effort to get a final XML
 library in phobos. So, I have a few questions.

 First, and most importantly, what do we except out of a D XML library?
 I'd really like to have a discussion of the form, "Here is exactly the
 interface the structs/classes need to implement, go forth and
 implement." The general idea in my mind is "something SAX-like, with
 something a little DOM-like." I'm aware that std.xml has some issues
 support different encodings, so obvious that's included.

 Second, is there an existing library that has gotten close to meeting
 whatever we need for the first point? If so, how far away is it from
 being able to meet all of the requirements and become the standard
 library version?

I don't know much about XML, but I noticed there are a few popular 
libraries and models. I'd expect a replacement for std.xml would choose 
one of these popular models that is most appropriate for D.


Andrei

Aug 29 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 29, 2013 14:27:22 H. S. Teoh wrote:
 Right, that's why I said the core of std.xml should handle everything as
 bytes, only specially treating the ASCII values of <, >, &, and other
 metacharacters. The tagname and tag body should just be a range over
 segments of the input.

That works especially well with how Michel and I were thinking it should be 
split up with a core that essentially just gives you a range of XML 
tokens/tags. You then have separate SAX and/or DOM parsers on top of that 
(which also should minimize decoding, but they actually have to care about 
decoding in some cases in order to do stuff like check matching tags).
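
A rough sketch of that layering, in Python rather than D and with a toy
grammar (element names and text only, no attributes or entities). The
names tokens, sax and dom are invented for the example and do not
correspond to any existing library. The lowest layer is a lazy stream
of tokens, the SAX-like layer pushes them to a callback and is the first
place that has to compare tag names, and the DOM-like layer is built on
top of that.

```python
import re
from typing import Callable, Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, value); kind is "open", "close" or "text"

_TOKEN_RE = re.compile(r"<(/?)([^<>]+)>|([^<]+)")


def tokens(xml: str) -> Iterator[Token]:
    """Lowest layer: a lazy stream of tokens, with no nesting checks."""
    for m in _TOKEN_RE.finditer(xml):
        slash, name, text = m.groups()
        if text is not None:
            yield "text", text
        else:
            yield ("close" if slash else "open"), name


def sax(xml: str, on_event: Callable[[str, str], None]) -> None:
    """SAX-like layer: push each token to a callback and check tag matching."""
    open_tags: List[str] = []
    for kind, value in tokens(xml):
        if kind == "open":
            open_tags.append(value)
        elif kind == "close":
            if not open_tags or open_tags.pop() != value:
                raise ValueError("mismatched closing tag: " + value)
        on_event(kind, value)
    if open_tags:
        raise ValueError("unclosed tag: " + open_tags[-1])


def dom(xml: str) -> dict:
    """DOM-like layer: build a tree of dicts on top of the SAX layer."""
    root = {"name": None, "children": []}
    stack = [root]

    def on_event(kind: str, value: str) -> None:
        if kind == "open":
            node = {"name": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "close":
            stack.pop()
        else:
            stack[-1]["children"].append(value)

    sax(xml, on_event)
    return root
```

A caller who only needs raw tokens pays for nothing above the first
function; the well-formedness check and the tree are opt-in.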

- Jonathan M Davis

Aug 29 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 8/29/2013 12:25 AM, w0rp wrote:
 Hello everybody. I've been wondering, what are the current plans to replace
 std.xml? I'd like to help with the effort to get a final XML library in phobos.
 So, I have a few questions.

 First, and most importantly, what do we except out of a D XML library? I'd
 really like to have a discussion of the form, "Here is exactly the interface the
 structs/classes need to implement, go forth and implement." The general idea in
 my mind is "something SAX-like, with something a little DOM-like." I'm aware
 that std.xml has some issues support different encodings, so obvious that's
 included.

 Second, is there an existing library that has gotten close to meeting whatever
 we need for the first point? If so, how far away is it from being able to meet
 all of the requirements and become the standard library version?

The Tango implementation of XML has been very well received. I haven't looked at
it, but it was designed to do no memory allocation - it just did slices over the
input.

I don't believe it should make any attempt at decoding. Decoding entails both 
performance loss and memory consumption. If the user wants to do decoding, they 
can layer it on the output.
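
A minimal sketch of what layering the decoding on the output could look
like, in Python rather than D. The RawText class is invented for the
example; unescape is the standard xml.sax.saxutils helper, which expands
&amp;amp;, &amp;lt; and &amp;gt;. The node stores only a zero-copy view of the
input, and the copy, the charset decoding and the entity expansion
happen when the caller asks for them.

```python
from xml.sax.saxutils import unescape


class RawText:
    """A text node that keeps the raw bytes and decodes only on request."""

    __slots__ = ("raw",)

    def __init__(self, raw: memoryview) -> None:
        self.raw = raw  # zero-copy view into the parser's input buffer

    def decoded(self, encoding: str = "utf-8") -> str:
        # The copy, the charset decoding and the entity expansion all
        # happen here, and only for nodes the caller actually reads.
        return unescape(bytes(self.raw).decode(encoding))


buf = b"<t>a &amp; b</t>"
node = RawText(memoryview(buf)[3:12])  # the bytes between <t> and </t>
text = node.decoded()
```

Nodes that are never read cost nothing beyond the slice itself.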

And lastly, it should of course sport a range interface.

Aug 31 2013

Lionello Lunesu <lionello lunesu.remove.com> writes:

On 8/29/13 15:25, w0rp wrote:
 Hello everybody. I've been wondering, what are the current plans to
 replace std.xml? I'd like to help with the effort to get a final XML
 library in phobos. So, I have a few questions.

 First, and most importantly, what do we except out of a D XML library?
 I'd really like to have a discussion of the form, "Here is exactly the
 interface the structs/classes need to implement, go forth and
 implement." The general idea in my mind is "something SAX-like, with
 something a little DOM-like." I'm aware that std.xml has some issues
 support different encodings, so obvious that's included.

 Second, is there an existing library that has gotten close to meeting
 whatever we need for the first point? If so, how far away is it from
 being able to meet all of the requirements and become the standard
 library version?

Having been the lead programmer on the Microsoft XML team for three 
years, I can easily say that the most popular XML API [on MS stack] is 
the XmlReader and XLinq in .NET. (This has nothing to do with LINQ, by 
the way.)

I'd be willing to help make D versions of that, but my time is limited. 
But as usual, I don't think it's the actual coding that will take time. 
Designing a good interface is the hardest part, and I'd consider that 
part done.

L.

Sep 02 2013

Peter Williams <pwil3058 bigpond.net.au> writes:

On 03/09/13 12:40, Lionello Lunesu wrote:
 On 8/29/13 15:25, w0rp wrote:
 Hello everybody. I've been wondering, what are the current plans to
 replace std.xml? I'd like to help with the effort to get a final XML
 library in phobos. So, I have a few questions.

 First, and most importantly, what do we expect out of a D XML library?
 I'd really like to have a discussion of the form, "Here is exactly the
 interface the structs/classes need to implement, go forth and
 implement." The general idea in my mind is "something SAX-like, with
 something a little DOM-like." I'm aware that std.xml has some issues
 supporting different encodings, so obviously that's included.

 Second, is there an existing library that has gotten close to meeting
 whatever we need for the first point? If so, how far away is it from
 being able to meet all of the requirements and become the standard
 library version?

 Having been the lead programmer on the Microsoft XML team for three
 years, I can easily say that the most popular XML API [on MS stack] is
 the XmlReader and XLinq in .NET. (This has nothing to do with LINQ, by
 the way.)

 I'd be willing to help make D versions of that, but my time is limited.
 But as usual, I don't think it's the actual coding that will take time.
 Designing a good interface is the hardest part, and I'd consider that
 part done.

 L.

For whoever ends up doing std.xml's replacement, it would be good if 
some of the lower level interfaces such as encode/decode (for 
escaping/unescaping within text) were exposed.  I'm finding the ones in 
std.xml useful for implementing markup in label widgets during my 
investigation into reimplementing the GTK+ (modified) interface in D.
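
As an illustration of the kind of low-level helper being asked for, here is
a minimal sketch in D of escaping and unescaping the five predefined XML
entities. escapeText and unescapeText are hypothetical names, not std.xml's
API, and the sketch deliberately ignores numeric character references such
as &#65;.

```d
/// Hypothetical helper: replace the five XML-special characters in `s`
/// with their predefined entity references.
string escapeText(const(char)[] s)
{
    string result;
    foreach (char c; s) // code units; multi-byte UTF-8 passes through as-is
    {
        switch (c)
        {
            case '&':  result ~= "&amp;";  break;
            case '<':  result ~= "&lt;";   break;
            case '>':  result ~= "&gt;";   break;
            case '"':  result ~= "&quot;"; break;
            case '\'': result ~= "&apos;"; break;
            default:   result ~= c;        break;
        }
    }
    return result;
}

/// Hypothetical helper: the inverse of escapeText. Unknown or unterminated
/// references are copied through unchanged rather than rejected.
string unescapeText(const(char)[] s)
{
    static immutable string[5] names =
        ["&amp;", "&lt;", "&gt;", "&quot;", "&apos;"];
    static immutable char[5] chars = ['&', '<', '>', '"', '\''];

    string result;
    size_t i = 0;
    scan: while (i < s.length)
    {
        if (s[i] == '&')
        {
            foreach (k, name; names)
            {
                if (s.length - i >= name.length
                    && s[i .. i + name.length] == name)
                {
                    result ~= chars[k];
                    i += name.length;
                    continue scan; // resume after the consumed reference
                }
            }
        }
        result ~= s[i];
        ++i;
    }
    return result;
}

void main()
{
    import std.stdio : writeln;

    auto label = escapeText("a < b & c");
    writeln(label);
    writeln(unescapeText(label));
}
```

Keeping the two functions free-standing, rather than buried inside a parser
type, is what would let markup code like a label widget reuse them directly.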

Of course, there's always the chance that the new D XML API will provide 
enough to make my markup code redundant.  It's possible that the current 
high level APIs in std.xml also provide enough to make my work redundant, 
but I decided not to investigate this possibility after I saw the "will 
be replaced by something different" warning.

Cheers
Peter

Sep 02 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/2/13 7:40 PM, Lionello Lunesu wrote:
 Having been the lead programmer on the Microsoft XML team for three
 years, I can easily say that the most popular XML API [on MS stack] is
 the XmlReader and XLinq in .NET. (This has nothing to do with LINQ, by
 the way.)

 I'd be willing to help make D versions of that, but my time is limited.
 But as usual, I don't think it's the actual coding that will take time.
 Designing a good interface is the hardest part, and I'd consider that
 part done.

This is great info, thanks.

Andrei

Sep 03 2013
