digitalmars.D.learn - Learning to XML with D

"Derix" <derix dexample.com> writes:
So, I set sail to transform a bunch of HTML files with D. This, 
of course, will happen with the std.xml library.

There is this nice example :
http://dlang.org/phobos/std_xml.html#.DocumentParser
that I put to some use already, however some of the basics seem 
to escape me, especially in lines like

     xml.onEndTag["author"] = (in Element e) { book.author = e.text(); };

OK, we're doing some event-based parsing, reacting with a lambda 
function on encountering so-and-so tag, à la SAX. (Are we?)

What I don't quite grasp is the construct (in Element e), 
especially the *in* part.

Is it *in* as in http://dlang.org/expression.html#InExpression ? 
In which case I fail to see what associative array we're 
considering.

It's probably more a way to further qualify the argument e we're 
passing to the λ-function: could someone elaborate on that?

Of course, it is entirely possible that I completely miss the 
point and that I'm overlooking some fundamentals, if so have 
mercy and help me find my way back to the righteous path ;-)


Thxxx
Feb 06 2015
"Chris" <wendlec tcd.ie> writes:
On Friday, 6 February 2015 at 09:15:54 UTC, Derix wrote:
 So, I set sail to transform a bunch of HTML files with D. 
 This, of course, will happen with the std.xml library.
The documentation says: "Warning: This module is considered out-dated and not up to Phobos' current standards. It will remain until we have a suitable replacement, but be aware that it will not remain long term."

My advice is not to use it. I used it a while back, but it slowed down my system (I still don't know why), and it is permanently soon-to-be deprecated. If you wanna use D for XML parsing, see if you can find a solid 3rd party library in D (have a look at Adam's github page: https://github.com/adamdruppe/, he has some DOM and HTML stuff up there).

There is a new xml module in the review queue, but nobody seems to care. I _think_ the reason why nobody really cares is that most people in the D community don't like XML.
Feb 06 2015
"Marc Schütz" <schuetzm gmx.net> writes:
On Friday, 6 February 2015 at 11:39:32 UTC, Chris wrote:
 If you wanna use D for XML parsing, see if you can find a solid 
 3rd party library in D (have a look at Adam's github page: 
 https://github.com/adamdruppe/, he has some DOM and HTML stuff 
 up there).
Another place to look is http://code.dlang.org/ , which contains packages usable with DUB. There you can find KXML, for example: http://code.dlang.org/packages/kxml
 There is a new xml module in the review queue, but nobody seems 
 to care. I _think_ the reason why nobody really cares is that 
 most people in the D community don't like XML.
:-P I think the reason is simply that someone has to do the actual work of pushing things forward. And to make matters worse, std.xml2 is marked as abandoned, so it would first have to be brought back into form before it can even be submitted.
Feb 06 2015
"Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 6 February 2015 at 11:39:32 UTC, Chris wrote:
 If you wanna use D for XML parsing, see if you can find a solid 
 3rd party library in D (have a look at Adam's github page: 
 https://github.com/adamdruppe/, he has some DOM and HTML stuff 
 up there).
Yeah, if you're used to DOM work in JavaScript, my dom.d works in a familiar way - it offers similar attributes and methods, uses CSS selector syntax if you want, etc. You can download just that one file, then build your program like "dmd yourfile.d dom.d" and it should just work; it has no outside dependencies.

Mine can do almost any xml, but the out of the box experience is focused on HTML. When combined with my characterencodings.d from the same repo, it can handle most web pages too, making it useful for scraping html sites.
 There is a new xml module in the review queue, but nobody seems 
 to care. I _think_ the reason why nobody really cares is that 
 most people in the D community don't like XML.
I don't have a problem with xml.... but my own lib Works For Me (tm) so I don't personally care much about what is or isn't in phobos....
Feb 06 2015
"Derix" <derix dexample.com> writes:
 my dom.d works in a familiar way
OK, will check it
 useful for scraping html sites.
Not exactly what I'm doing, but close. I'm in the midst of a self-training spree, and what I use as test-tube fodder is the following: a collection of 300+ html files constituting an electronic version of a technical book. My intent is to generate a clickable table of contents, by parsing the files for css styles specific to section headers.

The first leg of the journey was to normalize styles across the bunch. That is done, more or less. I already have a proto-toc, but it's not entirely satisfying: it lacks handles for proper styling, and the way I arrived there is kinda brutish.

One hurdle I haven't overcome yet is that the text content, and the section headers themselves, contain some html tags (well, the book /is/ about html, among other things). For example, some section headers are rendered as two bold lines, with a fat <br/> in the middle, and <b></b> around. So when I parse the payload of the <p> element, I end up with some &lt;br/&gt; in the middle of a sentence. Survivable, but unclean.

So yeah, I'll give it another try with your dom.d
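
For what it's worth, the "tags inside a header" hurdle can be sketched without any XML library at all. The snippet below is only a rough illustration, assuming the inner markup of a header is already available as a string: `flattenHeading` is a made-up helper name, the sample header text is invented, and a regex is a crude tool for HTML, so a real DOM parser will be more robust.

```d
import std.regex : regex, replaceAll;
import std.stdio : writeln;
import std.string : strip;

// Hypothetical helper: turns the inner markup of a header
// into a single line of plain text for a table-of-contents entry.
string flattenHeading(string markup)
{
    // Replace every tag with a space, so "A<br/>B" does not become "AB".
    auto noTags = replaceAll(markup, regex(`<[^>]*>`), " ");

    // Collapse the whitespace runs left behind and trim both ends.
    return replaceAll(noTags, regex(`\s+`), " ").strip;
}

void main()
{
    // Invented sample: a two-line bold header with a <br/> in the middle.
    writeln(flattenHeading("<b>Block elements<br/>and inline elements</b>"));
}
```

Swapping each tag for a space rather than deleting it is deliberate: it keeps the two lines of such a header from being glued together into one word.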
Feb 09 2015
"CraigDillabaugh" <craig.dillabaugh gmail.com> writes:
On Friday, 6 February 2015 at 11:39:32 UTC, Chris wrote:
 There is a new xml module in the review queue, but nobody seems 
 to care. I _think_ the reason why nobody really cares is that 
 most people in the D community don't like XML.
I added XML to the GSoC ideas page (see Phobos section), but it still needs a mentor. Are you busy this summer? http://wiki.dlang.org/GSOC_2015_Ideas#Phobos:_D_Standard_Library
Feb 06 2015
"CraigDillabaugh" <craig.dillabaugh gmail.com> writes:
On Friday, 6 February 2015 at 14:09:51 UTC, CraigDillabaugh wrote:
 I added XML to the GSoC ideas page (see Phobos section), but it 
 still needs a mentor. Are you busy this summer? 
 http://wiki.dlang.org/GSOC_2015_Ideas#Phobos:_D_Standard_Library
Just for the record, I hate XML too, but it is VERY widely used, so good XML support is essential ... like it or not!
Feb 06 2015
"Chris" <wendlec tcd.ie> writes:
On Friday, 6 February 2015 at 14:11:19 UTC, CraigDillabaugh wrote:
 I added XML to the GSoC ideas page (see Phobos section), but it 
 still needs a mentor. Are you busy this summer?
 Just for the record, I hate XML too, but it is VERY widely used, 
 so good XML support is essential ... like it or not!
You're right of course. It is widely (and wildly) used. I for my part have changed my input files from XML to a simpler custom format.

PS I am busy this summer. But maybe Adam's dom.d can be used as a basis for a new module; unlike std.xml2, it's not abandoned.
Feb 06 2015
"CraigDillabaugh" <craig.dillabaugh gmail.com> writes:
On Friday, 6 February 2015 at 14:15:44 UTC, Chris wrote:
 But maybe Adam's dom.d can be used as a basis for a new module; 
 unlike std.xml2, it's not abandoned.
Thanks for the tip. I may add a reference there!
Feb 06 2015
"Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 6 February 2015 at 09:15:54 UTC, Derix wrote:
 OK, we're doing some event-based parsing, reacting with a lambda 
 function on encountering so-and-so tag, à la SAX. (Are we?)
yeah
 What I don't quite grasp is the construct (in Element e), 
 especially the *in* part.
Function parameters in D can be qualified as in or out, optionally: http://dlang.org/function.html#parameters

(in Element e) means you are taking an argument of type Element that you only intend to take in to look at. An "in" parameter is const and you are not supposed to store a reference to it.

So basically, `in` on a function parameter means "look, don't touch".
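
To make the two meanings of `in` concrete, here is a minimal sketch. It deliberately does not use std.xml: `Item`, `handlers` and `collectAuthor` are made-up names standing in for Element and the onEndTag table, so take it as an illustration under those assumptions rather than the library's actual API.

```d
import std.stdio : writeln;

// Hypothetical stand-in for std.xml's Element (not the real class).
class Item
{
    private string payload;
    this(string payload) { this.payload = payload; }

    // A const method, so it can be called through an `in` (const) reference.
    string text() const { return payload; }
}

// Hypothetical helper: registers one handler, fires it once,
// and returns whatever the handler captured.
string collectAuthor(string payload)
{
    string author;

    // Handlers keyed by tag name, loosely modelled on the onEndTag table.
    void delegate(in Item)[string] handlers;

    // `in` as a parameter storage class: e is a read-only view of the argument.
    handlers["author"] = (in Item e) {
        author = e.text();
        // e.payload = "changed";   // would not compile: e is const
    };

    // `in` as an operator: looks up a key in the associative array and
    // yields a pointer to the value, or null if the key is absent.
    if (auto handler = "author" in handlers)
        (*handler)(new Item(payload));

    return author;
}

void main()
{
    writeln(collectAuthor("Walter Bright"));
}
```

The same keyword shows up in both roles here: in the parameter list it restricts what the handler may do with `e`, while `"author" in handlers` is the associative-array lookup from the InExpression page.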
Feb 06 2015
"Derix" <derix dexample.com> writes:
 What I don't quite grasp is the construct (in Element e), 
 especially the *in* part.
Function parameters in D can be qualified as in or out, optionally:
But of course. Actually I kinda found out just a little while after posting the question. Asking questions is a great way to figure out the answer, so thank you for reading mine ;-)

Thank you for your answer too, which consolidates my guess and makes me think I still have some thinking to do about the life of a function parameter. I was a bit puzzled too as to where the "Element e" comes from, and how it is that it's already instantiated and all. Well, I've just found the relevant part of the documentation.

To be honest, said documentation is not always easy to navigate or to decipher. I sense some potential for progress here.
Feb 09 2015
"Arjan" <arjan ask.me.to> writes:
On Friday, 6 February 2015 at 09:15:54 UTC, Derix wrote:
 So, I set sail to transform a bunch of HTML files with D. 
 This, of course, will happen with the std.xml library.
Maybe, when you're on windows, you could use msxml6 through COM. You have DOM, SAX, Xpath 1.0 and XSLT at your disposal.
Feb 07 2015