digitalmars.D.learn - html fetcher/parser

Faux Amis <faux amis.com> writes:
I would like to get into D again by making a small program which fetches 
a website every X-time and keeps track of all changes within specified 
dom elements.

fetching: should I go for std curl, vibe.d or something else?
parsing: I could only find these dub packages: htmld & libdominator.
And they don't seem overly active, any recommendations?

As I haven't been using D for some time I just don't want to get off 
with a bad start :)
thx
Aug 12 2017
Adam D. Ruppe <destructionator gmail.com> writes:
On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
 I would like to get into D again by making a small program 
 which fetches a website every X-time and keeps track of all 
 changes within specified dom elements.
My dom.d and http2.d combine to make this easy:

https://github.com/adamdruppe/arsd/blob/master/dom.d
https://github.com/adamdruppe/arsd/blob/master/http2.d

and a support file for random encodings:

https://github.com/adamdruppe/arsd/blob/master/characterencodings.d

Or via dub: http://code.dlang.org/packages/arsd-official (the dom and http subpackages are the ones you want).

Docs: http://dpldocs.info/arsd.dom

Sample program:

---
// compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}
import std.stdio;
import arsd.dom;

void main() {
    auto document = Document.fromUrl("https://dlang.org/");
    writeln(document.optionSelector("p").innerText);
}
---

Output:

D is a general-purpose programming language with static typing, systems-level access, and C-like syntax. It combines efficiency, control and modeling power with safety and programmer productivity.

Note that the https support requires OpenSSL to be available on your system. It works best on Linux with OpenSSL installed as a devel lib (so like openssl-devel or whatever, just like you would if using it from C).

How it works:

Document.fromUrl uses the http lib to fetch the page, then automatically parses the contents as a dom document. It will correct for common errors in webpage markup, character sets, etc.

Document and Element both have various methods for navigating, modifying, and accessing the DOM tree. Here, I used `optionSelector`, which works like `querySelector` in Javascript (and the same syntax is used for CSS), returning the first matching element.

querySelector, however, returns null if there is nothing found. optionSelector returns a dummy object instead, so you don't have to explicitly test it for null and can instead just access its methods.

`innerText` returns the text inside, stripped of markup. You might also want `innerHTML`, or `toString` to get the whole thing, markup and all.

There's a lot more you can do too, but I think just these few functions will be enough for your task.
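For the change-tracking half of the task, one possible approach (a sketch only, not something taken from arsd) is to store the inner text of each watched selector after every fetch and compare the new snapshot with the previous one. The function name `changedSelectors` and the selector-to-text maps are hypothetical; the texts are assumed to have been collected with something like `document.optionSelector(selector).innerText`. Only Phobos is used here.

```d
import std.algorithm.sorting : sort;

/// Compares two snapshots (CSS selector => inner text) and returns the
/// selectors whose text was added, removed, or changed, sorted by name.
/// Hypothetical helper for illustration; not part of arsd or Phobos.
string[] changedSelectors(string[string] previous, string[string] current)
{
    string[] changed;

    // Selectors that are new or whose text differs from the last snapshot.
    foreach (selector, text; current)
    {
        auto old = selector in previous;
        if (old is null || *old != text)
            changed ~= selector;
    }

    // Selectors that existed before but are missing now.
    foreach (selector; previous.byKey)
    {
        if (selector !in current)
            changed ~= selector;
    }

    sort(changed);
    return changed;
}
```

Calling this once per polling interval and then replacing the stored snapshot with the current one gives a list of the elements worth reporting.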
Bonus fact: http://dpldocs.info/experimental-docs/std.algorithm.comparison.levenshteinDistanceAndPath.1.html that function from the standard library makes doing a diff display of before and after pretty simple....
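A minimal sketch of such a diff display, assuming word-level granularity is good enough for the watched text. `wordDiff` is a made-up helper name; only `levenshteinDistanceAndPath` and `EditOp` come from Phobos. The edit path is walked in order, with each operation consuming a word from the old text, the new text, or both.

```d
import std.algorithm.comparison : EditOp, levenshteinDistanceAndPath;
import std.array : split;

/// Returns a word-level diff: unchanged words are prefixed with two spaces,
/// removed words with "- ", and added words with "+ ".
/// Hypothetical helper for illustration.
string[] wordDiff(string before, string after)
{
    auto oldWords = before.split(); // split on whitespace
    auto newWords = after.split();

    // result[0] is the edit distance, result[1] the list of edit operations.
    auto result = levenshteinDistanceAndPath(oldWords, newWords);

    string[] lines;
    size_t i, j; // positions in oldWords and newWords
    foreach (op; result[1])
    {
        final switch (op)
        {
            case EditOp.none:
                lines ~= "  " ~ oldWords[i++];
                j++;
                break;
            case EditOp.substitute:
                lines ~= "- " ~ oldWords[i++];
                lines ~= "+ " ~ newWords[j++];
                break;
            case EditOp.insert:
                lines ~= "+ " ~ newWords[j++];
                break;
            case EditOp.remove:
                lines ~= "- " ~ oldWords[i++];
                break;
        }
    }
    return lines;
}
```

For example, `wordDiff("a b c", "a x c")` yields the four entries `"  a"`, `"- b"`, `"+ x"`, `"  c"`. For long pages the quadratic cost of the edit-distance matrix may matter, so diffing only the watched elements rather than the whole document is probably the safer choice.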
Aug 12 2017
Michael <michael toohuman.io> writes:
On Saturday, 12 August 2017 at 20:22:44 UTC, Adam D. Ruppe wrote:
 On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
 [...]
My dom.d and http2.d combine to make this easy: https://github.com/adamdruppe/arsd/blob/master/dom.d https://github.com/adamdruppe/arsd/blob/master/http2.d [...]
Sometimes it feels like there's the standard D library, Phobos, and then for everything else you have already developed a suitable library to supplement it haha!
Aug 12 2017
Faux Amis <faux amis.com> writes:
On 2017-08-12 22:22, Adam D. Ruppe wrote:
 On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
 [...]
[...]
---
// compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}
import std.stdio;
import arsd.dom;

void main() {
    auto document = Document.fromUrl("https://dlang.org/");
    writeln(document.optionSelector("p").innerText);
}
---
Nice!
 [...]
 Document.fromUrl uses the http lib to fetch it, then automatically parse 
 the contents as a dom document. It will correct for common errors in 
 webpage markup, character sets, etc.
Just curious, but is there a spec of sorts which defines which errors should be fixed and such?
 [...] 
 Bonus fact: 
 http://dpldocs.info/experimental-docs/std.algorithm.comparison.levenshteinDistanceAndPath.1.html 
 that function from the standard library makes doing a diff display of 
 before and after pretty simple....
Thanks for the pointer!
Aug 13 2017
Adam D. Ruppe <destructionator gmail.com> writes:
On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
 Just curious, but is there a spec of sorts which defines which 
 errors should be fixed and such?
The HTML5 spec describes how you are supposed to parse various things, including the recovery paths for broken markup. My module, however, isn't so formal. I just used it for a web scraping thing at work that hit a few hundred sites and fixed bugs as they came up to give good enough results for me.... (one thing I found is a lot of sites claiming to be UTF-8 are actually latin-1, so it validates and falls back to handle that. My http thing, while buggier, is similar - I hit a server once that ignored the accept gzip header and always sent it anyway, so I had to handle that... and I noticed curl actually didn't!) So on the one hand, there's surely still bugs and weird cases, but on the other hand, it did get a fair chunk of real-world use so I am fairly confident it will be ok for most things.
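The "validate and fall back" idea for mislabeled pages can be sketched with Phobos alone. This is an illustration under stated assumptions, not the code arsd actually uses: `decodeBody` is a hypothetical name, and the fallback treats every byte as a Latin-1 character, which maps directly onto the Unicode code points U+0000 to U+00FF.

```d
import std.utf : UTFException, validate;

/// Interprets raw response bytes as UTF-8 if they are valid UTF-8,
/// otherwise assumes Latin-1 and transcodes to UTF-8.
/// Hypothetical helper for illustration.
string decodeBody(const(ubyte)[] raw)
{
    auto asText = cast(const(char)[]) raw;
    try
    {
        validate(asText); // throws UTFException on malformed UTF-8
        return asText.idup;
    }
    catch (UTFException)
    {
        string converted;
        foreach (b; raw)
            converted ~= cast(dchar) b; // appending a dchar encodes it as UTF-8
        return converted;
    }
}
```

The bytes `[0x63, 0x61, 0x66, 0xE9]` are not valid UTF-8, so they go through the fallback and come back as "café", while input that is already valid UTF-8 is returned unchanged. Valid UTF-8 that happens to be mislabeled Latin-1 cannot be detected this way, so this is a heuristic rather than a guarantee.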
Aug 13 2017
Faux Amis <faux amis.com> writes:
On 2017-08-13 19:51, Adam D. Ruppe wrote:
 On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
 Just curious, but is there a spec of sorts which defines which errors 
 should be fixed and such?
The HTML5 spec describes how you are supposed to parse various things, including the recovery paths for broken markup. My module, however, isn't so formal. I just used it for a web scraping thing at work that hit a few hundred sites and fixed bugs as they came up to give good enough results for me.... (one thing I found is a lot of sites claiming to be UTF-8 are actually latin-1, so it validates and falls back to handle that. My http thing, while buggier, is similar - I hit a server once that ignored the accept gzip header and always sent it anyway, so I had to handle that... and I noticed curl actually didn't!) So on the one hand, there's surely still bugs and weird cases, but on the other hand, it did get a fair chunk of real-world use so I am fairly confident it will be ok for most things.
Sounds good! (Although following the spec would be the first step to a D html layout engine :D )
Aug 14 2017
Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 14 August 2017 at 23:15:13 UTC, Faux Amis wrote:
 (Although following the spec would be the first step to a D 
 html layout engine :D )
Oh, I've actually done some of that before too. https://github.com/adamdruppe/arsd/blob/master/htmlwidget.d It is pretty horrible... but managed to render my old homepage which used css float, boxes, and basic tables. I don't know if it still compiles, I haven't even tried it for years.
Aug 14 2017
Soulsbane <paul acheronsoft.com> writes:
On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
 I would like to get into D again by making a small program 
 which fetches a website every X-time and keeps track of all 
 changes within specified dom elements.

 fetching: should I go for std curl, vibe.d or something else?
 parsing: I could only find these dub packages: htmld & 
 libdominator.
 And they don't seem overly active, any recommendations?

 As I haven't been using D for some time I just don't want to 
 get off with a bad start :)
 thx
I've found the requests module nice to work with: http://code.dlang.org/packages/requests
Aug 12 2017
Faux Amis <faux amis.com> writes:
On 2017-08-13 01:49, Soulsbane wrote:
 On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
 I would like to get into D again by making a small program which 
 fetches a website every X-time and keeps track of all changes within 
 specified dom elements.

 fetching: should I go for std curl, vibe.d or something else?
 parsing: I could only find these dub packages: htmld & libdominator.
 And they don't seem overly active, any recommendations?

 As I haven't been using D for some time I just don't want to get off 
 with a bad start :)
 thx
I've found the requests module nice to work with: http://code.dlang.org/packages/requests
Thanks, looks nice! I'll try it if Adam's modules fail me :)
Aug 13 2017