
digitalmars.D.learn - Class for fetching a web page and parse into DOM

reply breezes <wangyuanzju gmail.com> writes:
Is there a class that can fetch a web page from the internet? And is std.xml the right module for parsing it into a DOM tree?
Dec 15 2011
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 15 December 2011 at 09:55:22 UTC, breezes wrote:
 Is there a class that can fetch a web page from the internet? 
 And is std.xml the right module for parsing it
 into a DOM tree?
You might want to use my dom.d:

https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff

Grab dom.d, characterencodings.d, and curl.d. Here's an example program:

====
import arsd.dom;
import arsd.curl;
import std.stdio;

void main() {
    auto document = new Document();
    document.parseGarbage(curl("http://digitalmars.com/"));
    writeln(document.querySelector("p"));
}
=====

Compile like this:

dmd yourfile.d dom.d characterencodings.d curl.d

You'll need the curl C library from an outside source. If you're on Linux, it is probably already installed. If you're on Windows, check the Internet.

// this downloads a file from the web and returns a string
curl(site url);

// this builds a DOM tree out of html. It's called parseGarbage because
// it tries to figure out really bad html - so it works on a lot of web
// sites.
document.parseGarbage(string);

// My dom.d includes a lot of functions you might know from
// javascript like getElementById, getElementsByTagName, and the
// get element by CSS selector functions
document.querySelector("p") // get the first paragraph

And then, finally, the writeln puts out the html of an element.
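To illustrate the JavaScript-style functions mentioned above without needing network access or curl, here is a minimal sketch that parses an inline HTML string. It assumes dom.d exposes `getElementById`, `getElementsByTagName`, `querySelector`, `innerText`, and `getAttribute` as they appear in current versions of the library, and that `querySelector` returns null when nothing matches. The helper `firstParagraphText` and the sample markup are hypothetical, made up for this example.

```d
import arsd.dom;
import std.stdio;

// Hypothetical helper (not part of dom.d): returns the text of the first
// <p> element, or an empty string if the document has no paragraph.
// Assumes querySelector yields null when there is no match.
string firstParagraphText(string html) {
    auto document = new Document();
    document.parseGarbage(html);
    auto p = document.querySelector("p");
    return p is null ? "" : p.innerText;
}

void main() {
    // Sample markup invented for this sketch.
    string html = `<html><body>
        <h1 id="title">Hello</h1>
        <p>First paragraph</p>
        <p>See <a href="/next">the next page</a></p>
    </body></html>`;

    auto document = new Document();
    document.parseGarbage(html);

    // Look up a single element by its id attribute.
    auto title = document.getElementById("title");
    if (title !is null)
        writeln("title: ", title.innerText);

    // Collect every link and print its target.
    foreach (a; document.getElementsByTagName("a"))
        writeln("link: ", a.getAttribute("href"));

    writeln("first paragraph: ", firstParagraphText(html));
}
```

Since this version does not download anything, it should only need dom.d and characterencodings.d on the dmd command line. Newer releases of the library may pull in additional modules.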
Dec 15 2011
parent "Nick Sabalausky" <a a.a> writes:
"Adam D. Ruppe" <destructionator gmail.com> wrote in message 
news:nlccexskkftzaapfdnti dfeed.kimsufi.thecybershadow.net...
 On Thursday, 15 December 2011 at 09:55:22 UTC, breezes wrote:
 Is there a class that can fetch a web page from the internet? And is 
 std.xml the right module for parsing it
 into a DOM tree?
Yup, I can confirm Adam's tools are great for this. At the moment, std.xml is known to have problems and is currently undergoing a rewrite.
Dec 15 2011