digitalmars.D.learn - Class for fetching a web page and parse into DOM

breezes <wangyuanzju gmail.com> writes:

Is there a class that can fetch a web page from the internet? And is std.xml
the right module for parsing it into a DOM tree?

Dec 15 2011

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Thursday, 15 December 2011 at 09:55:22 UTC, breezes wrote:
 Is there a class that can fetch a web page from the internet? 
 And is std.xml the right module for parsing it
 into a DOM tree?

You might want to use my dom.d

https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff

Grab dom.d, characterencodings.d, and curl.d.

Here's an example program:

====
import arsd.dom;
import arsd.curl;

import std.stdio;

void main() {
	auto document = new Document();
	document.parseGarbage(curl("http://digitalmars.com/"));

	writeln(document.querySelector("p"));
}
====

Compile like this:

dmd yourfile.d dom.d characterencodings.d curl.d

You'll need the curl C library from an outside source. If you're
on Linux, it is probably already installed. If you're on Windows,
check the Internet.

// this downloads a file from the web and returns a string
curl(url);

// this builds a DOM tree out of html. It's called parseGarbage because
// it tries to figure out really bad html - so it works on a lot of web
// sites.
document.parseGarbage(html);

// My dom.d includes a lot of functions you might know from
// javascript like getElementById, getElementsByTagName, and the
// get element by CSS selector functions
document.querySelector("p"); // get the first paragraph


And then, finally, the writeln puts out the html of an element.
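
To illustrate the other lookup functions mentioned above, here is a minimal sketch that parses an inline HTML string instead of a downloaded page, so it runs without curl. It assumes the dom.d in use provides querySelectorAll, innerText, and getAttribute alongside the functions named in this post; linkTargets is a hypothetical helper written for this example, not part of dom.d.

```d
import arsd.dom;

import std.stdio;

// Hypothetical helper (not part of dom.d): collect the href of every link.
string[] linkTargets(Document document) {
	string[] targets;
	foreach (link; document.getElementsByTagName("a")) {
		targets ~= link.getAttribute("href");
	}
	return targets;
}

void main() {
	// Inline HTML stands in for the string that curl() would return.
	auto html = `<html><body>
		<h1 id="title">Example</h1>
		<p>First paragraph</p>
		<p>Second <a href="/next">link</a></p>
	</body></html>`;

	auto document = new Document();
	document.parseGarbage(html);

	// Look up a single element by its id attribute.
	auto heading = document.getElementById("title");
	if (heading !is null)
		writeln(heading.innerText);

	// querySelectorAll returns every match, not just the first one.
	foreach (p; document.querySelectorAll("p"))
		writeln(p.innerText);

	// Print the link targets collected by the helper above.
	writeln(linkTargets(document));
}
```

Compile it the same way as the program above, leaving out curl.d since nothing is downloaded. The null check on getElementById is there because a real page may not contain the id you expect.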

Dec 15 2011

"Nick Sabalausky" <a a.a> writes:

Yup, I can confirm Adam's tools are great for this. At the moment, std.xml 
is known to have problems and is currently undergoing a rewrite.

Dec 15 2011
