digitalmars.D.learn - Web crawler/scraping

reply Carlos Cabral <cmpscabral gmail.com> writes:
Hi,
I'm trying to collect some json data from a website/admin panel 
automatically, which is behind a login form.

Is there a D library that can help me with this?

Thank you
Feb 17
next sibling parent reply Ferhat Kurtulmuş <aferust gmail.com> writes:
I found this but it looks outdated: https://github.com/gedaiu/selenium.d
Feb 17
parent Carlos Cabral <cmpscabral gmail.com> writes:
Thanks! This seems to depend on Selenium. I was looking for something standalone, like crawler.get(...), crawler.post(...), crawler.parse(...), so that I can deploy it in the client's network as a single executable (the website I'm crawling is only available internally).
Feb 17
prev sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Wednesday, 17 February 2021 at 12:12:56 UTC, Carlos Cabral 
wrote:
 I'm trying to collect some json data from a website/admin panel 
 automatically, which is behind a login form.
Does the website need JavaScript? If not, my dom.d may be able to help. It can download some HTML, parse it, and fill in forms, then my http2.d submits it (I never implemented Form.submit in dom.d, but it is pretty easy to make with other functions that are implemented; heck, maybe I'll implement it now if it sounds like it might work).

Or, if it is all JSON, you might be able to just craft some requests with my lib or even Phobos' std.net.curl that submit the login request, save a cookie, then fetch some JSON stuff.

I literally just rolled out of bed, but in an hour or two I can come back and make some example code for you if this sounds plausible.
Feb 17
next sibling parent Carlos Cabral <cmpscabral gmail.com> writes:
No, I don't think it needs JS. I think I can submit the login form and then just fetch/save the JSON request using the login cookie, as you suggest. A crawler/scraping solution may be overkill... I'll try with std.net.curl and come back to you in a couple of hours. Thank you!!
Feb 17
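A note on the login step discussed above: a form body of the shape username=...&password=... needs percent-encoding as soon as a value contains characters such as & or =. Below is a minimal sketch, assuming a hypothetical `Field` struct and `encodeForm` helper (neither is part of Phobos); the only library call doing the work is `std.uri.encodeComponent`. It encodes a space as %20 rather than +, which decodes to the same value on the server side.

```d
import std.algorithm : map;
import std.array : join;
import std.uri : encodeComponent;

/// Hypothetical name/value pair for one form input (illustration only).
struct Field
{
    string name;
    string value;
}

/// Hypothetical helper: builds an application/x-www-form-urlencoded body.
/// Each name and value is percent-encoded, then pairs are joined with '&'.
string encodeForm(Field[] fields)
{
    return fields
        .map!(f => encodeComponent(f.name) ~ "=" ~ encodeComponent(f.value))
        .join("&");
}

void main()
{
    import std.stdio : writeln;

    // Placeholder credentials; the field names depend on the actual login form.
    auto data = encodeForm([Field("username", "user"),
                            Field("password", "p&ss=w rd")]);
    writeln(data);
}
```

The resulting string could be passed to `http.setPostData(data, "application/x-www-form-urlencoded")` in place of a hand-written body.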
prev sibling parent Carlos Cabral <cmpscabral gmail.com> writes:
...and it's working :) Thank you, Adam and Ferhat!

Leaving this here in case anyone needs it:

```
import std.stdio;
import std.string;
import std.net.curl;
import core.thread;

void main()
{
    int waitTime = 5; // seconds to pause between requests
    auto domain = "https://example.com";
    auto cookiesFile = "cookies.txt";

    auto http = HTTP();
    http.handle.set(CurlOption.use_ssl, 1);
    // Note: this turns off TLS certificate verification.
    http.handle.set(CurlOption.ssl_verifypeer, 0);
    // Read and write cookies in the same file so the login session
    // carries over between requests.
    http.handle.set(CurlOption.cookiefile, cookiesFile);
    http.handle.set(CurlOption.cookiejar, cookiesFile);
    http.setUserAgent("...");
    http.onReceive = (ubyte[] data) {
        // (...)
        return data.length; // onReceive must return the number of bytes handled
    };

    // 1. Load the login page so the server can set its initial cookies.
    http.method = HTTP.Method.get;
    http.url = domain ~ "/login";
    http.perform();

    Thread.sleep(waitTime.seconds);

    // 2. Submit the login form.
    auto data = "username=user&password=pass";
    http.method = HTTP.Method.post;
    http.url = domain ~ "/login";
    http.setPostData(data, "application/x-www-form-urlencoded");
    http.perform();

    Thread.sleep(waitTime.seconds);

    // 3. Fetch the JSON using the session cookie.
    http.method = HTTP.Method.get;
    http.url = domain ~ "/fetchjson";
    http.perform();
}
```
Feb 17
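The onReceive body in the code posted above is elided, so how the response was actually handled is not known. One possible way to do it, as a rough sketch: libcurl may deliver a response in several chunks, so the chunks are appended to a buffer and parsed with std.json only once the request has finished. `ResponseBuffer` and its members are hypothetical names made up for this illustration, and the two chunks in `main` are simulated rather than fetched.

```d
import std.json : JSONValue, parseJSON;

/// Hypothetical accumulator for the chunks an onReceive callback delivers.
struct ResponseBuffer
{
    ubyte[] bytes;

    /// Same shape as an onReceive callback: copy the chunk into the buffer
    /// and report how many bytes were consumed.
    size_t put(ubyte[] chunk)
    {
        bytes ~= chunk; // appending copies, so curl may reuse its own buffer
        return chunk.length;
    }

    /// Parses everything received so far as a single JSON document.
    JSONValue json() const
    {
        return parseJSON(cast(const(char)[]) bytes);
    }
}

void main()
{
    import std.stdio : writeln;

    ResponseBuffer buf;

    // With std.net.curl this could be wired up as:
    //     http.onReceive = (ubyte[] data) => buf.put(data);
    // Here two chunks are simulated instead of performing a request.
    buf.put(cast(ubyte[]) `{"items": [1, 2`.dup);
    buf.put(cast(ubyte[]) `, 3], "ok": true}`.dup);

    auto doc = buf.json();
    writeln(doc["items"].array.length);
    writeln(doc["ok"].boolean);
}
```

Since the same HTTP object is reused for all three requests, the buffer would need to be cleared (for example `buf.bytes = null;`) before the final request so that only the JSON response is parsed.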