
digitalmars.D.learn - For those ready to take the challenge

reply "eles" <eles eles.com> writes:
https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
Jan 09 2015
next sibling parent reply Justin Whear <justin economicmodeling.com> writes:
On Fri, 09 Jan 2015 13:50:28 +0000, eles wrote:

 https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
Was excited to give it a try, then remembered... std.xml :(
Jan 09 2015
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 9 January 2015 at 16:55:30 UTC, Justin Whear wrote:
 Was excited to give it a try, then remembered...std.xml  :(
Well, as the author of my dom.d, I think it counts as a first party library when I use it!

---
import arsd.dom;
import std.net.curl;
import std.stdio, std.algorithm;

void main() {
	auto document = new Document(cast(string) get("http://www.stroustrup.com/C++.html"));
	writeln(document.querySelectorAll("a[href]").map!(a => a.href));
}
---

prints:

[snip ... "http://www.morganstanley.com/", "http://www.cs.columbia.edu/", "http://www.cse.tamu.edu", "index.html", "C++.html", "bs_faq.html", "bs_faq2.html", "C++11FAQ.html", "papers.html", "4th.html", "Tour.html", "programming.html", "dne.html", "bio.html", "interviews.html", "applications.html", "glossary.html", "compilers.html"]

Or perhaps better yet:

---
import arsd.dom;
import std.net.curl;
import std.stdio;

void main() {
	auto document = new Document(cast(string) get("http://www.stroustrup.com/C++.html"));
	foreach(a; document.querySelectorAll("a[href]"))
		writeln(a.href);
}
---

Which puts each one on a separate line.
Jan 09 2015
next sibling parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
Huh, looking at the answers on the website, they're mostly using 
regular expressions. Weaksauce. And wrong - they don't find ALL 
the links, they find the absolute HTTP urls!
Jan 09 2015
next sibling parent Justin Whear <justin economicmodeling.com> writes:
On Fri, 09 Jan 2015 17:18:42 +0000, Adam D. Ruppe wrote:

 Huh, looking at the answers on the website, they're mostly using regular
 expressions. Weaksauce. And wrong - they don't find ALL the links, they
 find the absolute HTTP urls!
Yes, I noticed that. `<script src="http://app.js"></script>` isn't a "hyperlink". Wake up sheeple!
Jan 09 2015
prev sibling next sibling parent "Ola Fosheim Grøstad" writes:
On Friday, 9 January 2015 at 17:18:43 UTC, Adam D. Ruppe wrote:
 Huh, looking at the answers on the website, they're mostly 
 using regular expressions. Weaksauce. And wrong - they don't 
 find ALL the links, they find the absolute HTTP urls!
Yeah... Surprising, since languages like Python include an HTML parser in the standard library.

Besides, if you want all resource links you have to do a lot better, since the following attributes can contain resource addresses: href, src, data, cite, xlink:href…

You also need to do entity expansion, since the links can contain HTML entities like "&amp;". Depressing.
Jan 10 2015
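Ola's point about multiple link-bearing attributes and entity expansion can be sketched with Python's standard-library html.parser (the stdlib parser alluded to above). This is an illustrative sketch only; the LinkCollector class and the sample markup are made up here, not from any post in the thread:

```python
from html.parser import HTMLParser

# Attributes that may carry resource addresses, per the list above.
LINK_ATTRS = {"href", "src", "data", "cite", "xlink:href"}

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in LINK_ATTRS and value is not None:
                # html.parser expands entity references in attribute
                # values, so "&amp;" arrives here as a plain "&".
                self.links.append(value)

collector = LinkCollector()
collector.feed('<a href="a.html?x=1&amp;y=2">x</a><img src="b.png">')
print(collector.links)  # ['a.html?x=1&y=2', 'b.png']
```

Note that the entity expansion comes for free from the parser, which is exactly what the regex-based answers get wrong.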
prev sibling parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
On Friday, 9 January 2015 at 17:18:43 UTC, Adam D. Ruppe wrote:
 Huh, looking at the answers on the website, they're mostly 
 using regular expressions. Weaksauce. And wrong - they don't 
 find ALL the links, they find the absolute HTTP urls!
Since it is a comparison of languages it's okay to match the original behaviour.
Jan 10 2015
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 12:34:42 UTC, Tobias Pankrath 
wrote:
 Since it is a comparison of languages it's okay to match the 
 original behaviour.
I don't think this is really a great comparison of languages either, though, because it is gluing together a couple of library tasks. Only a few bits of the actual language show through.

In the given regex solutions, C++ has an advantage over C in that the regex structure can be freed automatically in a destructor, and there's a raw string literal, but that's about all from the language itself. The original one is kinda long because he didn't use an HTTP get library, not because the language couldn't have one.

There are bits where the language can make those libraries nicer too: dom.d uses operator overloading and opDispatch to support things like .attribute as well as .attr.X, .style.foo, element["selector"].addClass("foo"), and so on, implemented in very, very little code - I didn't have to manually list methods for the collection or properties for the attributes. But a library *could* do it that way in another language too and get similar results for the end user; the given posts wouldn't show that.
Jan 10 2015
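The opDispatch pattern Adam describes has rough analogues in other languages. As an illustrative sketch only (this is not dom.d's code; the names AttrProxy and Element are made up here), Python's __getattr__/__setattr__ hooks can provide the same .attr.X surface without listing a property per attribute:

```python
class AttrProxy:
    """Exposes a dict of attributes as .attr.name reads and writes,
    loosely mimicking the .attr.X surface described for dom.d."""
    def __init__(self, attributes):
        # Bypass our own __setattr__ while wiring up internal state.
        object.__setattr__(self, "_attributes", attributes)

    def __getattr__(self, name):
        # Called only for names not found normally; missing
        # attributes read as the empty string.
        return self._attributes.get(name, "")

    def __setattr__(self, name, value):
        self._attributes[name] = value

class Element:
    def __init__(self, **attributes):
        self.attributes = dict(attributes)

    @property
    def attr(self):
        # A fresh proxy each time, sharing the underlying dict.
        return AttrProxy(self.attributes)

e = Element(href="C++.html")
print(e.attr.href)            # C++.html
e.attr.title = "FAQ"
print(e.attributes["title"])  # FAQ
```

The point is the same one Adam makes: the dispatch hook means the library author never enumerates the attribute names by hand.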
parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
On Saturday, 10 January 2015 at 15:13:27 UTC, Adam D. Ruppe wrote:
 On Saturday, 10 January 2015 at 12:34:42 UTC, Tobias Pankrath 
 wrote:
 Since it is a comparison of languages it's okay to match the 
 original behaviour.
 I don't think this is really a great comparison of languages either though because it is gluing together a couple library tasks. Only a few bits about the actual language are showing through.

 In the given regex solutions, C++ has an advantage over C wherein the regex structure can be freed automatically in a destructor and a raw string literal in here, but that's about all from the language itself. The original one is kinda long because he didn't use a http get library, not because the language couldn't do one.

 There are bits where the language can make those libraries nicer too: dom.d uses operator overloading and opDispatch to support things like .attribute and also .attr.X and .style.foo and element["selector"].addClass("foo") and so on implemented in very very little code - I didn't have to manually list methods for the collection or properties for the attributes - ...but a library *could* do it that way and get similar results for the end user; the given posts wouldn't show that.
I agree and one of the answers says:
 I think the "no third-party" assumption is a fallacy. And is a 
 specific fallacy that afflicts C++ developers, since it's so 
 hard to make reusable code in C++. When you are developing 
 anything at all, even if it's a small script, you will always 
 make use of whatever pieces of reusable code are available to 
 you.
 The thing is, in languages like Perl, Python, Ruby (to name a 
 few), reusing
 someone else's code is not only easy, but it is how most people 
 actually write code most of the time.
I think he's wrong, because it spoils the comparison. Every answer should delegate those tasks to a library that Stroustrup used as well, e.g. regex matching, string-to-number conversion, and some kind of TCP sockets. But it must do the same work that his solution does: create and parse the HTTP header and extract the HTML links, probably using regex, though I wouldn't mind another solution.

Otherwise everyone can put a libdo_the_stroustroup_thing on dub and then just call do_the_stroustroup_thing() in main. Comparing what the standard libraries (and libraries easily obtained or quasi-standard) offer is another challenge.
Jan 10 2015
next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 15:52:21 UTC, Tobias Pankrath 
wrote:
 But it must do the same work that he's solution does: Create 
 and parse HTML header and extract the html links, probably 
 using regex, but I wouldn't mind another solution.
Yeah, that would be best.

BTW, interesting lines here:

s << "GET " << "http://" + server + "/" + file << " HTTP/1.0\r\n";
s << "Host: " << server << "\r\n";

Why + instead of <<? C++'s usage of << is totally blargh to me anyway, but seeing both is even stranger. Weird language, weird library.
 Everyone can put a libdo_the_stroustroup_thing on dub and then 
 call do_the_stroustroup_thing() in main. To compare what the 
 standard libraries (and libraries easily obtained or quasi 
 standard) offer is another challenge.
Yeah.
Jan 10 2015
prev sibling next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Saturday, 10 January 2015 at 15:52:21 UTC, Tobias Pankrath 
wrote:
 ...

 The thing is, in languages like Perl, Python, Ruby (to name a 
 few), reusing
 someone else's code is not only easy, but it is how most 
 people actually write code most of the time.
 I think he's wrong, because it spoils the comparison. Every answer should delegate those tasks to a library that Stroustroup used as well, e.g. regex matching, string to number conversion and some kind of TCP sockets. But it must do the same work that he's solution does: Create and parse HTML header and extract the html links, probably using regex, but I wouldn't mind another solution.

 Everyone can put a libdo_the_stroustroup_thing on dub and then call do_the_stroustroup_thing() in main. To compare what the standard libraries (and libraries easily obtained or quasi standard) offer is another challenge.
I disagree. The great thing about batteries-included runtimes is that I have the guarantee that the desired features exist on all platforms supported by the language.

If the libraries are dumped into a repository, there is always the question whether a library works across all OSes supported by the language, or even whether the libraries work together at all, especially if they depend on common packages with incompatible versions.

This is the cause of so many string and vector types across C++ libraries, as most of those libraries were developed before C++98 was even done. Or why the C runtime is nothing more than a light version of UNIX as it was back in 1989, without any worthwhile feature added since then, besides some extra support for numeric types and slightly more secure libraries.

Nowadays, unless I am doing something very OS specific, I hardly care which OS I am using, thanks to such "comes with batteries" runtimes.

--
Paulo
Jan 10 2015
prev sibling parent reply "Ola Fosheim Grøstad" writes:
On Saturday, 10 January 2015 at 15:52:21 UTC, Tobias Pankrath 
wrote:
 I think he's wrong, because it spoils the comparison. Every 
 answer should delegate those tasks to a library that 
 Stroustroup used as well, e.g. regex matching, string to number 
 conversion and some kind of TCP sockets. But it must do the 
 same work that he's solution does: Create and parse HTML header 
 and extract the html links, probably using regex, but I 
 wouldn't mind another solution.
The challenge is completely pointless. Different languages have different ways of hacking together a compact incorrect solution; how to directly translate a C++ hack into another language is a task for people who are drunk.

For the challenge to make sense, it would have to entail parsing all legal HTML5 documents, extracting all resource links, converting them into absolute form, and printing them one per line. With no hiccups.
Jan 10 2015
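Ola's stricter specification (parse the document, extract the links, print them in absolute form, one per line) can be approximated with nothing but a standard library; a truly HTML5-conformant parser would need much more, so treat this Python version as a sketch. HrefCollector is a made-up name, and only a[href] links are handled here:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class HrefCollector(HTMLParser):
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.absolute = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                # urljoin resolves relative references against the
                # base URL and leaves already-absolute URLs alone.
                self.absolute.append(urljoin(self.base, value))

c = HrefCollector("http://www.stroustrup.com/C++.html")
c.feed('<a href="bs_faq.html">FAQ</a> <a href="http://example.com/">x</a>')
print("\n".join(c.absolute))
```

The relative-to-absolute step is the part most of the regex answers on the site skip entirely.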
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 17:23:31 UTC, Ola Fosheim Grøstad 
wrote:
 For the challenge to make sense it would entail parsing all 
 legal HTML5 documents, extracting all resource links, 
 converting them into absolute form and printing them one per 
 line. With no hickups.
Though, that's still a library thing rather than a language thing. dom.d and the Url struct in cgi.d should be able to do all that, in just a few lines even, but that's just because I've done a *lot* of web scraping with the libs before, so I made them work for that.

In fact... let me do it. I'll use my http2.d instead of cgi.d, actually; it has a similar Url struct, just more focused on client requests.

import arsd.dom;
import arsd.http2;
import std.stdio;

void main() {
	auto base = Uri("http://www.stroustrup.com/C++.html");

	// http2 is a newish module of mine that aims to imitate
	// a browser in some ways (without depending on curl btw)
	auto client = new HttpClient();
	auto request = client.navigateTo(base);
	auto document = new Document();

	// and http2 provides an asynchronous api but you can
	// pretend it is sync by just calling waitForCompletion
	auto response = request.waitForCompletion();

	// parseGarbage uses a few tricks to fixup invalid/broken HTML
	// tag soup and auto-detect character encodings, including when
	// it lies about being UTF-8 but is actually Windows-1252
	document.parseGarbage(response.contentText);

	// Uri.basedOn returns a new absolute URI based on something else
	foreach(a; document.querySelectorAll("a[href]"))
		writeln(Uri(a.href).basedOn(base));
}

Snippet of the printouts:

[...]
http://www.computerhistory.org
http://www.softwarepreservation.org/projects/c_plus_plus/
http://www.morganstanley.com/
http://www.cs.columbia.edu/
http://www.cse.tamu.edu
http://www.stroustrup.com/index.html
http://www.stroustrup.com/C++.html
http://www.stroustrup.com/bs_faq.html
http://www.stroustrup.com/bs_faq2.html
http://www.stroustrup.com/C++11FAQ.html
http://www.stroustrup.com/papers.html
[...]

The latter are relative links resolved against the base, and the first few were already absolute. Seems to have worked.

There are other kinds of links than just a[href], but fetching them is as simple as adding them to the selector, or looping for them separately too:

foreach(a; document.querySelectorAll("script[src]"))
	writeln(Uri(a.src).basedOn(base));

There are none on that page, and no <link>s either, but it is easy enough to do with the lib.

Looking at the source of that page, I find some invalid HTML and lies about the character set. How did Document.parseGarbage do? Pretty well: outputting the parsed DOM tree shows it auto-corrected the problems I see by eye.
Jan 10 2015
parent reply "Ola Fosheim Grøstad" writes:
On Saturday, 10 January 2015 at 17:39:17 UTC, Adam D. Ruppe wrote:
 Though, that's still a library thing rather than a language 
 thing.
It is a language-library-platform thing; things like how composable the ecosystem is would be interesting to compare. But it would be unfair to require a minimalistic language not to use third-party libraries. One should probably require that the library used is generic (not a spider framework), not using FFI, mature, and maintained?
 	document.parseGarbage(response.contentText);

         // Uri.basedOn returns a new absolute URI based on 
 something else
 	foreach(a; document.querySelectorAll("a[href]"))
 		writeln(Uri(a.href).basedOn(base));
 }
Nice and clean code; does it expand HTML entities ("&amp;")?

The HTML5 standard has improved on HTML4 by now being explicit about how incorrect documents shall be interpreted, in section 8.2. That ought to be sufficient, since that is what web browsers are supposed to do.

http://www.w3.org/TR/html5/syntax.html#html-parser
Jan 10 2015
parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 19:17:22 UTC, Ola Fosheim Grøstad 
wrote:
 Nice and clean code; does it expand html entities ("&amp")?
Of course. It does it both ways:

<span>a &amp;</span>
span.innerText == "a &"

span.innerText = "a \" b";
assert(span.innerHTML == "a &quot; b");

parseGarbage also tries to fix broken entities, so e.g. an & standing alone will be translated to &amp; for you. There's also parseStrict, which just throws an exception in cases like that.

That's one thing a lot of XML parsers don't do in the name of speed, but I do, since it is pretty rare that I don't want them translated. One thing I did for a speedup, though, was to scan the string for &: if it doesn't find one, return a slice of the original, and if it does, return a new string with the entities translated. Gave a surprisingly big speed boost without costing anything in convenience.
 The HTML5 standard has improved on HTML4 by now being explicit 
 on how incorrect documents shall be interpreted in section 8.2. 
 That ought to be sufficient, since that is what web browsers 
 are supposed to do.

 http://www.w3.org/TR/html5/syntax.html#html-parser
Huh, I never read that, my thing just did what looked right to me over hundreds of test pages that were broken in various strange and bizarre ways.
Jan 10 2015
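The fast path Adam describes (scan for '&'; when none is found, no entity can be present, so hand back the original string) is easy to sketch outside of D as well. unescape_fast below is a hypothetical helper modeled on that description, not dom.d's actual code, with Python's html module supplying the slow path:

```python
import html

def unescape_fast(s):
    # Fast path: no '&' means no entities, so return the input
    # itself. (In D this is a slice of the original string; in
    # Python it is the same object, so no copy either way.)
    if "&" not in s:
        return s
    # Slow path: build a new string with entities translated.
    return html.unescape(s)

print(unescape_fast("plain text"))   # plain text
print(unescape_fast("a &amp; b"))    # a & b
print(html.escape('a " b'))          # a &quot; b  (the other direction)
```

The trade-off is the same as described: the common no-entity case allocates nothing, while correctness is preserved for the rest.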
prev sibling parent reply "Nordlöw" <per.nordlow gmail.com> writes:
On Friday, 9 January 2015 at 17:15:43 UTC, Adam D. Ruppe wrote:
 import arsd.dom;
 import std.net.curl;
 import std.stdio, std.algorithm;

 void main() {
 	auto document = new Document(cast(string) 
 get("http://www.stroustrup.com/C++.html"));
 	writeln(document.querySelectorAll("a[href]").map!(a=>a.href));
 }

 Or perhaps better yet:

 import arsd.dom;
 import std.net.curl;
 import std.stdio;

 void main() {
 	auto document = new Document(cast(string) 
 get("http://www.stroustrup.com/C++.html"));
 	foreach(a; document.querySelectorAll("a[href]"))
 		writeln(a.href);
 }

 Which puts each one on a separate line.
Both these code examples trigger the same assert():

dmd: expression.c:3761: size_t StringExp::length(int): Assertion `encSize == 1 || encSize == 2 || encSize == 4' failed.

on dmd git master. Ideas, anyone?
Jan 10 2015
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 13:22:57 UTC, Nordlöw wrote:
 on dmd git master. Ideas anyone?
Don't use git master :P Definitely another regression. That line was just pushed to git like two weeks ago and the failing assertion is pretty obviously a pure dmd code bug, it doesn't know the length of char apparently.
Jan 10 2015
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Adam D. Ruppe:

 Don't use git master :P
Is the issue in Bugzilla? Bye, bearophile
Jan 10 2015
next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 15:24:45 UTC, bearophile wrote:
 Is the issue in Bugzilla?
I don't know, bugzilla is extremely difficult to search. I guess I'll post it again and worst case it will be closed as a duplicate.
Jan 10 2015
prev sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 15:24:45 UTC, bearophile wrote:
 Is the issue in Bugzilla?
https://issues.dlang.org/show_bug.cgi?id=13966
Jan 10 2015
prev sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 10 January 2015 at 14:56:09 UTC, Adam D. Ruppe wrote:
 On Saturday, 10 January 2015 at 13:22:57 UTC, Nordlöw wrote:
 on dmd git master. Ideas anyone?
Don't use git master :P
Do use git master. The more people do, the fewer regressions will slip into the final release. You can use Dustmite to reduce the code to a simple example, and Digger to find the exact pull request which introduced the regression. (Yes, shameless plug, preaching to the choir, etc.)
Jan 10 2015
prev sibling next sibling parent reply "Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
On Friday, 9 January 2015 at 13:50:29 UTC, eles wrote:
 https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
Link to answer in D: http://codegolf.stackexchange.com/a/44417/13362
Jan 09 2015
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/9/15 6:10 PM, Jesse Phillips wrote:
 On Friday, 9 January 2015 at 13:50:29 UTC, eles wrote:
 https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
Link to answer in D: http://codegolf.stackexchange.com/a/44417/13362
Nailed it. -- Andrei
Jan 09 2015
prev sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 10 January 2015 at 02:10:04 UTC, Jesse Phillips 
wrote:
 On Friday, 9 January 2015 at 13:50:29 UTC, eles wrote:
 https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
Link to answer in D: http://codegolf.stackexchange.com/a/44417/13362
I think byLine is not necessary. By default, . will not match line breaks.

One-statement solution:

import std.net.curl, std.stdio;
import std.algorithm, std.regex;

void main() {
	get("http://www.stroustrup.com/C++.html")
		.matchAll(`<a.*?href="(.*)"`)
		.map!(m => m[1])
		.each!writeln();
}

Requires Phobos PR#2024 ;)
Jan 09 2015
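One caveat about the pattern above: `(.*)` is greedy, so if other quoted attributes follow the href on the same line, it captures past the closing quote; `(.*?)` stops at the first one. A quick demonstration (Python regex here, but the quantifier semantics are the same):

```python
import re

line = '<a class="x" href="C++.html" title="the C++ page">'

# Greedy (.*) runs to the LAST double quote on the line.
greedy = re.search(r'<a.*?href="(.*)"', line).group(1)
# Non-greedy (.*?) stops at the FIRST closing quote.
lazy = re.search(r'<a.*?href="(.*?)"', line).group(1)

print(greedy)  # C++.html" title="the C++ page
print(lazy)    # C++.html
```

On Stroustrup's page the href happens to be the last attribute often enough for the greedy version to look right, which is exactly how this kind of bug hides.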
parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Vladimir Panteleev via Digitalmars-d-learn píše v So 10. 01. 2015 v
07:42 +0000:
 On Saturday, 10 January 2015 at 02:10:04 UTC, Jesse Phillips 
 wrote:
 On Friday, 9 January 2015 at 13:50:29 UTC, eles wrote:
 https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
Link to answer in D: http://codegolf.stackexchange.com/a/44417/13362
 I think byLine is not necessary. By default . will not match line breaks.

 One statement solution:

 import std.net.curl, std.stdio;
 import std.algorithm, std.regex;

 void main() {
 	get("http://www.stroustrup.com/C++.html")
 		.matchAll(`<a.*?href="(.*)"`)
 		.map!(m => m[1])
 		.each!writeln();
 }

 Requires Phobos PR#2024 ;)
Oh, here it is; I was looking for each. I thought it was already in Phobos but could not find it. Now I know why :D
Jan 10 2015
prev sibling parent "MattCoder" <stop spam.com> writes:
On Friday, 9 January 2015 at 13:50:29 UTC, eles wrote:
 https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro
From the link: "Let's show Stroustrup what small and readable program actually is."

Alright, there are a lot of examples in many languages, but shouldn't those examples handle exceptions like the original code does?

Matheus.
Jan 10 2015