digitalmars.D.learn - html2txt library, anyone?

jicman <jicman_member pathlink.com> writes:
Yes, I know I can use Cygwin tools or lynx, w3m, etc., to take an HTML file and
convert it to text, but has anyone written a D library to do this?

Thanks,

josé
Jan 19 2006
James Dunne <james.jdunne gmail.com> writes:
I've just written such a thing for C#. The code is mostly platform-agnostic, so it should port to D easily enough. It looks like C and scares everyone, which is why I like it :) Interested?
Jan 21 2006
jicman <jicman_member pathlink.com> writes:
He he he he... C doesn't scare me. ;-)  Neither does C#. :-)  Yes, please.  I
would love to have it.  Would you be so kind as to email it to:

cabrera
at
wrc.xerox.com

Thanks.

josé

Jan 21 2006
James Dunne <james.jdunne gmail.com> writes:
So, keep me in suspense...

--
Regards,
James Dunne
Jan 27 2006
"Charles" <noone nowhere.com> writes:
Here's a PCRE regex that will do it

Jan 23 2006
"Charles" <noone nowhere.com> writes:
Oops,

// Matches every <tag> and &entity;. Stripping the matches leaves the plain text between the tags.
char[] htmlContents = `<([^>])+>|&([^;])+;`;
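
For anyone who wants to try the pattern, here is a minimal sketch of how it might be applied. It assumes a current D compiler and Phobos's `std.regex` rather than a PCRE binding, and it uses `string` where the post uses `char[]`, since string literals are immutable in D2. The name `stripTags` is made up for illustration and is not from any library.

```d
import std.regex : regex, replaceAll;

/// Hypothetical helper: deletes everything the posted pattern matches.
string stripTags(string html)
{
    // Same pattern as in the post: any <...> tag or any &...; entity.
    auto pattern = regex(`<([^>])+>|&([^;])+;`);
    return replaceAll(html, pattern, "");
}

unittest
{
    // Tags disappear and the text between them is kept.
    assert(stripTags("<p>Hello, <b>world</b>!</p>") == "Hello, world!");

    // Entities are deleted rather than decoded, so the ampersand is lost.
    assert(stripTags("a &amp; b") == "a  b");
}
```

Building with `dmd -unittest -main` should run the checks. Note that entities are simply dropped, so a fuller converter would need to decode them instead.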


Jan 23 2006
James Dunne <james.jdunne gmail.com> writes:
What about reflowing whitespace runs? BR tags to newlines, P tags, ordered lists, bulleted lists? Incorrectly nested closing tags (e.g. <i><b></i></b>)? Not to mention that you have to parse each tag's attributes so you don't accidentally hit a right angle bracket inside a string value... HTML/XML comments... the list never ends. This is why HTML is such a hacked standard.

--
Regards,
James Dunne
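
To make a few of these points concrete, here is a rough sketch of a hand-written scanner in D. It is only an illustration under stated assumptions, not the C# code mentioned earlier in the thread: the name `htmlToText` is hypothetical, only `<br>`, `</p>` and `</li>` produce line breaks, entities are left undecoded, and list numbering, `<pre>`, scripts and styles are ignored. It does skip comments, ignores a `>` inside a quoted attribute value, and collapses whitespace runs. A plain tag-stripping regex such as `<[^>]+>` would turn `<a title="x > y">link</a>` into ` y">link`.

```d
import std.algorithm.searching : startsWith;
import std.array : appender;
import std.ascii : isAlphaNum, isWhite;
import std.string : indexOf, strip;
import std.uni : toLower;

/// Hypothetical helper: converts a small subset of HTML to plain text.
string htmlToText(string html)
{
    auto output = appender!string();
    bool pendingSpace = false; // a whitespace run is waiting to become one space
    size_t i = 0;

    while (i < html.length)
    {
        if (html[i .. $].startsWith("<!--"))
        {
            // Skip the whole comment; an unterminated one swallows the rest.
            auto rel = html[i + 4 .. $].indexOf("-->");
            i = (rel < 0) ? html.length : i + 4 + rel + 3;
        }
        else if (html[i] == '<')
        {
            // Find the closing '>' while ignoring any '>' inside quotes.
            size_t j = i + 1;
            char quote = 0;
            while (j < html.length && (quote != 0 || html[j] != '>'))
            {
                if (quote != 0)
                {
                    if (html[j] == quote) quote = 0;
                }
                else if (html[j] == '"' || html[j] == '\'')
                {
                    quote = html[j];
                }
                ++j;
            }

            // Read the tag name, noting whether this is a closing tag.
            size_t k = i + 1;
            immutable isClosing = k < j && html[k] == '/';
            if (isClosing) ++k;
            size_t nameEnd = k;
            while (nameEnd < j && isAlphaNum(html[nameEnd])) ++nameEnd;
            auto name = html[k .. nameEnd].toLower;

            // Only these tags affect layout in this sketch.
            if (name == "br" || (isClosing && (name == "p" || name == "li")))
            {
                output.put('\n');
                pendingSpace = false;
            }
            i = (j < html.length) ? j + 1 : j;
        }
        else
        {
            if (isWhite(html[i]))
            {
                pendingSpace = true;
            }
            else
            {
                // Emit at most one space per whitespace run, never at line start.
                if (pendingSpace && output.data.length > 0 && output.data[$ - 1] != '\n')
                    output.put(' ');
                pendingSpace = false;
                output.put(html[i]);
            }
            ++i;
        }
    }
    return output.data.strip;
}

unittest
{
    assert(htmlToText(`<a title="x > y">link</a>`) == "link");
    assert(htmlToText("one  <!-- a > b -->two<br>three") == "one two\nthree");
    assert(htmlToText("<p>a</p><p>b</p>") == "a\nb");
    assert(htmlToText("<i><b>x</i></b>") == "x");
}
```

Misnested tags cause no trouble here only because every tag is discarded. Anything that has to track nesting, such as numbering ordered lists, would need a real parser.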
Jan 27 2006
"Charles" <noone nowhere.com> writes:
 This is why HTML is such a hacked standard.

Yeah, I agree. I've been using AJAX lately, but it's hard for me to get over how 'hackish' it is, jumping through tons of hoops just to overcome the limitations of HTTP/HTML. Have you seen HTML 2.0? http://www.w3.org/MarkUp/html-spec/html-spec_toc.html

I'd love to see a new design language for the web, connection-based and with some better widgets. Using Mango for the server and the Harmonia code base to display this unnamed new language :D

Jan 29 2006