digitalmars.D.learn - Decoding HTML escape sequences

Hugo Florentino via Digitalmars-d-learn writes:
Hi, I have some documents in which some strings appear as HTML escape 
sequences in one of these forms:

\x3C\x53\x43\x52\x49\x50\x54\x20\x4C\x41\x4E\x47\x55\x41\x47\x45\x3D\x22\x4A\x61\x76\x61\x53\x63\x72\x69\x70\x74\x22\x3e

%3C%53%43%52%49%50%54%20%4C%41%4E%47%55%41%47%45%3D%22%4A%61%76%61%53%63%72%69%70%74%22%3e

And I would like to recode them to readable form:

<SCRIPT LANGUAGE="JavaScript">

I tried something like this, using regular expressions and the uri 
module:


import std.stdio, std.file, std.encoding, std.string, std.regex, std.uri;

static auto re = regex(`(%[a-fA-F0-9]{2})`);

int main(in string[] args)
{
   if (args.length < 2)
   {
     writeln("Usage: unescape file1.htm > file2.htm");
     return -1;
   }
   auto input = cast(Latin1String) read(args[1]);
   string buffer;
   transcode(input, buffer);

   string output;
   foreach(m; matchAll(buffer, re)) output ~= decode(m.hit);

   writeln(output);

   return 0;
}


Unfortunately it doesn't seem to work 100%.

I would appreciate any suggestion.

Regards, Hugo
May 12 2014
"Adam D. Ruppe" <destructionator gmail.com> writes:
You should use decodeComponent instead of decode in your matchAll 
loop.

IMO encodeComponent and decodeComponent are the only two useful 
URI encoding functions (by the way, the same holds in JS: use 
decodeURIComponent instead of the other functions). The other ones 
have weird rules.
May 12 2014
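
For readers following along, here is a minimal sketch of the suggestion above. It is not code from the thread. It assumes the escapes encode single Latin-1 bytes, as the original program's use of Latin1String implies. The helper name unescapeHex is hypothetical. One of the "weird rules" is that std.uri.decode leaves escapes for reserved URI characters such as '=' (%3D) undecoded, while decodeComponent decodes them. Neither function knows the \xNN form, so the sketch handles both prefixes with one regex and replaces matches in place, keeping the surrounding text instead of emitting only the matches.

```d
import std.conv : to;
import std.regex : regex, replaceAll;
import std.stdio : writeln;
import std.uri : decode, decodeComponent;

/// Hypothetical helper (not from the thread): replaces every %XX or \xXX
/// escape with the character it encodes and leaves all other text untouched.
/// Assumption: each escape is one Latin-1 byte, not part of a UTF-8 sequence.
string unescapeHex(string text)
{
    // Group 1 captures the two hex digits that follow either prefix.
    auto re = regex(`(?:%|\\x)([0-9A-Fa-f]{2})`);
    return replaceAll!(m => to!string(cast(dchar) to!uint(m[1], 16)))(text, re);
}

void main()
{
    auto pct = "%3C%53%43%52%49%50%54%20%4C%41%4E%47%55%41%47%45%3D%22"
             ~ "%4A%61%76%61%53%63%72%69%70%74%22%3e";

    // decode skips escapes that resolve to reserved URI characters,
    // so the %3D ('=') in the sample is left as it is.
    writeln(decode(pct));

    // decodeComponent decodes every escape, including reserved characters.
    writeln(decodeComponent(pct));

    // The regex-based helper also covers the \xNN form and keeps plain text.
    writeln(unescapeHex(`before \x3C\x53\x43\x52\x49\x50\x54\x3e after`));
}
```

Note that decodeComponent interprets bytes of 0x80 and above as parts of UTF-8 sequences, so for Latin-1 input with high bytes the regex-based helper may be the safer choice.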