digitalmars.D - htmlget.d example and unicode parsing
- Tyro[a.c.edwards] (23/23) Apr 30 2011 Hello all,
- Nick Sabalausky (27/48) May 01 2011 Depends on what exactly you're doing. There are many cases where indexin...
Hello all,
I am trying to learn how to parse, modify, and redisplay a Japanese
webpage passed to me in a form and am wondering if anyone has an example
of how to do this.
I looked at htmlget and found that it has a couple of problems: namely, it
does not conform to current D2 practices. I am not sure that my hack can
be considered a fix but have attached it nonetheless. It now works
correctly on ASCII-based URLs but not UTF-8.
My lack of knowledge on how to properly parse Unicode documents has
left me stumped. I am therefore requesting some assistance in updating
the code such that it works with any URL. I have taken a look at std.utf
and there are a few things there that could possibly assist me; however,
without examples I'm somewhat at a loss.
I'm assuming that the problem exists here:
    for (iw = 0; iw != line.length; iw++)
    {
        if (!icmp("</html>", line[iw .. line.length]))
            break print_lines;
    }
From what I understand, one cannot index a UTF sequence the same way
you index ASCII. What is the proper way to rewrite this such that it
parses the UTF characters correctly? An example would do wonders.
Thanks
Apr 30 2011
"Tyro[a.c.edwards]" <nospam home.com> wrote in message
news:ipinj3$1c77$1 digitalmars.com...
From what I understand, one cannot index a UTF sequence the same way
you index ASCII.
Depends on what exactly you're doing. There are many cases where indexing
utf like ASCII works fine, and your code above looks like one of the cases
where it should work (Unless icmp throws or asserts on invalid code-unit
sequences. Anyone know offhand if it does?).
But you do have a non-utf-related bug in that loop. If there's anything in
'line' after the "</html>" tag, then it won't detect the tag because you're
slicing with the length of 'line' instead of the length of "</html>".
So it should be:
    for (iw = 0; iw != line.length; iw++)
    {
        immutable endTag = "</html>";
        // Compare only the endTag.length code units that start at iw.
        if (line.length - iw >= endTag.length &&
            !icmp(endTag, line[iw .. iw + endTag.length]))
            break print_lines;
    }
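As an illustration of why code-unit indexing is safe for this particular search, here is a minimal sketch that assumes 'line' is valid UTF-8 and that the tag being searched for is pure ASCII. The helper name findAsciiNoCase is hypothetical and not part of htmlget or Phobos. It compares raw code units only, so it never decodes anything and does not depend on how icmp treats a slice that starts in the middle of a multi-byte sequence. Every code unit of a multi-byte UTF-8 sequence is 0x80 or higher, so it can never be mistaken for an ASCII character.

```d
import std.ascii : toLower;
import std.stdio : writeln;

/// Hypothetical helper: returns the code-unit index of the first
/// case-insensitive match of an ASCII `needle` in `line`, or -1.
ptrdiff_t findAsciiNoCase(const(char)[] line, const(char)[] needle)
{
    if (needle.length > line.length)
        return -1;

    foreach (i; 0 .. line.length - needle.length + 1)
    {
        bool match = true;
        // Iterating without an explicit dchar type walks code units.
        foreach (j, c; needle)
        {
            // std.ascii.toLower leaves non-ASCII code units unchanged.
            if (toLower(line[i + j]) != toLower(c))
            {
                match = false;
                break;
            }
        }
        if (match)
            return i;
    }
    return -1;
}

void main()
{
    // Three Japanese characters (3 UTF-8 code units each) precede the tag.
    writeln(findAsciiNoCase("日本語</HTML>rest", "</html>"));
    writeln(findAsciiNoCase("no tag here", "</html>"));
}
```

The returned index counts code units, not characters, so it can be used directly to slice 'line'.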
On the topic of unicode, this is a really good introduction to the details
of it:
http://www.joelonsoftware.com/articles/Unicode.html
But once you read that, keep in mind there are a few important details he
failed to mention: A code-point is made up of code-units, yes, but a single
code-point is *not* always an entire character (aka "grapheme"). Because of
combining characters, a character could be made up of multiple code points (just
like how a code point can be made up of multiple code units). Also, there
are certain characters that can be represented with more than one specific
sequence of code points (and that gets into Unicode normalization).
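The three levels described above can be seen with a short sketch using Phobos (std.uni and std.range). The variable names and the sample string are chosen for illustration only. The example writes "é" in two ways: as 'e' followed by the combining acute accent U+0301, and as the single precomposed code point U+00E9.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string decomposed = "e\u0301";
    // The same visible character as one precomposed code point.
    string precomposed = "\u00E9";

    // .length counts UTF-8 code units.
    assert(decomposed.length == 3);
    // walkLength on a string decodes and counts code points.
    assert(decomposed.walkLength == 2);
    // byGrapheme groups code points into user-perceived characters.
    assert(decomposed.byGrapheme.walkLength == 1);

    // The two spellings differ as code-unit sequences...
    assert(decomposed != precomposed);
    // ...but compare equal once both are normalized to NFC.
    assert(normalize!NFC(decomposed) == normalize!NFC(precomposed));

    writeln("all checks passed");
}
```

Comparing user-supplied text after normalizing both sides is one common way to avoid the "same character, different code points" mismatch.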
May 01 2011
"Nick Sabalausky" <a a.a>