digitalmars.D - htmlget.d example and unicode parsing

reply "Tyro[a.c.edwards]" <nospam home.com> writes:

Hello all,

I am trying to learn how to parse, modify, and redisplay a Japanese 
webpage passed to me in a form and am wondering if anyone has an example 
of how to do this.

I looked at htmlget and found that it has a couple of problems: namely,
it does not conform to current D2 practices. I am not sure that my hack
can be considered a fix, but I have attached it nonetheless. It now works
correctly on ASCII-based URLs but not on UTF-8 ones.

My lack of knowledge of how to properly parse Unicode documents has
left me stumped. I am therefore requesting some assistance in updating
the code so that it works with any URL. I have taken a look at std.utf
and there are a few things there that could possibly assist me; however,
without examples I'm somewhat at a loss.

I'm assuming that the problem exists here:

    for (iw = 0; iw != line.length; iw++)
    {
        if (!icmp("</html>", line[iw .. line.length]))
            break print_lines;
    }

From what I understand, one cannot index a UTF sequence the same way as
you index ASCII. What is the proper way to rewrite this so that it
parses the UTF characters correctly? An example would do wonders.

Thanks
Apr 30 2011
parent "Nick Sabalausky" <a a.a> writes:
"Tyro[a.c.edwards]" <nospam home.com> wrote in message 
news:ipinj3$1c77$1 digitalmars.com...
 From what I understand, one cannot index a UTF sequence the same way as
 you index ASCII.

Depends on what exactly you're doing. There are many cases where indexing
UTF like ASCII works fine, and your code above looks like one of the cases
where it should work (unless icmp throws or asserts on invalid code-unit
sequences. Anyone know offhand if it does?).

But you do have a non-UTF-related bug in that loop. If there's anything in
'line' after the "</html>" tag, then it won't detect the tag, because
you're slicing to the end of 'line' instead of slicing only as many code
units as "</html>" has. So it should be:

    for (iw = 0; iw != line.length; iw++)
    {
        immutable endTag = "</html>";
        // Compare only the endTag.length code units that start at iw,
        // and only when that many code units are still left in the line.
        if (iw + endTag.length <= line.length &&
            !icmp(endTag, line[iw .. iw + endTag.length]))
            break print_lines;
    }

On the topic of Unicode, this is a really good introduction to the details
of it:

http://www.joelonsoftware.com/articles/Unicode.html

But once you've read that, keep in mind there are a few important details
he failed to mention: a code point is made up of code units, yes, but a
single code point is *not* always an entire character (aka "grapheme").
Because of combining codes, a character could be made up of multiple code
points (just like how a code point can be made up of multiple code units).
Also, there are certain characters that can be represented with more than
one specific sequence of code points (and that gets into Unicode
normalization).
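
To make those levels concrete, here is a minimal sketch, not a fix for htmlget itself. It assumes a Phobos recent enough to provide std.uni.byGrapheme and std.uni.normalize, and it swaps the hand-written loop for std.string.indexOf with No.caseSensitive. The helper names findEndTag and graphemeCount and all the sample strings are made up for illustration.

```d
import std.range : walkLength;
import std.string : indexOf;
import std.typecons : No;
import std.uni : byGrapheme, normalize, NFC;
import std.utf : count;

/// Hypothetical helper: index (in code units) of "</html>" in `line`,
/// ignoring letter case, or -1 if the tag is not present.
ptrdiff_t findEndTag(string line)
{
    return line.indexOf("</html>", No.caseSensitive);
}

/// Hypothetical helper: number of graphemes (user-perceived characters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // Invented sample line: Japanese text, the end tag, then trailing text.
    string line = "<p>\u65E5\u672C\u8A9E</p></HTML> trailing text";
    assert(findEndTag(line) >= 0);
    assert(findEndTag("<p>\u65E5\u672C\u8A9E</p>") == -1);

    // Code units vs. code points: three kanji, three UTF-8 bytes each.
    string kanji = "\u65E5\u672C\u8A9E";
    assert(kanji.length == 9);   // .length counts UTF-8 code units
    assert(count(kanji) == 3);   // std.utf.count counts code points

    // Code points vs. graphemes: the same visible character, two spellings.
    string precomposed = "\u00E9";  // single code point
    string combining = "e\u0301";   // 'e' plus combining acute accent
    assert(count(precomposed) == 1 && count(combining) == 2);
    assert(graphemeCount(precomposed) == 1);
    assert(graphemeCount(combining) == 1);

    // Normalization: the raw strings differ until both are in one form.
    assert(precomposed != combining);
    assert(normalize!NFC(combining) == precomposed);
}
```

Searching for an ASCII tag at the code-unit level is fine in UTF-8, since the bytes of a multi-byte sequence never look like ASCII characters. The grapheme and normalization steps should only matter once you start comparing or counting the Japanese text itself.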
May 01 2011