digitalmars.D.learn - regarding Latin1 to UTF8 encoding

Hugo Florentino <hugo acdam.cu> writes:

Hi,

I am having some problems trying to apply regular expressions to a 
webpage encoded in Latin1. I have unsuccessfully tried to convert it to 
UTF8 before applying the regular expression.

Initially I tried to do something like this:

auto input = readText("myfile.htm");
auto output = replace(input, re1, re2);

But I got this error when trying to run the application:
std.utf.UTFException C:\DMD2\Windows\bin\..\..\src\phobos\std\utf.d(1113): 
Invalid UTF-8 sequence (at index 1)

I then tried this, but the error remains:

auto input = readText("myfile.htm");
string buffer;
transcode(input, buffer);
auto output = replace(buffer, re1, re2);

Also, this did not work:

auto input = cast(string) read("myfile.htm");
string buffer;
transcode(input, buffer);
auto output = replace(buffer, re1, re2);

core.exception.AssertError std.encoding(1995): Assertion failure

Please, any help would be appreciated.

Regards, Hugo

Dec 08 2013

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Monday, 9 December 2013 at 02:40:29 UTC, Hugo Florentino wrote:
 auto input = readText("myfile.htm");

Don't use readText if it isn't UTF-8; readText assumes it is UTF-8.

I've never actually used std.encoding (I wrote my own encoding 
module for my dom.d, which I used for website scraping too) but I 
think this is what you want:

Latin1String input = cast(Latin1String) std.file.read("myfile.htm");
string buffer;
transcode(input, buffer);
auto output = replace(buffer, re1, re2);


see if that works
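
For anyone who wants to try this without a Latin1 file at hand, here is a minimal self-contained sketch of the same idea. The helper name latin1ToUtf8 and the hard-coded bytes are made up for illustration; only Latin1String and transcode come from std.encoding.

```d
import std.encoding : Latin1String, transcode;

/// Hypothetical helper: treats the bytes as Latin1 and returns UTF-8.
string latin1ToUtf8(immutable(ubyte)[] raw)
{
    // The cast only relabels the bytes as Latin1 characters;
    // transcode is what actually performs the conversion.
    auto latin1 = cast(Latin1String) raw;
    string utf8;
    transcode(latin1, utf8);
    return utf8;
}

void main()
{
    // "café" in Latin1: the é is the single byte 0xE9.
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    auto text = latin1ToUtf8(raw);

    assert(text == "caf\u00E9");
    assert(text.length == 5); // é takes two bytes in UTF-8
}
```

With a real file, the raw bytes would come from std.file.read instead of the literal array.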

Dec 08 2013

Hugo Florentino <hugo acdam.cu> writes:

Actually, it did work, even keeping input type as auto.
It seems the explicit typecast to Latin1String was the required 
element for it to work, which makes sense now that I think about it.

Thanks a lot for the (amazingly quick) reply ;)

Now, if I may add a closely related question:

Suppose "myfile.txt" was given to me daily by careless people who 
usually save it as Latin1 but from time to time might save it as UTF8.
Is there a way to detect the encoding prior to typecasting/loading the 
file?

Regards, Hugo

Dec 08 2013

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Monday, 9 December 2013 at 03:07:58 UTC, Hugo Florentino wrote:
 Is there a way to detect the encoding prior to 
 typecasting/loading the file?

UTF-8 can be detected fairly reliably, but not much luck for 
other encodings. A Windows-1258 and a Latin1 file, for example, 
are usually fairly indistinguishable from a binary perspective - 
they use the same numbers, just for different things.

(It is possible to distinguish them if you use some context and 
grammar check kind of things, but that's not easy.)


But utf-8 has a neat feature: any non-ascii stuff needs to 
validate, and it is unlikely that random data would correctly 
validate.

std.utf.validate can do that (though it throws an exception if it 
fails, ugh!)

So here's how I did it in my own characterencodings.d:

https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff/blob/master/characterencodings.d#L138


         import std.encoding : Latin1String, transcode;
         import std.utf : UTFException, validate;

         // rawData holds the file's bytes, e.g. from std.file.read
         string utf8string;
         try {
                 validate!string(cast(string) rawData);
                 // validation passed, assume it is UTF-8 and use it
                 utf8string = cast(string) rawData;
         } catch(UTFException t) {
                 // not utf-8, try latin1
                 transcode(cast(Latin1String) rawData, utf8string);
         }

         // now go ahead and use utf8string, it should be set
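
The same check can be wrapped in a function and exercised on both kinds of input. This is only a sketch: decodeToUtf8 is a made-up name, and the byte arrays stand in for whatever std.file.read would return.

```d
import std.encoding : Latin1String, transcode;
import std.utf : UTFException, validate;

/// Hypothetical helper: returns the bytes as UTF-8, falling back to
/// Latin1 when they do not validate as UTF-8.
string decodeToUtf8(immutable(ubyte)[] rawData)
{
    try
    {
        validate(cast(string) rawData);
        // Validation passed: assume UTF-8 and use the bytes unchanged.
        return cast(string) rawData;
    }
    catch (UTFException)
    {
        // Not valid UTF-8: assume Latin1 and convert.
        string result;
        transcode(cast(Latin1String) rawData, result);
        return result;
    }
}

void main()
{
    // "café" saved as UTF-8 (é = 0xC3 0xA9) and as Latin1 (é = 0xE9).
    immutable(ubyte)[] asUtf8   = [0x63, 0x61, 0x66, 0xC3, 0xA9];
    immutable(ubyte)[] asLatin1 = [0x63, 0x61, 0x66, 0xE9];

    assert(decodeToUtf8(asUtf8) == "caf\u00E9");
    assert(decodeToUtf8(asLatin1) == "caf\u00E9");
}
```

Note that a pure ASCII file validates as UTF-8 and is returned as-is, which is also the correct Latin1 reading, so that case needs no special handling.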

Dec 08 2013

Hugo Florentino <hugo acdam.cu> writes:

Clever solution, thanks.
Could this work using scope instead of try/catch?

P.S. Nice unit, by the way.

Dec 08 2013

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Monday, 9 December 2013 at 03:33:46 UTC, Hugo Florentino wrote:
 Could this work using scope instead of try/catch?

Maybe, but I don't think it would be very pretty. Really, I think 
validate should return a bool instead of throwing, but since it 
doesn't the try/catch is as close as it gets.
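
If the try/catch clutters the calling code, it can be tucked into a small wrapper that returns a bool. This is just a sketch and isValidUtf8 is a made-up name, not something in Phobos. A scope(failure) block would not help here, since it runs code when an exception passes through but does not swallow the exception.

```d
import std.utf : UTFException, validate;

/// Hypothetical wrapper: true if s is well-formed UTF-8, false otherwise.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // 0xE9 on its own is Latin1 for é, but a truncated sequence in UTF-8.
    immutable(ubyte)[] latin1Bytes = [0x63, 0x61, 0x66, 0xE9];

    assert(isValidUtf8("caf\u00E9"));
    assert(!isValidUtf8(cast(string) latin1Bytes));
}
```

std.encoding also has an isValid function that returns a bool, which may be worth a look as an alternative.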

 P.S. Nice unit, by the way.

BTW if you need to parse random html, grab that file and my dom.d 
from the same repo.

auto document = new Document();
document.parseGarbage(whatever_data);

parseGarbage tries to determine the character encoding 
automatically, from the validate check or the meta tags in the 
HTML if they are there, then guessing if not. It is pretty good 
at parsing broken html tag soup to make a dom similar to the 
browser.

Then you can get data out of it doing things like

auto firstParagraph = document.querySelector("p:first-child");
if(firstParagraph is null) writeln("no first child paragraph");
else writeln("first child paragraph text: ", firstParagraph.innerText);

and stuff like that. If you have used JavaScript before, dom.d 
should look fairly familiar.

Dec 08 2013
