
digitalmars.D.learn - regarding Latin1 to UTF8 encoding

reply Hugo Florentino <hugo acdam.cu> writes:
Hi,

I am having some problems trying to apply regular expressions to a 
webpage encoded in Latin1. I have unsuccessfully tried to convert it to 
UTF8 before applying the regular expression.

Initially I tried to do something like this:

auto input = readText("myfile.htm");
auto output = replace(input, re1, re2);

But I got this error when trying to run the application:
std.utf.UTFException C:\DMD2\Windows\bin\..\..\src\phobos\std\utf.d(1113): 
Invalid UTF-8 sequence (at index 1)

I then tried this, but the error remains

auto input = readText("myfile.htm");
string buffer;
transcode(input, buffer);
auto output = replace(buffer, re1, re2);

Also, this did not work:

auto input = cast(string) read("myfile.htm");
string buffer;
transcode(input, buffer);
auto output = replace(buffer, re1, re2);

core.exception.AssertError std.encoding(1995): Assertion failure

Please, any help would be appreciated.

Regards, Hugo
Dec 08 2013
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Monday, 9 December 2013 at 02:40:29 UTC, Hugo Florentino wrote:
> auto input = readText("myfile.htm");

Don't use readText if the file isn't UTF-8; readText assumes it is UTF-8. I've never actually used std.encoding (I wrote my own encoding module for my dom.d, which I used for website scraping too), but I think this is what you want:

Latin1String input = cast(Latin1String) std.file.read("myfile.htm");
string buffer;
transcode(input, buffer);
auto output = replace(buffer, re1, re2);

See if that works.
Dec 08 2013
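For reference, a minimal sketch of the approach suggested above, using Latin1String and transcode from std.encoding. The helper name latin1ToUtf8 is made up for illustration, and it takes raw bytes so the example does not depend on a file. Casting raw bytes to string claims they are already UTF-8, which appears to be why the earlier attempts failed; the cast to Latin1String is what tells transcode the source encoding.

```d
import std.encoding : Latin1String, transcode;

/// Hypothetical helper: interprets raw bytes as Latin-1 and returns UTF-8 text.
string latin1ToUtf8(const(void)[] raw)
{
    string utf8;
    // The cast only relabels the bytes as Latin-1; transcode does the conversion.
    transcode(cast(Latin1String) raw, utf8);
    return utf8;
}

void main()
{
    // "café" in Latin-1: the single byte 0xE9 is 'é'.
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    string text = latin1ToUtf8(raw);
    // In UTF-8 the same 'é' takes two bytes, so the length grows to 5.
    assert(text == "caf\u00E9");
    assert(text.length == 5);
}
```

With a file, the call would look like latin1ToUtf8(std.file.read("myfile.htm")), since read returns the raw bytes as void[].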
parent reply Hugo Florentino <hugo acdam.cu> writes:
On Mon, 09 Dec 2013 03:44:19 +0100, Adam D. Ruppe wrote:
> On Monday, 9 December 2013 at 02:40:29 UTC, Hugo Florentino wrote:
>> auto input = readText("myfile.htm");
>
> Don't use readText if the file isn't UTF-8; readText assumes it is UTF-8. I've never actually used std.encoding (I wrote my own encoding module for my dom.d, which I used for website scraping too), but I think this is what you want:
>
> Latin1String input = cast(Latin1String) std.file.read("myfile.htm");
> string buffer;
> transcode(input, buffer);
> auto output = replace(buffer, re1, re2);
>
> See if that works.

Actually, it did work, even keeping the input type as auto. It seems the explicit typecast to Latin1String was the required element for it to work, which makes sense now that I think about it. Thanks a lot for the (amazingly quick) reply ;)

Now, if I may add a closely related question: suppose "myfile.txt" was given to me daily by careless people who usually save it as Latin1 but from time to time might save it as UTF8. Is there a way to detect the encoding prior to typecasting/loading the file?

Regards, Hugo
Dec 08 2013
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Monday, 9 December 2013 at 03:07:58 UTC, Hugo Florentino wrote:
> Is there a way to detect the encoding prior to
> typecasting/loading the file?

UTF-8 can be detected fairly reliably, but not much luck for other encodings. A Windows-1258 and a Latin1 file, for example, are usually fairly indistinguishable from a binary perspective: they use the same numbers, just for different things. (It is possible to distinguish them if you use some context and grammar check kind of things, but that's not easy.)

But UTF-8 has a neat feature: any non-ASCII stuff needs to validate, and it is unlikely that random data would correctly validate. std.utf.validate can do that (though it throws an exception if it fails, ugh!)

So here's how I did it in my own characterencodings.d:

https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff/blob/master/characterencodings.d#L138

import std.encoding; // Latin1String, transcode
import std.utf;      // validate, UTFException

string utf8string;
try {
    validate!string(cast(string) rawdata);
    // validation passed, assume it is UTF-8 and use it
    utf8string = cast(string) rawdata;
} catch(UTFException t) {
    // not utf-8, try latin1
    transcode(cast(Latin1String) rawdata, utf8string);
}
// now go ahead and use utf8string, it should be set
Dec 08 2013
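The validate-then-fall-back idea above can be wrapped in a function. This is only a sketch under stated assumptions: the name decodeUtf8OrLatin1 is hypothetical, and anything that fails UTF-8 validation is simply assumed to be Latin-1, as in the snippet above.

```d
import std.encoding : Latin1String, transcode;
import std.utf : UTFException, validate;

/// Hypothetical helper: returns the bytes as UTF-8 text, treating anything
/// that does not validate as UTF-8 as Latin-1.
string decodeUtf8OrLatin1(const(void)[] rawdata)
{
    // No copy is made here: the result may alias the caller's buffer.
    auto candidate = cast(string) rawdata;
    try
    {
        validate(candidate); // throws UTFException on invalid UTF-8
        return candidate;
    }
    catch (UTFException)
    {
        string utf8;
        transcode(cast(Latin1String) rawdata, utf8);
        return utf8;
    }
}

void main()
{
    immutable(ubyte)[] asUtf8   = [0x63, 0x61, 0x66, 0xC3, 0xA9]; // "café" in UTF-8
    immutable(ubyte)[] asLatin1 = [0x63, 0x61, 0x66, 0xE9];       // "café" in Latin-1
    // Both inputs end up as the same UTF-8 string.
    assert(decodeUtf8OrLatin1(asUtf8) == "caf\u00E9");
    assert(decodeUtf8OrLatin1(asLatin1) == "caf\u00E9");
}
```

Pure ASCII input validates as UTF-8 and is returned unchanged, which is harmless because ASCII is encoded identically in both encodings.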
parent reply Hugo Florentino <hugo acdam.cu> writes:
On Mon, 09 Dec 2013 04:19:51 +0100, Adam D. Ruppe wrote:
> On Monday, 9 December 2013 at 03:07:58 UTC, Hugo Florentino wrote:
>> Is there a way to detect the encoding prior to typecasting/loading
>> the file?
>
> UTF-8 can be detected fairly reliably, but not much luck for other encodings. A Windows-1258 and a Latin1 file, for example, are usually fairly indistinguishable from a binary perspective: they use the same numbers, just for different things. (It is possible to distinguish them if you use some context and grammar check kind of things, but that's not easy.)
>
> But UTF-8 has a neat feature: any non-ASCII stuff needs to validate, and it is unlikely that random data would correctly validate. std.utf.validate can do that (though it throws an exception if it fails, ugh!)
>
> So here's how I did it in my own characterencodings.d:
>
> https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff/blob/master/characterencodings.d#L138
>
> string utf8string;
> import std.utf;
> try {
>     validate!string(cast(string) rawdata);
>     // validation passed, assume it is UTF-8 and use it
>     utf8string = cast(string) rawdata;
> } catch(UTFException t) {
>     // not utf-8, try latin1
>     transcode(cast(Latin1String) rawdata, utf8string);
> }
> // now go ahead and use utf8string, it should be set

Clever solution, thanks. Could this work using scope instead of try/catch?

P.S. Nice unit, by the way.
Dec 08 2013
parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Monday, 9 December 2013 at 03:33:46 UTC, Hugo Florentino wrote:
> Could this work using scope instead of try/catch?

Maybe, but I don't think it would be very pretty. Really, I think validate should return a bool instead of throwing, but since it doesn't, the try/catch is as close as it gets.
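
A bool-returning check along those lines can be sketched by hiding the try/catch in a small wrapper. The name isValidUtf8 is hypothetical and not part of Phobos. A scope(failure) block would not help here, because it runs during unwinding but does not catch the exception, so the UTFException would still propagate to the caller.

```d
import std.utf : UTFException, validate;

/// Hypothetical wrapper: reports UTF-8 validity as a bool instead of throwing.
bool isValidUtf8(const(char)[] text)
{
    try
    {
        validate(text); // throws UTFException on the first invalid sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // 0xE9 starts a multi-byte sequence but nothing follows it,
    // so these bytes are valid Latin-1 and invalid UTF-8.
    immutable(ubyte)[] latin1Bytes = [0x63, 0x61, 0x66, 0xE9];
    assert(isValidUtf8("plain ascii"));
    assert(isValidUtf8("caf\u00E9"));
    assert(!isValidUtf8(cast(const(char)[]) latin1Bytes));
}
```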

> P.S. Nice unit, by the way.

BTW, if you need to parse random HTML, grab that file and my dom.d from the same repo.

auto document = new Document();
document.parseGarbage(whatever_data);

parseGarbage tries to determine the character encoding automatically, from the validate check or the meta tags in the HTML if they are there, then guessing if not. It is pretty good at parsing broken HTML tag soup to make a DOM similar to the browser's. Then you can get data out of it doing things like

auto firstParagraph = document.querySelector("p:first-child");
if(firstParagraph is null)
    writeln("no first child paragraph");
else
    writeln("first child paragraph text: ", firstParagraph.innerText);

and stuff like that. If you have used JavaScript before, dom.d should look fairly familiar.
Dec 08 2013