digitalmars.D.learn - How to know whether a file's encoding is ansi or utf8?
- Sam Hu (6/6) Jul 22 2014 Greetings!
- Sam Hu (4/10) Jul 22 2014 Sorry,I mean by by code,for example,when I try to read a file
- FreeSlave (6/12) Jul 22 2014 By ANSI do you mean Windows code pages? Text editors usually use
- Sam Hu (11/26) Jul 22 2014 Thanks.
- Alexandre (37/43) Jul 22 2014 Read the BOM ?
- Sam Hu (2/48) Jul 22 2014 Thanks. This is exactly what I want at this moment.
- Kagamin (3/3) Jul 24 2014 I first try to load the file as utf8 (or some 8kb at the start of
Greetings! As the subject says, how can I know whether a file is in UTF8 encoding or ansi? Thanks for the help in advance. Regards, Sam
Jul 22 2014
On Tuesday, 22 July 2014 at 09:50:00 UTC, Sam Hu wrote:
> Greetings! As the subject says, how can I know whether a file is in UTF8 encoding or ansi? Thanks for the help in advance. Regards, Sam

Sorry, I mean by code. For example, when I read a file's content and print it to a text control in a GUI, or to the console, I will proceed differently depending on the file encoding.
Jul 22 2014
On Tuesday, 22 July 2014 at 09:50:00 UTC, Sam Hu wrote:
> Greetings! As the subject says, how can I know whether a file is in UTF8 encoding or ansi? Thanks for the help in advance. Regards, Sam

By ANSI do you mean Windows code pages? Text editors usually use some heuristics (statistical analysis, for example) to determine the encoding of a file. Note that these methods are not always accurate, so you need to give users the ability to choose another encoding.
Jul 22 2014
On Tuesday, 22 July 2014 at 11:09:36 UTC, FreeSlave wrote:
> On Tuesday, 22 July 2014 at 09:50:00 UTC, Sam Hu wrote:
>> Greetings! As the subject says, how can I know whether a file is in UTF8 encoding or ansi? Thanks for the help in advance. Regards, Sam
>
> By ANSI do you mean Windows code pages? Text editors usually use some heuristics (statistical analysis, for example) to determine the encoding of a file. Note that these methods are not always accurate, so you need to give users the ability to choose another encoding.

Thanks. Yes. It is Windows related again... I found that writefln() can print ansi-encoded files to the console and show their content correctly in an Asian font environment, but this does not work for files with UTF8 encoding. On the other hand, the Tango for D2 branch can print UTF8-encoded files to the console and show their content correctly in the same environment. I tried a 'both-way' approach with Tango but failed. So I have a silly idea: when I encounter a file to be printed to the console, I choose writefln or Tango's Stdout.formatln depending on the file encoding.
Jul 22 2014
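One possible route for the console problem described above, given here as an assumption and not something tested in this thread: the Windows console decodes program output using its active output code page, so UTF8 bytes written by writefln only display correctly when that code page is UTF-8 (65001). The sketch below uses the Win32 call SetConsoleOutputCP; `useUtf8Console` is a hypothetical helper name, and whether the characters actually render still depends on the console font and the Windows version.

```d
/// Hypothetical helper: ask the Windows console to interpret output as UTF-8.
/// Returns true on success. On other platforms it does nothing and returns true.
bool useUtf8Console()
{
    version (Windows)
    {
        import core.sys.windows.windows : SetConsoleOutputCP;

        enum uint utf8CodePage = 65001; // the Win32 CP_UTF8 value
        return SetConsoleOutputCP(utf8CodePage) != 0;
    }
    else
    {
        return true;
    }
}
```

Calling `useUtf8Console()` once at the start of main, before the first writefln, would be the intended use. An ansi file would then need converting to UTF8 before printing, which is where detecting the encoding comes in.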
Read the BOM ?

    module main;

    import std.stdio;

    enum Encoding
    {
        UTF7,
        UTF8,
        UTF32,
        Unicode,
        BigEndianUnicode,
        ASCII
    }

    Encoding GetFileEncoding(string fileName)
    {
        import std.file;

        // Read at most the first 4 bytes; a short file returns fewer,
        // so every check below tests the length before indexing.
        auto bom = cast(ubyte[]) read(fileName, 4);

        if (bom.length >= 3 && bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76)
            return Encoding.UTF7;
        if (bom.length >= 3 && bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf)
            return Encoding.UTF8;
        if (bom.length >= 2 && bom[0] == 0xff && bom[1] == 0xfe)
            return Encoding.Unicode; // UTF-16LE
        if (bom.length >= 2 && bom[0] == 0xfe && bom[1] == 0xff)
            return Encoding.BigEndianUnicode; // UTF-16BE
        if (bom.length >= 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff)
            return Encoding.UTF32; // UTF-32BE
        return Encoding.ASCII; // no BOM found
    }

    void main(string[] args)
    {
        if (GetFileEncoding("test.txt") == Encoding.UTF8)
            writeln("The file is UTF8");
        else
            writeln("File is not UTF8 :(");
    }

On Tuesday, 22 July 2014 at 09:50:00 UTC, Sam Hu wrote:
> Greetings! As the subject says, how can I know whether a file is in UTF8 encoding or ansi? Thanks for the help in advance. Regards, Sam
Jul 22 2014
On Tuesday, 22 July 2014 at 11:59:34 UTC, Alexandre wrote:
> Read the BOM ?
> [code snipped]

Thanks. This is exactly what I want at this moment.
Jul 22 2014
Note that BOMs are optional and may not be present in a Unicode file. Also, the presence of leading bytes that look like a BOM does not necessarily mean that the file is encoded in some kind of Unicode.
Jul 22 2014
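Both caveats can be made concrete with a small sketch. It treats the BOM as a hint only, so a missing BOM is reported as "unknown" and not as ASCII. `BomHint`, `hasPrefix` and `bomHint` are hypothetical names chosen for illustration, and UTF-7 is left out for brevity.

```d
/// Hypothetical result type: a BOM is only a hint, and `none` means "unknown".
enum BomHint { none, utf8, utf16le, utf16be, utf32le, utf32be }

/// True if `data` begins with the bytes in `mark`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] mark)
{
    return data.length >= mark.length && data[0 .. mark.length] == mark;
}

/// Report which BOM, if any, the buffer starts with.
BomHint bomHint(const(ubyte)[] data)
{
    // Test the 4-byte marks first: the UTF-32LE mark FF FE 00 00
    // begins with the UTF-16LE mark FF FE.
    if (hasPrefix(data, [0xFF, 0xFE, 0x00, 0x00])) return BomHint.utf32le;
    if (hasPrefix(data, [0x00, 0x00, 0xFE, 0xFF])) return BomHint.utf32be;
    if (hasPrefix(data, [0xEF, 0xBB, 0xBF])) return BomHint.utf8;
    if (hasPrefix(data, [0xFF, 0xFE])) return BomHint.utf16le;
    if (hasPrefix(data, [0xFE, 0xFF])) return BomHint.utf16be;
    return BomHint.none;
}

unittest
{
    // "é" encoded as UTF-8 without a BOM: Unicode text, yet no hint.
    assert(bomHint([0xC3, 0xA9]) == BomHint.none);
    // EF BB BF also spells "ï»¿" in Windows-1252, so a code-page file
    // that starts with those characters produces a false positive.
    assert(bomHint([0xEF, 0xBB, 0xBF, 0x41]) == BomHint.utf8);
    // Buffers shorter than a mark are handled without indexing errors.
    assert(bomHint([0xEF]) == BomHint.none);
    assert(bomHint(null) == BomHint.none);
}
```

The unittest block runs with `dmd -unittest -main -run file.d`. When the result is `none`, the caller still has to fall back on a heuristic or on a user setting.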
http://www.architectshack.com/TextFileEncodingDetector.ashx

On Tuesday, 22 July 2014 at 15:53:23 UTC, FreeSlave wrote:
> Note that BOMs are optional and may not be present in a Unicode file. Also, the presence of leading bytes that look like a BOM does not necessarily mean that the file is encoded in some kind of Unicode.

There are several difficulties in this case ...
Jul 22 2014
I first try to load the file as utf8 (or some 8kb at the start of it) with encoding exceptions turned on. If I catch an exception, I reload it as ansi; otherwise I assume it's valid utf8.
Jul 24 2014
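A minimal sketch of this validate-first approach, assuming Phobos' std.utf.validate serves as the "encoding exceptions turned on" step. `looksLikeUtf8` and `fileLooksLikeUtf8` are hypothetical names, and the trimming of a multi-byte sequence cut off at the end of the 8 KB sample is an addition that the post does not mention.

```d
import std.file : read;
import std.utf : validate, UTFException;

/// True if `sample` is valid UTF-8. With `mayBeTruncated` set, a multi-byte
/// sequence cut off at the very end of the sample is ignored.
bool looksLikeUtf8(const(ubyte)[] sample, bool mayBeTruncated = false)
{
    if (mayBeTruncated)
    {
        // Step back over at most 3 trailing continuation bytes (10xxxxxx).
        size_t i = sample.length;
        size_t back = 0;
        while (i > 0 && back < 3 && (sample[i - 1] & 0xC0) == 0x80)
        {
            --i;
            ++back;
        }
        // If a lead byte precedes them and its sequence is incomplete, drop it.
        if (i > 0 && sample[i - 1] >= 0xC0)
        {
            immutable lead = sample[i - 1];
            immutable size_t need = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2;
            if (back + 1 < need)
                sample = sample[0 .. i - 1];
        }
    }

    try
    {
        validate(cast(const(char)[]) sample); // throws UTFException if invalid
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Check only the first `sampleSize` bytes of the file, 8 KB by default.
bool fileLooksLikeUtf8(string fileName, size_t sampleSize = 8 * 1024)
{
    auto sample = cast(const(ubyte)[]) read(fileName, sampleSize);
    // A full-size sample means the file may continue past the cut.
    return looksLikeUtf8(sample, sample.length == sampleSize);
}

unittest
{
    assert(looksLikeUtf8([0x48, 0x69]));               // "Hi", plain ASCII
    assert(looksLikeUtf8([0xC3, 0xA9]));               // "é" in UTF-8
    assert(!looksLikeUtf8([0x63, 0x61, 0x66, 0xE9]));  // "café" in Windows-1252
    assert(looksLikeUtf8([0x41, 0xE4, 0xBD], true));   // sample cut inside "你"
    assert(!looksLikeUtf8([0x41, 0xE4, 0xBD]));        // same bytes as a whole file
}
```

A pure ASCII file also passes, which is harmless because ASCII bytes mean the same thing in UTF8 and in the Windows code pages. When the check fails, the caller would treat the file as ansi, as described in the post.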