digitalmars.D.learn - basic question about text file encodings

"Laeeth Isharc" <nospamlaeeth nospam.laeeth.com> writes:
What is the best way to figure out and then decode a file of 
unknown coding to dchar?

2½% Index-linked Treasury Stock 

E.g. I am not sure what encoding the ½ is in above, but treating 
it as UTF-8 leads to an exception.

I will post the answer to the wiki, as it's the typical kind of 
thing that stumps a newcomer to the language.

I can see how to figure out a guess at the file format from 
Tango, but what do I do with the bytes from std.file.read once I 
have guessed the encoding?

And more generally shouldn't we have an easy solution in Phobos 
for this common problem (presuming I am not just overlooking what 
is there?)
Apr 16 2015
"Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 16 April 2015 at 19:22:41 UTC, Laeeth Isharc wrote:
 What is the best way to figure out and then decode a file of 
 unknown coding to dchar?
You generally can't, though some statistical analysis can sometimes help. The encoding needs to be known through some other means to have a correct conversion.

How was the file generated? If it came from Excel, it might be in the Windows encoding. You can try my characterencodings.d:

https://github.com/adamdruppe/arsd/blob/master/characterencodings.d

This is a standalone file; just download it, add it to your build, and do

    string utf8 = convertToUtf8Lossy(your_data, "windows-1252");

and it will work, though it might drop a character if it doesn't know how to convert it (hence Lossy in the name). There's also a `convertToUtf8` function which never drops characters it doesn't know. Then examine the string and see if it looks right to you.

Alternatively, with Phobos only, you can try:

    import std.conv, std.encoding;
    string utf8 = to!string(Windows1252String(your_data));

Both my module and the Phobos module expect your input data to be immutable(ubyte)[], so you might need to cast to that.

The Phobos module is great if you know the type at compile time and it is one of the few encodings it supports. My module is a bit better at taking random runtime data (I wrote it to support website and email screen scraping).
Apr 16 2015
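
For the Phobos-only route, here is a minimal sketch of converting raw bytes once Windows-1252 has been guessed as the encoding. It uses `std.encoding.transcode`, which is a different route from the `to!string` line in the post above. The function name `fromWindows1252` is made up for illustration and is not part of Phobos or arsd.

```d
import std.encoding : Windows1252String, transcode;

/// Hypothetical helper: converts bytes assumed to be Windows-1252 into UTF-8.
string fromWindows1252(immutable(ubyte)[] data)
{
    // Reinterpret the raw bytes as Windows-1252 code units (no copy is made).
    auto source = cast(Windows1252String) data;

    // transcode decodes each source code unit and re-encodes it as UTF-8.
    string utf8;
    transcode(source, utf8);
    return utf8;
}
```

Since std.file.read returns void[], the bytes would need a cast first, for example `fromWindows1252(cast(immutable(ubyte)[]) std.file.read(path))`, where `path` stands in for your own file name.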
"Kagamin" <spam here.lot> writes:
First try UTF-8; if that doesn't work, fall back to some 
other encoding like Latin-1.
Apr 16 2015
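
The try-UTF-8-then-fall-back idea could be sketched with Phobos alone roughly as follows. This is only an illustration: the helper name `decodeWithFallback` is hypothetical, and Latin-1 as the fallback is an assumption that will not be right for every file.

```d
import std.encoding : Latin1String, transcode;
import std.utf : UTFException, validate;

/// Hypothetical helper: returns the data as UTF-8, treating it as Latin-1
/// when it is not already valid UTF-8.
string decodeWithFallback(immutable(ubyte)[] data)
{
    // View the bytes as a string without copying, then check them.
    auto candidate = cast(string) data;
    try
    {
        validate(candidate); // throws UTFException on malformed UTF-8
        return candidate;
    }
    catch (UTFException)
    {
        // Not valid UTF-8: assume Latin-1, where every byte maps to a code point.
        string result;
        transcode(cast(Latin1String) data, result);
        return result;
    }
}
```

With the sample line from the question, a lone 0xBD byte is not valid UTF-8, so the fallback path runs and the byte is decoded as ½. Valid UTF-8 input is returned unchanged.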