
digitalmars.D.learn - How can I convert a file encoded in CP936 to a file with UTF-8 encoding

reply rocex <rocexwang gmail.com> writes:
How can I convert a file encoded in CP936 to a file with UTF-8 
encoding?
Jul 13 2022
parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Wednesday, 13 July 2022 at 11:47:56 UTC, rocex wrote:
 How can I convert a file encoded in CP936 to a file with UTF-8 
 encoding?
My lib doesn't include it, but the basic idea is to take this table: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT and do the conversions. Loop through the input bytes: if a byte is < 128, it stays the same; if it == 128, it is 0x20AC; if it is greater than that, you need to read the second byte too and look the pair up in that table. For many of the lead bytes the mapped values increase in sequence, so you might only need part of the actual lookup table and could do the rest with some addition. From lead byte 83 it looks like an almost sequential offset. Still, it is probably safest to just copy the whole table.
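A minimal sketch of that loop, written in Python purely for illustration (it is not D and is not taken from any library mentioned in this thread). The names parse_table, cp936_to_utf8 and SAMPLE_TABLE are made up for the example, and SAMPLE_TABLE is a tiny hand-copied stand-in for the real CP936.TXT, whose lines are assumed to look like "0x8140<TAB>0x4E02<TAB>#comment".

```python
# Hypothetical stand-in for CP936.TXT: only a handful of entries, in the
# same "cp936 code <TAB> unicode code point <TAB> #comment" layout.
SAMPLE_TABLE = (
    "# comment lines and blank lines are ignored\n"
    "0x80\t0x20AC\t#EURO SIGN\n"
    "0x81\t\t#DBCS LEAD BYTE\n"
    "0x8140\t0x4E02\t#CJK UNIFIED IDEOGRAPH\n"
    "0xBAC3\t0x597D\t#CJK UNIFIED IDEOGRAPH\n"
    "0xC4E3\t0x4F60\t#CJK UNIFIED IDEOGRAPH\n"
)


def parse_table(text):
    """Build {cp936 code: unicode code point} from CP936.TXT-style text."""
    table = {}
    for line in text.splitlines():
        # Drop the trailing comment, then split the remaining columns.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            # Blank line, pure comment, or a lead-byte marker with no mapping.
            continue
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def cp936_to_utf8(data, table):
    """Convert CP936-encoded bytes to UTF-8 bytes using the lookup table."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Below 128: the byte stays the same.
            out.append(chr(b))
            i += 1
        elif b == 0x80:
            # Exactly 128: the euro sign, U+20AC.
            out.append("\u20ac")
            i += 1
        else:
            # Above 128: read the second byte too and look the pair up.
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte sequence at offset %d" % i)
            code = (b << 8) | data[i + 1]
            if code not in table:
                raise ValueError("no mapping for 0x%04X at offset %d" % (code, i))
            out.append(chr(table[code]))
            i += 2
    return "".join(out).encode("utf-8")
```

To convert a real file, one would presumably pass the full text of CP936.TXT to parse_table, read the input file in binary mode, and write the returned bytes to the output file in binary mode. Raising on an unknown pair is a design choice for the sketch; substituting a replacement character would be another option.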
Jul 13 2022
parent rocex <rocexwang gmail.com> writes:
I found this: https://github.com/guotie/gogb2312. The algorithm should be the same.
Jul 13 2022