digitalmars.D.learn - How can I convert a file encode by CP936 to a file with UTF-8 encoding


rocex <rocexwang gmail.com> writes:

How can I convert a file encoded in CP936 to a file with UTF-8 
encoding?

Jul 13 2022

Adam D Ruppe <destructionator gmail.com> writes:

On Wednesday, 13 July 2022 at 11:47:56 UTC, rocex wrote:
 How can I convert a file encoded in CP936 to a file with UTF-8 
 encoding?

My lib doesn't have it included but the basic idea is to take 
this table:

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT

and do the conversions. So loop through the input bytes: if a byte 
is < 128, it stays the same; if it == 128 (0x80), it is 0x20AC (the 
euro sign); and if it is greater than that, you need to read the 
second byte too and look the pair up in that table.

It looks like for many of the bytes, the mapped values increase in 
sequence, so you might only need part of the actual lookup table, 
and the rest you can do with some addition. From lead byte 83 on, 
it looks like an almost sequential offset. Probably safest to just 
copy the whole table.
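
A minimal sketch of that loop in D, for illustration only. The names 
parseTable and cp936ToUtf8 are hypothetical (they are not from any 
existing library), and the sketch assumes the text of CP936.TXT has 
already been loaded into a string. It uses the whole table rather 
than the offset shortcut, and it emits U+FFFD for anything it cannot 
map.

```d
import std.algorithm.searching : startsWith;
import std.array : appender, split;
import std.conv : to;
import std.string : lineSplitter;

/// Build a lookup from CP936 code (one or two bytes) to Unicode code point.
/// Mapping lines look like "0x8140<TAB>0x4E02<TAB>#CJK UNIFIED IDEOGRAPH".
/// Comment lines and lines without a second hex field are skipped.
dchar[ushort] parseTable(string text)
{
    dchar[ushort] table;
    foreach (line; text.lineSplitter)
    {
        auto fields = line.split; // split on whitespace
        if (fields.length < 2
            || !fields[0].startsWith("0x")
            || !fields[1].startsWith("0x"))
            continue;
        immutable key = fields[0][2 .. $].to!ushort(16);
        immutable value = fields[1][2 .. $].to!uint(16);
        table[key] = cast(dchar) value;
    }
    return table;
}

/// Convert CP936 bytes to a UTF-8 string using the parsed table.
/// Assumption: every byte above 0x80 is treated as a lead byte and
/// consumes the following byte; trail bytes are not validated.
string cp936ToUtf8(const(ubyte)[] input, const dchar[ushort] table)
{
    immutable dchar replacement = 0xFFFD; // for unmapped input
    auto result = appender!string;
    for (size_t i = 0; i < input.length; ++i)
    {
        immutable ubyte b = input[i];
        if (b < 0x80)
        {
            result.put(cast(char) b); // ASCII stays the same
            continue;
        }
        ushort key = b; // 0x80 is looked up as a single byte
        if (b > 0x80 && i + 1 < input.length)
        {
            key = cast(ushort)((b << 8) | input[i + 1]);
            ++i; // the second byte has been consumed
        }
        if (auto p = key in table)
        {
            dchar c = *p;
            result.put(c); // the appender encodes the code point as UTF-8
        }
        else
            result.put(replacement);
    }
    return result.data;
}
```

To convert a file, the input could be read with std.file.read, cast 
to const(ubyte)[], passed through cp936ToUtf8, and the result written 
back out with std.file.write.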

Jul 13 2022

rocex <rocexwang gmail.com> writes:

On Wednesday, 13 July 2022 at 12:00:43 UTC, Adam D Ruppe wrote:
 On Wednesday, 13 July 2022 at 11:47:56 UTC, rocex wrote:
 How can I convert a file encoded in CP936 to a file with UTF-8 
 encoding?

 My lib doesn't have it included but the basic idea is to take 
 this table:

 https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT

 and do the conversions. So loop through the input bytes: if a byte 
 is < 128, it stays the same; if it == 128 (0x80), it is 0x20AC (the 
 euro sign); and if it is greater than that, you need to read the 
 second byte too and look the pair up in that table.

 It looks like for many of the bytes, the mapped values increase in 
 sequence, so you might only need part of the actual lookup table, 
 and the rest you can do with some addition. From lead byte 83 on, 
 it looks like an almost sequential offset. Probably safest to just 
 copy the whole table.

I found https://github.com/guotie/gogb2312; the algorithm 
should be the same.

Jul 13 2022
