
digitalmars.D.learn - Best way to read/write Chinese (GBK/GB18030) files?

John Xu <728308756 qq.com> writes:
I'm new to dlang. I couldn't find many tutorials online about 
how to read/write Chinese easily. std.encoding doesn't seem to 
support GBK or GB18030:

"Encodings currently supported are UTF-8, UTF-16, UTF-32, ASCII, 
ISO-8859-1 (also known as LATIN-1), ISO-8859-2 (LATIN-2), 
WINDOWS-1250, WINDOWS-1251 and WINDOWS-1252."

Then what is the best way to read GBK/GB18030 content? And what 
about GBK/GB18030 file names?
Mar 06 2023
Steven Schveighoffer <schveiguy gmail.com> writes:
On 3/6/23 8:45 PM, John Xu wrote:
 I'm new to dlang. I didn't find much tutorials on internet about how to 
 read/write Chinese easily. std.encoding doesn't seem to support GBK or 
 GB18030:
 
 "Encodings currently supported are UTF-8, UTF-16, UTF-32, ASCII, 
 ISO-8859-1 (also known as LATIN-1), ISO-8859-2 (LATIN-2), WINDOWS-1250, 
 WINDOWS-1251 and WINDOWS-1252."
It appears that encoding is not supported. There is a scant mention of it, in the BOM detection. But I don't think there's any mechanism to encode/decode it.
 
 Then what is best way to read GBK/GB18030 contents ? Even GBK/GB18030 
 file names ?
 
 
D has direct bindings to C, so possibly using a C library. I don't see anything jumping out at me from code.dlang.org.

-Steve
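
One way to follow the C-library route is POSIX `iconv`. The sketch below is only an illustration under stated assumptions: it assumes a POSIX system whose C library ships `iconv` with a converter named "GBK", it declares the three C prototypes by hand, and `gbkToUtf8` is a hypothetical helper name, not an existing D API.

```d
// POSIX iconv prototypes, declared by hand for this sketch.
extern (C) nothrow @nogc
{
    void* iconv_open(const(char)* tocode, const(char)* fromcode);
    size_t iconv(void* cd, char** inbuf, size_t* inbytesleft,
                 char** outbuf, size_t* outbytesleft);
    int iconv_close(void* cd);
}

/// Hypothetical helper: converts GBK-encoded bytes to a UTF-8 string.
string gbkToUtf8(const(ubyte)[] gbk)
{
    // Assumption: the platform's iconv knows the name "GBK".
    auto cd = iconv_open("UTF-8", "GBK");
    if (cd == cast(void*) -1)
        throw new Exception("iconv_open failed: GBK may be unavailable");
    scope (exit) iconv_close(cd);

    // A 2-byte GBK character becomes at most 3 UTF-8 bytes,
    // so twice the input length is enough room for the output.
    auto outBuf = new char[](gbk.length * 2);
    char* inPtr = cast(char*) gbk.ptr;
    size_t inLeft = gbk.length;
    char* outPtr = outBuf.ptr;
    size_t outLeft = outBuf.length;

    if (iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft) == size_t.max)
        throw new Exception("iconv failed: invalid or incomplete GBK input");

    // Keep only the bytes that iconv actually produced.
    return outBuf[0 .. $ - outLeft].idup;
}
```

Swapping the two arguments of `iconv_open` would give the reverse direction (UTF-8 to GBK), with the result kept in a `ubyte[]` rather than a `string`.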
Mar 06 2023
ryuukk_ <ryuukk.dev gmail.com> writes:
On Tuesday, 7 March 2023 at 01:45:27 UTC, John Xu wrote:
 I'm new to dlang. I didn't find much tutorials on internet 
 about how to read/write Chinese easily. std.encoding doesn't 
 seem to support GBK or GB18030:

 "Encodings currently supported are UTF-8, UTF-16, UTF-32, 
 ASCII, ISO-8859-1 (also known as LATIN-1), ISO-8859-2 
 (LATIN-2), WINDOWS-1250, WINDOWS-1251 and WINDOWS-1252."

 Then what is best way to read GBK/GB18030 contents ? Even 
 GBK/GB18030 file names ?
I found this:
https://github.com/meatatt/exCode/blob/master/source/excode/package.d

There is mention of Unicode/GBK conversion, maybe it could be helpful.
Mar 06 2023
John Xu <728308756 qq.com> writes:
 I found this: 
 https://github.com/meatatt/exCode/blob/master/source/excode/package.d

 There is mention of unicode/GBK conversion, maybe it could be 
 helpful
Thanks for the quick answers. I've now found that I can read both UTF-8 and UTF-16LE Chinese files:

`string txt = std.file.read(chineseFile).to!string;`

and write to a UTF-8 file:

`std.file.write(utf8ChineseFile, txt);`

But I still need to figure out how to read/write GBK directly.
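
For the UTF-8 case, `std.file.readText` may be worth considering, since it validates the input while reading. A minimal sketch, where `copyUtf8Text` and its parameter names are hypothetical and chosen only for illustration:

```d
import std.file : readText, write;

/// Hypothetical helper: reads a UTF-8 text file, writes the text to
/// another file, and returns what was read.
string copyUtf8Text(string inPath, string outPath)
{
    // readText validates the contents and throws a UTFException
    // if the file is not valid UTF-8 (a GBK file, for example).
    string txt = readText(inPath);
    write(outPath, txt);
    return txt;
}
```

A GBK file passed to this helper would most likely fail validation, which at least makes the encoding mismatch visible instead of producing garbled text.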
Mar 09 2023
zjh <fqbqrr 163.com> writes:
On Friday, 10 March 2023 at 02:48:43 UTC, John Xu wrote:


```d
module chinese;
import std.stdio : writeln;
import std.conv;
import std.windows.charset;

int main(string[] argv)
{
    auto s1 = "中文"; // UTF-8 string
    writeln("word:" ~ s1); // garbled output
    writeln("word:" ~ to!string(toMBSz(text(s1)))); // displays correctly after conversion
    writeln("Hello D-World!");
    return 0;
}
```
Mar 09 2023
zjh <fqbqrr 163.com> writes:
On Friday, 10 March 2023 at 06:19:38 UTC, zjh wrote:


`D language` is too unfriendly for Chinese users!
You can't even write `gbk` files.
Mar 09 2023
0xEAB <desisma heidel.beer> writes:
On Friday, 10 March 2023 at 07:16:32 UTC, zjh wrote:
 `D language` is too unfriendly for Chinese users!
 You can't even write `gbk` files.
D’s char + string types are Unicode. To quote the tour, “In D, *all* strings are Unicode strings”. If you desire to use other encodings, how about using ubyte + ubyte[]?
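
To illustrate why GBK data belongs in `ubyte[]` rather than `string`, here is a small sketch that checks whether a byte sequence is valid UTF-8. `isValidUtf8` is a hypothetical helper name; the byte values used in the note below are the GBK encoding of "中文".

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: returns true if the bytes form valid UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // Reinterpret the bytes as UTF-8 code units and validate them.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

The GBK bytes `[0xD6, 0xD0, 0xCE, 0xC4]` are rejected by this check, while the UTF-8 bytes of the same text are accepted, so treating GBK data as a `string` would hand invalid Unicode to the rest of the program.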
Mar 11 2023
zjh <fqbqrr 163.com> writes:
On Saturday, 11 March 2023 at 19:56:09 UTC, 0xEAB wrote:

 If you desire to use other encodings, how about using ubyte + 
 ubyte[]?
There is no example. An example should be added in an obvious position. I tried for a long time, but couldn't output `gbk`, and I finally gave up.
Mar 11 2023
0xEAB <desisma heidel.beer> writes:
On Sunday, 12 March 2023 at 00:54:53 UTC, zjh wrote:
 On Saturday, 11 March 2023 at 19:56:09 UTC, 0xEAB wrote:

 If you desire to use other encodings, how about using ubyte + 
 ubyte[]?
There is no example.
To read binary data from a file and dump it into another, you do:

```d
import std.file : read, write;

void[] data = read("infile.txt");
write("outfile.txt", data);
```

To write binary data to a file:

```d
import std.file : write;

ubyte[] data = [0xA0, 0x0A, 0x30, 0x01, 0xFF, 0x00, 0xFE];
write("myfile.txt", data);
```

`data` could contain GBK encoded text, for example. (Just don’t use `"Unicode literals"`.)
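
Applied to GBK specifically, a round trip might look like the sketch below. `roundTrip` is a hypothetical helper, and the file path is whatever the caller supplies; the point is only that the bytes come back unchanged because no decoding happens anywhere.

```d
import std.file : read, write;

/// Hypothetical helper: writes raw bytes to a file and reads them back.
ubyte[] roundTrip(string path, const(ubyte)[] data)
{
    write(path, data);              // bytes are written as-is
    return cast(ubyte[]) read(path); // read returns void[], so cast it
}
```

Calling it with `[0xD6, 0xD0, 0xCE, 0xC4]` (the GBK bytes for "中文") should produce a file that a GBK-aware editor displays as that text.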
Mar 12 2023
zjh <fqbqrr 163.com> writes:
On Sunday, 12 March 2023 at 20:03:23 UTC, 0xEAB wrote:
 ...
Thank you for your reply, but is there any way to output `gbk` code to the console?
Mar 12 2023
Steven Schveighoffer <schveiguy gmail.com> writes:
On 3/12/23 8:32 PM, zjh wrote:
 On Sunday, 12 March 2023 at 20:03:23 UTC, 0xEAB wrote:
 ...
Thank you for your reply, but is there any way to output `gbk` code to the console?
What is required is an addition to the `std.encoding` module to support such an encoding. An encoding scheme simply translates text from one encoding (e.g. UTF) to another (e.g. GBK). If you look at `std.encoding`, you can get an idea of what it might require. It will take some effort, and especially some help from a knowledgeable user (such as yourself).

-Steve
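
To give a rough idea of the byte-level work such an addition would involve, the sketch below walks a GBK byte sequence and counts its characters. It is not a decoder: mapping each two-byte code to a Unicode code point needs a large lookup table that is not included here. `countGbkChars` is a hypothetical name, and the byte ranges (lead byte 0x81 to 0xFE, trail byte 0x40 to 0xFE excluding 0x7F) follow the usual description of GBK; some code pages also treat 0x80 as a single-byte character, which this sketch rejects.

```d
/// Hypothetical helper: counts the characters in a GBK byte sequence,
/// or returns -1 if the byte structure is malformed.
long countGbkChars(const(ubyte)[] bytes)
{
    long count = 0;
    size_t i = 0;
    while (i < bytes.length)
    {
        immutable b = bytes[i];
        if (b < 0x80)
        {
            i += 1; // single-byte ASCII
        }
        else if (b >= 0x81 && b <= 0xFE && i + 1 < bytes.length)
        {
            immutable t = bytes[i + 1];
            // The trail byte must be 0x40..0xFE and must not be 0x7F.
            if (t < 0x40 || t == 0x7F || t == 0xFF)
                return -1;
            i += 2; // double-byte character
        }
        else
        {
            return -1; // invalid lead byte or truncated input
        }
        ++count;
    }
    return count;
}
```

A real `std.encoding` addition would perform the same scan and then look each character up in a GBK-to-Unicode table.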
Mar 13 2023
zjh <fqbqrr 163.com> writes:
On Monday, 13 March 2023 at 15:50:37 UTC, Steven Schveighoffer 
wrote:

 What is required is an addition to the `std.encoding` module, 
 to allow such an encoding.
Thank you for your information.
Mar 13 2023
Kagamin <spam here.lot> writes:
On Monday, 13 March 2023 at 00:32:07 UTC, zjh wrote:
 Thank you for your reply, but is there any way to output `gbk` 
 code to the console?
I guess if your console is in gbk encoding, you can just write bytes with stdout.write.
Mar 14 2023
zjh <fqbqrr 163.com> writes:
On Tuesday, 14 March 2023 at 09:20:54 UTC, Kagamin wrote:

 I guess if your console is in gbk encoding, you can just write 
 bytes with stdout.write.
Thank you for your reply, but that only displays bytes, not GBK text.
Mar 14 2023
Kagamin <spam here.lot> writes:
https://dlang.org/phobos/std_stdio.html#rawWrite
Mar 22 2023
zjh <fqbqrr 163.com> writes:
On Wednesday, 22 March 2023 at 15:23:42 UTC, Kagamin wrote:
 https://dlang.org/phobos/std_stdio.html#rawWrite
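It's really amazing, it succeeded. Thank you!

```d
import std.file : read;
import std.stdio : stdout;

auto b = "test.txt"; // GBK-encoded file
void[] d = read(b);
stdout.rawWrite(d); // writes the bytes to the console unchanged
```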
Mar 22 2023