digitalmars.D.learn - encoding ISO-8859-1 to UTF-8 in std.net.curl

Alexsej <lexa-skripa rambler.ru> writes:
import std.stdio;
import std.net.curl;

void main()
{

	string url = "www.site.ru/xml/api.asp";

	string data =
	"<?xml version='1.0' encoding='UTF-8'?>
		<request>
		<category>
			<id>59538</id>
		</category>
                 ...
		</request>";

	auto http = HTTP();
	http.clearRequestHeaders();
	http.addRequestHeader("Content-Type", "application/xml");
	//Accept-Charset: utf-8
	http.addRequestHeader("Accept-Charset", "utf-8");
	
	//ISO-8859-1
	//http://www.artlebedev.ru/tools/decoder/
	//ISO-8859-1 → UTF-8
	auto content = post(url, data, http); // send the XML request, not the literal string "data"
	// The response is transcoded from ISO-8859-1 to UTF-8, but I lose the Cyrillic:
	// "<?xml version='1.0' encoding='UTF-8'?>отсутствует или неверно задан параметр"
	// (Russian for "parameter is missing or incorrectly specified")
	// I get:
	// "<?xml version='1.0' encoding='UTF-8'?>отсутствует или неверно задан параметр"
	// How do I change the encoding of the response to UTF-8?


	string s = cast(immutable char[])content;
	auto f = File("output.txt","w");  // output.txt file in UTF-8;
	f.write(s);
	f.close;
}
Aug 08 2016
ag0aep6g <anonymous example.com> writes:
On 08/08/2016 09:57 PM, Alexsej wrote:
     // content in ISO-8859-1 to UTF-8 encoding but I lose
     // the Cyrillic "<?xml version='1.0' encoding='UTF-8'?>отсутствует или неверно задан параметр"
     // I get it "<?xml version='1.0' encoding='UTF-8'?>отсутствует или неверно задан параметр"
     // How do I change the encoding to UTF-8 in response


     string s = cast(immutable char[])content;
     auto f = File("output.txt","w");  // output.txt file in UTF-8;
     f.write(s);
The server doesn't include the encoding in the Content-Type header, right? So curl assumes the default, which is ISO 8859-1. It interprets the data as that and transcodes to UTF-8. The result is garbage, of course.

I don't see a way to change the default encoding. Maybe that should be added. Until then you can reverse the wrong transcoding:

----
import std.encoding: Latin1String, transcode;
Latin1String pseudo_latin1;
transcode(content.idup, pseudo_latin1);
string s = cast(string) pseudo_latin1;
----

Tiny rant: Why on earth does transcode only accept immutable characters for input? Every other post here uncovers some bug/shortcoming :(
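
For readers who want to see why reversing the transcoding works, here is a minimal sketch that reproduces the problem without a network connection. The helper names `simulateWrongTranscoding` and `undoLatin1Transcoding` are made up for this illustration and are not part of Phobos. The sketch assumes the garbled text came from treating UTF-8 bytes as ISO 8859-1 and nothing else, so every code point in it is at most U+00FF.

```d
import std.encoding : Latin1String, transcode;

/// Hypothetical helper: treats the bytes of a UTF-8 string as ISO 8859-1
/// text and transcodes that to UTF-8, which is the mistake described above.
string simulateWrongTranscoding(string utf8)
{
    string garbled;
    transcode(cast(Latin1String) utf8, garbled);
    return garbled;
}

/// Hypothetical helper: reverses the mistake. Transcoding to Latin-1 turns
/// each code point back into the single byte it came from, and the cast
/// reinterprets those bytes as the UTF-8 text they were all along.
string undoLatin1Transcoding(string garbled)
{
    Latin1String bytes;
    transcode(garbled, bytes);
    return cast(string) bytes;
}

unittest
{
    string original = "параметр";
    string garbled = simulateWrongTranscoding(original);

    // Each of the 16 original bytes is >= 0x80, so each becomes two bytes.
    assert(garbled != original);
    assert(garbled.length == 2 * original.length);

    // The round trip restores the Cyrillic text.
    assert(undoLatin1Transcoding(garbled) == original);
}
```

The round trip is lossless only because ISO 8859-1 maps all 256 byte values to code points; the same trick would not be safe if the wrong guess had been an encoding with unassigned bytes.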
Aug 08 2016
Alexsej <lexa-skripa rambler.ru> writes:
On Monday, 8 August 2016 at 21:11:26 UTC, ag0aep6g wrote:
 The server doesn't include the encoding in the Content-Type header, right? [...]
//header from server
server: nginx
date: Mon, 08 Aug 2016 22:02:15 GMT
content-type: text/xml; Charset=utf-8
content-length: 204
connection: keep-alive
vary: Accept-Encoding
cache-control: private
expires: Mon, 08 Aug 2016 22:02:15 GMT
set-cookie: ASPSESSIONIDSSCCDASA=KIAPMCMDMPEDHPBJNMGFHMEB; path=/
x-powered-by: ASP.NET
Aug 08 2016
ag0aep6g <anonymous example.com> writes:
On 08/09/2016 12:05 AM, Alexsej wrote:
 //header from server
 server: nginx
 date: Mon, 08 Aug 2016 22:02:15 GMT
 content-type: text/xml; Charset=utf-8
 content-length: 204
 connection: keep-alive
 vary: Accept-Encoding
 cache-control: private
 expires: Mon, 08 Aug 2016 22:02:15 GMT
 set-cookie: ASPSESSIONIDSSCCDASA=KIAPMCMDMPEDHPBJNMGFHMEB; path=/
 x-powered-by: ASP.NET
Looks like std.net.curl doesn't handle "Charset" correctly. It only works with lowercase "charset". https://github.com/dlang/phobos/pull/4723
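
To illustrate the kind of case-insensitive matching that would accept both "Charset" and "charset", here is a rough sketch. The function `charsetOf` is a hypothetical helper written for this example; it is not the code from the pull request and not how std.net.curl is implemented. It also ignores quoted parameter values for brevity.

```d
import std.algorithm.iteration : splitter;
import std.algorithm.searching : findSplit;
import std.string : strip;
import std.uni : icmp;

/// Hypothetical helper: returns the charset parameter of a Content-Type
/// header value, or null if there is none. The parameter name is compared
/// case-insensitively, so "Charset=utf-8" and "charset=utf-8" both match.
string charsetOf(string contentType)
{
    foreach (param; contentType.splitter(';'))
    {
        auto kv = param.findSplit("=");
        // kv[1] is empty when the segment contains no '='.
        if (kv[1].length && icmp(kv[0].strip, "charset") == 0)
            return kv[2].strip;
    }
    return null;
}

unittest
{
    assert(charsetOf("text/xml; Charset=utf-8") == "utf-8");
    assert(charsetOf("text/xml; charset=utf-8") == "utf-8");
    assert(charsetOf("text/xml").length == 0);
}
```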
Aug 08 2016
ag0aep6g <anonymous example.com> writes:
On 08/08/2016 11:11 PM, ag0aep6g wrote:
 Why on earth does transcode only accept immutable characters for input?
https://github.com/dlang/phobos/pull/4722
Aug 08 2016
Alexsej <lexa-skripa rambler.ru> writes:
On Monday, 8 August 2016 at 21:11:26 UTC, ag0aep6g wrote:
 [...] Until then you can reverse the wrong transcoding: [...]
Thanks, it works.
Aug 08 2016