digitalmars.D - print non-ASCII/UTF-8 string
- Egor Starostin <egorst gmail.com> Dec 22 2006
- Pragma <ericanderton yahoo.removeme.com> Dec 22 2006
- Egor Starostin <egorst gmail.com> Dec 22 2006
- "Jarrett Billingsley" <kb3ctd2 yahoo.com> Dec 22 2006
- Thomas Kuehne <thomas-dloop kuehne.cn> Dec 22 2006
- BCS <BCS pathilink.com> Dec 22 2006
- Bruno Medeiros <brunodomedeiros+spam com.gmail> Dec 23 2006
Let's say that the file q.txt contains some bytes greater than 0x7f (for
example, from the windows-1252 encoding).
In that case the following snippet:
***
import std.stream;
import std.stdio; // for writefln
void main() {
    Stream f = new BufferedFile("q.txt");
    foreach (char[] l; f) {
        writefln(l); // validates l as UTF-8 while formatting
    }
}
***
will fail with 'Error: 4invalid UTF-8 sequence' because D's strings are in
UTF-8, right?
My question is: is there any way to print out non-UTF-8 data in exactly the
same encoding (which may be unknown) as in the original file?
Dec 22 2006
Egor Starostin wrote:
Let's say that file q.txt contains some characters bigger than 0x7f (for example, from windows-1252 encoding). In such case the following snippet:
***
import std.stream;
import std.stdio;
void main() {
    Stream f = new BufferedFile("q.txt");
    foreach (char[] l; f) {
        writefln(l);
    }
}
***
will fail with 'Error: 4invalid UTF-8 sequence' because D's strings are in UTF-8, right? My question is: is there any way to print out non-UTF-8 data exactly in the same encoding (which may be unknown) as in original file?
It's funny that you should bring this up now. I had a thread over in d.D.learn regarding this very thing. The following should help you get started:
char[] Latin1ToUTF8(char[] value){
    char[] result;
    for(uint i=0; i<value.length; i++){
        char ch = value[i];
        if(ch < 0x80){
            // ASCII range: copied unchanged
            result ~= ch;
        }
        else{
            // 0x80..0xFF: encoded as a two-byte UTF-8 sequence
            result ~= cast(char)(0xC0 | (ch >> 6));
            result ~= cast(char)(0x80 | (ch & 0x3F));
        }
    }
    return result;
}
(this could be optimized to use fewer concatenations, but I think it gets the point across)
I have no clue how to work from other code pages, as I gather the transform would be far less straightforward than it is for Latin-1. Also, I have no idea how to *detect* what code page is being used based on the input set. I don't even know if that's possible; like you, I'd love to hear about it should someone else know of an algorithm.
--
- EricAnderton at yahoo
Dec 22 2006
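The "fewer concatenations" remark above could be realised roughly as follows. This is only a sketch, assuming a current D compiler rather than the 2006 one; the name latin1ToUtf8 and the choice of const(ubyte)[] for the input (to make it explicit that the bytes are not yet UTF-8) are illustrative assumptions, not something from the thread. It counts the output size first, so the result is allocated once.

```d
/// Hypothetical helper: converts Latin-1 (ISO-8859-1) bytes to UTF-8
/// with a single allocation instead of one append per output byte.
char[] latin1ToUtf8(const(ubyte)[] value)
{
    // First pass: every byte >= 0x80 needs two output bytes.
    size_t outLen = value.length;
    foreach (b; value)
        if (b >= 0x80)
            ++outLen;

    // Second pass: fill the preallocated buffer.
    auto result = new char[outLen];
    size_t j = 0;
    foreach (b; value)
    {
        if (b < 0x80)
        {
            result[j++] = cast(char) b;
        }
        else
        {
            result[j++] = cast(char)(0xC0 | (b >> 6));
            result[j++] = cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result;
}
```

As in the original function, this handles Latin-1 only. Windows-1252 differs from Latin-1 in the 0x80..0x9F range, so those bytes would need a lookup table instead of the bit shifts.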
My question is: is there any way to print out non-UTF-8 data exactly in the same encoding (which may be unknown) as in original file?
d.D.learn regarding this very thing. The following should help you get started: char[] Latin1ToUTF8(char[] value){
I don't need to convert to UTF-8. I just need to raw print exactly the same string as in original file.
Dec 22 2006
"Egor Starostin" <egorst gmail.com> wrote in message news:emgvkj$1ll6$1 digitaldaemon.com...
I don't need to convert to UTF-8. I just need to raw print exactly the same string as in original file.
Hm. This might be one case where printf is actually useful:
foreach(l; f)
    printf("%s\n", toStringz(l));
Dec 22 2006
Jarrett Billingsley wrote on 2006-12-22:
"Egor Starostin" <egorst gmail.com> wrote in message news:emgvkj$1ll6$1 digitaldaemon.com...
I don't need to convert to UTF-8. I just need to raw print exactly the same string as in original file.
Hm. This might be one case where printf is actually useful:
foreach(l; f)
    printf("%s\n", toStringz(l));
This should work more reliably and consume fewer resources:
printf("%.*s\n", l.length, l.ptr);
Thomas
Dec 22 2006
Thomas Kuehne wrote:
Jarrett Billingsley wrote on 2006-12-22:
"Egor Starostin" <egorst gmail.com> wrote in message news:emgvkj$1ll6$1 digitaldaemon.com...
I don't need to convert to UTF-8. I just need to raw print exactly the same string as in original file.
foreach(l; f)
    printf("%s\n", toStringz(l));
This should work more reliably and consume fewer resources:
printf("%.*s\n", l.length, l.ptr);
Thomas
This works as well, but only because the parts of the array happen to be in the correct order to begin with:
printf("%.*s\n", l);
Dec 22 2006
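For readers trying the printf route on a current compiler, a hedged sketch of the "%.*s" variant follows. The function name printRawLine is made up for illustration. The one detail added here is the cast: C's printf expects the precision argument of "%.*s" to be an int, while a D array length is a size_t, so the length is narrowed explicitly instead of relying on how the array is laid out or passed.

```d
import core.stdc.stdio : printf;

/// Hypothetical helper: prints the bytes of `line` as they are, followed
/// by a newline, and returns printf's result (the number of bytes written,
/// or a negative value on error).
int printRawLine(const(char)[] line)
{
    // "%.*s" takes an int precision and then a char pointer. Output stops
    // at the precision or at the first NUL byte, whichever comes first.
    return printf("%.*s\n", cast(int) line.length, line.ptr);
}
```

Unlike toStringz, this makes no copy of the line. Both variants still stop at an embedded NUL byte, so neither is a good fit for arbitrary binary data.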
Jarrett Billingsley wrote:
"Egor Starostin" <egorst gmail.com> wrote in message news:emgvkj$1ll6$1 digitaldaemon.com...
I don't need to convert to UTF-8. I just need to raw print exactly the same string as in original file.
Hm. This might be one case where printf is actually useful:
foreach(l; f)
    printf("%s\n", toStringz(l));
Or rather:
dout.write(cast(ubyte[]) line);
?
--
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Dec 23 2006
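The "write the bytes, not the string" idea can also be sketched without std.stream, assuming a current Phobos in which std.stdio.File takes the place of the stream classes used in this thread. The helper name copyRaw and the 4096-byte chunk size are arbitrary choices for illustration. Because the data is handled as ubyte[] from start to finish, no UTF-8 validation takes place and the output has whatever encoding the input had.

```d
import std.stdio : File, stdout;

/// Hypothetical helper: copies `src` to `dst` byte for byte.
void copyRaw(File src, File dst, size_t chunkSize = 4096)
{
    // byChunk yields ubyte[] buffers; rawWrite emits them unchanged.
    foreach (ubyte[] chunk; src.byChunk(chunkSize))
        dst.rawWrite(chunk);
}
```

For the case in the original question this would be called as copyRaw(File("q.txt", "rb"), stdout); from main. Note that it works on fixed-size chunks, not lines, which is fine as long as the goal is only to reproduce the file's bytes.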