digitalmars.D - print non-ASCII/UTF-8 string

Egor Starostin <egorst gmail.com> writes:
Let's say that file q.txt contains some characters greater than 0x7f (for
example, from the windows-1252 encoding).
In that case the following snippet:
***
import std.stdio;
import std.stream;
void main() {
  Stream f = new BufferedFile("q.txt");
  foreach (char[] l; f) {
    writefln(l);
  }
}
***
will fail with 'Error: 4invalid UTF-8 sequence' because D's strings are in
UTF-8, right?

My question is: is there any way to print out non-UTF-8 data in exactly the
same encoding (which may be unknown) as in the original file?
Dec 22 2006
Pragma <ericanderton yahoo.removeme.com> writes:
Egor Starostin wrote:
 Let's say that file q.txt contains some characters bigger than 0x7f (for
 example, from windows-1252 encoding).
 In such case the following snippet:
 ***
 import std.stdio;
 import std.stream;
 void main() {
   Stream f = new BufferedFile("q.txt");
   foreach (char[] l; f) {
     writefln(l);
   }
 }
 ***
 will fail with 'Error: 4invalid UTF-8 sequence' because D's strings are in
 UTF-8, right?
 
 My question is: is there any way to print out non-UTF-8 data exactly in the
 same encoding (which may be unknown) as in original file?

It's funny that you should bring this up now. I had a thread over in d.D.learn regarding this very thing. The following should help you get started:

char[] Latin1ToUTF8(char[] value){
    char[] result;
    for(uint i=0; i<value.length; i++){
        char ch = value[i];
        if(ch < 0x80){
            result ~= ch;
        }
        else{
            // two-byte UTF-8 sequence: 110000xx 10xxxxxx
            result ~= cast(char)(0xC0 | (ch >> 6));
            result ~= cast(char)(0x80 | (ch & 0x3F));
        }
    }
    return result;
}

(This could be optimized to use fewer concatenations, but I think it gets the point across.)

I have no clue how to work from other code pages, as I gather the transform would be far less straightforward than it is for Latin-1.

Also, I have no idea how to *detect* what code page is being used based on the input set. I don't even know if that's possible. Like you, I'd love to hear about it should someone else know of an algorithm.

-- 
- EricAnderton at yahoo
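The remark about fewer concatenations can be made concrete. What follows is only a sketch, assuming a current D compiler: `latin1ToUtf8` is a hypothetical name, and it takes `ubyte[]` rather than `char[]` because the input is not valid UTF-8 yet. It reserves the worst-case size up front so the appends should not need to reallocate repeatedly. It also assumes true ISO-8859-1 input. windows-1252 assigns different characters to the bytes 0x80 to 0x9F, so those would need a lookup table instead.

```d
// Sketch only: latin1ToUtf8 is a hypothetical helper, not a Phobos function.
// Assumes ISO-8859-1 input, where the byte value equals the Unicode code point.
char[] latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    result.reserve(input.length * 2); // worst case: two bytes per input byte

    foreach (b; input)
    {
        if (b < 0x80)
        {
            result ~= cast(char) b; // ASCII passes through unchanged
        }
        else
        {
            // U+0080..U+00FF encode as 110000xx 10xxxxxx
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result;
}
```

The result can then be handed to anything that expects UTF-8, such as writefln.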
Dec 22 2006
Egor Starostin <egorst gmail.com> writes:
 My question is: is there any way to print out non-UTF-8 data exactly in the
 same encoding (which may be unknown) as in original file?

d.D.learn regarding this very thing. The following should help you get started: char[] Latin1ToUTF8(char[] value){

I don't need to convert to UTF-8. I just need to print the raw string, exactly as it appears in the original file.
Dec 22 2006
"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Egor Starostin" <egorst gmail.com> wrote in message 
news:emgvkj$1ll6$1 digitaldaemon.com...

 I don't need to convert to UTF-8. I just need to raw print exactly the 
 same string
 as in original file.

Hm. This might be one case where printf is actually useful:

foreach(l; f)
    printf("%s\n", toStringz(l));
Dec 22 2006
Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Jarrett Billingsley schrieb am 2006-12-22:
 "Egor Starostin" <egorst gmail.com> wrote in message 
 news:emgvkj$1ll6$1 digitaldaemon.com...

 I don't need to convert to UTF-8. I just need to raw print exactly the 
 same string
 as in original file.

Hm. This might be one case where printf is actually useful: foreach(l; f) printf("%s\n", toStringz(l));

This should work more reliably and consume fewer resources:

printf("%.*s\n", l.length, l.ptr);

Thomas
Dec 22 2006
BCS <BCS pathilink.com> writes:
Thomas Kuehne wrote:
 
 Jarrett Billingsley schrieb am 2006-12-22:
 "Egor Starostin" <egorst gmail.com> wrote in message 
 news:emgvkj$1ll6$1 digitaldaemon.com...

 I don't need to convert to UTF-8. I just need to raw print exactly the 
 same string
 as in original file.

foreach(l; f) printf("%s\n", toStringz(l));

This should work more reliable and consume less resources: printf("%.*s\n", l.length, l.ptr); Thomas

This works as well, but only because the parts of the array (length, then pointer) are already in the order printf expects:

printf("%.*s\n", l);
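To spell out the difference between these variants, here is a sketch assuming a current D compiler, where printf comes from `core.stdc.stdio`; `printRaw` is a hypothetical name. Passing the length and pointer explicitly avoids the copy that toStringz makes to append a terminating zero, and it does not depend on how the array happens to be laid out in memory. The cast is there because `%.*s` expects an int, while `length` is a size_t, which is wider on 64-bit targets. printf does no UTF-8 validation, so the bytes go out as they are, though output still stops at an embedded zero byte.

```d
import core.stdc.stdio : printf;

// Sketch only: printRaw is a hypothetical helper.
// Writes the bytes of `line` plus a newline to C's stdout without decoding them.
// Returns printf's result: the number of bytes written, or a negative value on error.
int printRaw(const(char)[] line)
{
    // %.*s takes an int precision followed by a pointer; no zero terminator needed.
    return printf("%.*s\n", cast(int) line.length, line.ptr);
}
```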
Dec 22 2006
Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
Jarrett Billingsley wrote:
 "Egor Starostin" <egorst gmail.com> wrote in message 
 news:emgvkj$1ll6$1 digitaldaemon.com...
 
 I don't need to convert to UTF-8. I just need to raw print exactly the 
 same string
 as in original file.

Hm. This might be one case where printf is actually useful: foreach(l; f) printf("%s\n", toStringz(l));

Or rather, how about this?

dout.write(cast(ubyte[]) line);

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
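For readers on a current D compiler, where std.stream (and with it dout) is, as far as I know, no longer part of Phobos, the same idea can be sketched with std.stdio. This is a sketch under assumptions, not something anyone in this thread posted: `copyRaw` is a hypothetical name and the 4096-byte chunk size is arbitrary. Reading ubyte[] chunks instead of char[] lines means nothing ever tries to interpret the data as UTF-8.

```d
import std.stdio : File;

// Sketch only: copyRaw is a hypothetical helper.
// Copies everything from `input` to `output` byte for byte, with no decoding.
void copyRaw(File input, File output)
{
    // byChunk yields ubyte[] buffers; the 4096-byte size is an arbitrary choice.
    foreach (ubyte[] chunk; input.byChunk(4096))
        output.rawWrite(chunk);
}
```

Calling `copyRaw(File("q.txt", "rb"), stdout)`, with `stdout` imported from std.stdio, should then print the file unchanged, whatever its encoding is.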
Dec 23 2006