digitalmars.D - print non-ASCII/UTF-8 string

Egor Starostin <egorst gmail.com> writes:

Let's say that file q.txt contains some characters bigger than 0x7f (for
example, from windows-1252 encoding).
In such case the following snippet:
***
import std.stdio;   // writefln
import std.stream;  // Stream, BufferedFile
void main() {
  Stream f = new BufferedFile("q.txt");
  foreach (char[] l; f) {
    writefln("%s", l);
  }
}
***
will fail with 'Error: 4invalid UTF-8 sequence' because D's strings are in
UTF-8, right?

My question is: is there any way to print out non-UTF-8 data exactly in the
same encoding (which may be unknown) as in original file?
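
The failure can be reproduced without a file. The following is a minimal sketch, assuming a current D2 compiler and Phobos rather than the D1-era std.stream used above; `isValidUtf8` is a hypothetical helper name, not a library function. It shows that a windows-1252 byte such as 0xE9 does not form a valid UTF-8 sequence on its own, which is what formatted output complains about.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);   // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "caf" followed by 0xE9. In windows-1252 that byte is 'é';
    // in UTF-8 it is a lead byte with no continuation bytes.
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    writeln(isValidUtf8(cast(const(char)[]) raw));
    writeln(isValidUtf8("cafe"));
}
```

Checking validity first makes it possible to choose between formatted output and a raw byte copy.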

Dec 22 2006

Pragma <ericanderton yahoo.removeme.com> writes:

Egor Starostin wrote:
 Let's say that file q.txt contains some characters bigger than 0x7f (for
 example, from windows-1252 encoding).
 In such case the following snippet:
 ***
 import std.stdio;   // writefln
 import std.stream;  // Stream, BufferedFile
 void main() {
   Stream f = new BufferedFile("q.txt");
   foreach (char[] l; f) {
     writefln("%s", l);
   }
 }
 ***
 will fail with 'Error: 4invalid UTF-8 sequence' because D's strings are in
 UTF-8, right?
 
 My question is: is there any way to print out non-UTF-8 data exactly in the
 same encoding (which may be unknown) as in original file?

It's funny that you should bring this up now.  I had a thread over in 
d.D.learn regarding this very thing.  The following should help you get 
started:

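// Converts Latin-1 (ISO-8859-1) bytes to UTF-8.
// Note: windows-1252 differs from Latin-1 in the range 0x80-0x9F.
char[] Latin1ToUTF8(char[] value){
     char[] result;
     for(size_t i=0; i<value.length; i++){
         char ch = value[i];
         if(ch < 0x80){
             // ASCII is encoded identically in UTF-8
             result ~= ch;
         }
         else{
             // 0x80-0xFF become a two-byte sequence: 110000xx 10xxxxxx
             result ~= cast(char)(0xC0 | (ch >> 6));
             result ~= cast(char)(0x80 | (ch & 0x3F));
         }
     }
     return result;
}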

(This could be optimized to use fewer concatenations, but I think it
gets the point across.)

I have no clue how to work from other code pages; I gather the
transform would be far less straightforward than it is for Latin-1.

Also, I have no idea how to *detect* which code page is being used
based on the input. I don't even know if that's possible. Like you,
I'd love to hear about it should someone else know of an algorithm.
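
For other code pages, a table-driven conversion is one option. The following is a minimal sketch, assuming a current D2 Phobos, where std.encoding provides `Windows1252String` and `transcode`; `cp1252ToUtf8` is a hypothetical wrapper name. It does not address detection: the caller must already know that the input is windows-1252.

```d
import std.encoding : Windows1252String, transcode;
import std.stdio : writeln;

/// Hypothetical wrapper: reinterpret raw bytes as windows-1252 and
/// transcode them to a UTF-8 string.
string cp1252ToUtf8(immutable(ubyte)[] raw)
{
    string result;
    transcode(cast(Windows1252String) raw, result);
    return result;
}

void main()
{
    // "caf" followed by 0xE9 ('é' in windows-1252)
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    writeln(cp1252ToUtf8(raw));
}
```

Unlike the Latin-1 routine above, this also covers the range 0x80-0x9F, where windows-1252 assigns printable characters (0x80 is the euro sign, for example) that need three UTF-8 bytes.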

-- 
- EricAnderton at yahoo

Dec 22 2006

Egor Starostin <egorst gmail.com> writes:

 My question is: is there any way to print out non-UTF-8 data exactly in the
 same encoding (which may be unknown) as in original file?

 It's funny that you should bring this up now.  I had a thread over in
 d.D.learn regarding this very thing.  The following should help you get
 started:
 char[] Latin1ToUTF8(char[] value){

It's not my case, I think.

I don't need to convert to UTF-8. I just need to print the raw string
exactly as it is in the original file.

Dec 22 2006

"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:

"Egor Starostin" <egorst gmail.com> wrote in message 
news:emgvkj$1ll6$1 digitaldaemon.com...

 I don't need to convert to UTF-8. I just need to raw print exactly the 
 same string
 as in original file.

Hm.  This might be one case where printf is actually useful:

foreach(l; f)
    printf("%s\n", toStringz(l));

Dec 22 2006

Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Jarrett Billingsley schrieb am 2006-12-22:
 "Egor Starostin" <egorst gmail.com> wrote in message 
 news:emgvkj$1ll6$1 digitaldaemon.com...

 I don't need to convert to UTF-8. I just need to raw print exactly the 
 same string
 as in original file.

 Hm.  This might be one case where printf is actually useful:

 foreach(l; f)
     printf("%s\n", toStringz(l)); 

This should work more reliably and consume fewer resources:
      printf("%.*s\n", cast(int) l.length, l.ptr);

Thomas

Dec 22 2006

BCS <BCS pathilink.com> writes:

Thomas Kuehne wrote:
 -----BEGIN PGP SIGNED MESSAGE-----
 Hash: SHA1
 
 Jarrett Billingsley schrieb am 2006-12-22:
 "Egor Starostin" <egorst gmail.com> wrote in message 
 news:emgvkj$1ll6$1 digitaldaemon.com...

 I don't need to convert to UTF-8. I just need to raw print exactly the 
 same string
 as in original file.

 Hm.  This might be one case where printf is actually useful:

 foreach(l; f)
     printf("%s\n", toStringz(l)); 

 
 This should work more reliable and consume less resources:
       printf("%.*s\n", l.length, l.ptr);
 
 Thomas

This works as well, but only because the array's parts (length, then
pointer) are already in the correct order:

printf("%.*s\n", l);

Dec 22 2006

Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:

Jarrett Billingsley wrote:
 "Egor Starostin" <egorst gmail.com> wrote in message 
 news:emgvkj$1ll6$1 digitaldaemon.com...
 
 I don't need to convert to UTF-8. I just need to raw print exactly the 
 same string
 as in original file.

 
 Hm.  This might be one case where printf is actually useful:
 
 foreach(l; f)
     printf("%s\n", toStringz(l)); 
 
 

Or rather:
   dout.write(cast(ubyte[]) line);
?
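
The same idea, writing the bytes without interpreting them as text, can be sketched for a current D2 Phobos, which no longer offers the std.stream classes used in this thread. This sketch assumes `std.stdio.File` with `byChunk` and `rawWrite`; `copyRaw` is a hypothetical helper name, and the chunk size of 4096 is arbitrary.

```d
import std.stdio : File, stdout;

/// Hypothetical helper: copy every byte of `src` to `dst` unchanged,
/// with no decoding or UTF-8 validation.
void copyRaw(File src, File dst)
{
    foreach (ubyte[] chunk; src.byChunk(4096))
        dst.rawWrite(chunk);
}

void main()
{
    // "q.txt" is the file name from the original question;
    // opening it throws if the file does not exist.
    copyRaw(File("q.txt", "rb"), stdout);
}
```

Because the data is never treated as `char[]`, the output has the same encoding as the input, whatever that encoding is.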

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

Dec 23 2006
