digitalmars.D.learn - UTF-8 problems

Deewiant <deewiant.doesnotlike.spam gmail.com> writes:
import std.stream, std.cstream;

// åäöΔ

void main() {
	Stream file = new File(__FILE__, FileMode.In);
	// alternatively:
	//Stream file = din;

	while (!file.eof)
		dout.writef("%s", file.getc);
}
--

With the above UTF-8 code, I expect the program's source to be output, also in
UTF-8. However, I get ASCII output, and on line three appears everyone's
favourite "Error: 4invalid UTF-8 sequence".

Furthermore, unless I use the "alternative" where std.cstream.din is used, the
two line breaks after "std.cstream;" are not \r\n as they should be in the DOS
encoding I use, they are \r\r\n. Converting the line breaks to just \n causes
them to become \r\n in the output. Whence the extra \r?

What's strange is if I use e.g. readLine instead of getc, everything is fine.
Since readLine seems to use getc internally, I'm having trouble understanding
why this is the case.

A bug or two, or where am I going wrong?
Jun 12 2006
Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
I had a quick look at the std.stream sources, and it seems std.stream isn't really Unicode-aware. getc() assumes the stream is UTF-8 and returns a char, which means it returns a UTF-8 code unit, not a full character. getcw(), on the other hand, assumes the stream is UTF-16 and returns a UTF-16 code unit as a wchar. You are printing individual UTF-8 code units as characters, which triggers your error.

If D claims to have full Unicode support, std.stream ought to either have decoding routines that return a dchar, or have a UTF-decoding wrapper stream, in which case std.stream.getc() ought to return a ubyte, not a char...

/Oskar
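
To illustrate the code unit versus code point distinction described above, here is a minimal sketch. It assumes a current D compiler (D2) rather than the 2006 Phobos used in this thread, and the helper names codeUnits and codePoints are made up for the example.

```d
import std.stdio;

/// Collects the UTF-8 code units of s, one per char (no decoding).
ubyte[] codeUnits(string s)
{
    ubyte[] units;
    foreach (char c; s)          // iterating as char visits single bytes
        units ~= cast(ubyte) c;
    return units;
}

/// Collects the code points of s; foreach over dchar decodes UTF-8.
dchar[] codePoints(string s)
{
    dchar[] points;
    foreach (dchar d; s)         // iterating as dchar decodes whole sequences
        points ~= d;
    return points;
}

void main()
{
    string s = "\u00E4";         // 'ä', stored as the two bytes C3 A4

    foreach (u; codeUnits(s))
        writef("%02x ", u);      // prints c3 a4
    writeln();

    foreach (d; codePoints(s))
        writefln("U+%04X", cast(uint) d);  // prints U+00E4
}
```

Printing each element of codeUnits as if it were a character is roughly what the getc() loop in the original post does, which is why the non-ASCII line fails.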
Jun 12 2006
Deewiant <deewiant.doesnotlike.spam gmail.com> writes:
Thanks for the explanation. Unfortunately, I'm not knowledgeable enough in these matters to correct the problem.

So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I combine the former two into a single "char"?

Say I check if the char received from getc() is greater than 127 (outside ASCII) and if it is, I store it and the following char in two ubytes. Now what? How do I get a char?
Jun 12 2006
Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
You can use dchar std.utf.decode(char[],int), even if it can be quite clumsy. A hint: use std.utf.UTF8stride[c] to get the total number of bytes in the sequence that starts with the code unit c.

/Oskar
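
As a rough sketch of the decode approach suggested above: this version is written against current Phobos (D2), where std.utf offers a stride function rather than the UTF8stride table named in the post, and where decode takes the index by reference. The helper name decodeAll is hypothetical.

```d
import std.stdio;
import std.utf : decode, stride;

/// Decodes every code point in s using std.utf.decode.
dchar[] decodeAll(string s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
    {
        immutable n = stride(s, i);  // code units in the sequence starting at i
        immutable start = i;
        dchar d = decode(s, i);      // decodes one code point and advances i
        assert(i == start + n);      // decode consumed exactly one sequence
        result ~= d;
    }
    return result;
}

void main()
{
    // 'a' (61), 'ä' (C3 A4), 'Δ' (CE 94)
    foreach (d; decodeAll("a\u00E4\u0394"))
        writefln("U+%04X", cast(uint) d);
}
```

Both stride and decode throw a UTFException on malformed input, so a real reader would need to decide how to handle that case.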
Jun 12 2006
Deewiant <deewiant.doesnotlike.spam gmail.com> writes:
Thanks, that works. What I did was write a short function looking like this:

dchar myGetchar(Stream s) {
	char c = s.getc;

	// ASCII
	if (c <= 127)
		return c;
	else {
		// UTF-8
		char[] str = new char[2];
		str[0] = c;
		str[1] = s.getc;

		// dummy var, needed by decode
		size_t i = 0;
		return decode(str, i);
	}
}

Using that in place of getc() pretty much does the trick.

Unfortunately, when reading from files instead of stdin, I still run into the problem of \r\n being converted to \r\r\n. I think I know why, too: '\n' is being converted into \r\n because I'm on a Windows platform. I use the following workaround:

if (c == '\r') {
	char d = s.getc;

	if (d == '\n')
		return '\n';
	else {
		s.ungetc(d);
		return c;
	}
}
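
The newline workaround above can be tried in isolation with a small self-contained sketch. It assumes current D2, and since std.stream is not used here, CharSource is a hypothetical stand-in that offers only getc-style reading with one character of push-back. Unlike the snippet in the post, it also checks for end of input before looking ahead.

```d
import std.stdio;

/// Hypothetical stand-in for the stream in the thread: a character
/// source with one-character push-back, similar to getc/ungetc.
struct CharSource
{
    string data;
    size_t pos;

    bool eof() const { return pos >= data.length; }
    char getc() { return data[pos++]; }
    void ungetc() { --pos; }
}

/// Returns the next character, collapsing "\r\n" into a single '\n'.
char getcNormalized(ref CharSource s)
{
    char c = s.getc();
    if (c == '\r' && !s.eof)
    {
        char d = s.getc();
        if (d == '\n')
            return '\n';
        s.ungetc();          // lone '\r': put the lookahead back
    }
    return c;
}

/// Reads everything from raw through getcNormalized.
string readAllNormalized(string raw)
{
    auto src = CharSource(raw);
    string result;
    while (!src.eof)
        result ~= getcNormalized(src);
    return result;
}

void main()
{
    // Two CRLF line endings go in; the result holds two bare '\n'.
    auto text = readAllNormalized("one\r\ntwo\r\n");
    writeln(text.length);    // prints 8
}
```

With the input normalized to '\n', a text-mode output stream on Windows expands each '\n' once, giving \r\n rather than \r\r\n.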
Jun 12 2006
Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Deewiant wrote:
> Thanks, that works. What I did was write a short function looking like this:

This only works for a small subset of Unicode...

> dchar myGetchar(Stream s) {
> 	char c = s.getc;
>
> 	// ASCII
> 	if (c <= 127)
> 		return c;
> 	else {
> 		// UTF-8
> 		char[] str = new char[2];
> 		str[0] = c;
> 		str[1] = s.getc;

For a more general implementation, change the last 3 lines to:

		char[6] str;
		str[0] = c;
		int n = std.utf.UTF8stride[c];
		if (n == 0xff)
			return cast(dchar)-1; // corrupt string
		for (int i = 1; i < n; i++)
			str[i] = s.getc;

> 		// dummy var, needed by decode
> 		size_t i = 0;
> 		return decode(str, i);
> 	}
> }
>
> Using that in place of getc() pretty much does the trick.
>
> Unfortunately, when reading from files instead of stdin, I still run into the
> problem of \r\n being converted to \r\r\n. I think I know why, too: '\n' is
> being converted into \r\n because I'm on a Windows platform. I use the
> following workaround:

Yes. This is further proof that std.stream is lacking functionality. Because of this conversion, it is clear that std.stream isn't a binary stream. As such, it ought to be a UTF-8, UTF-16 or UTF-32 encoded text stream, and in that case std.stream should have a function returning a dchar, just like the code above.

/Oskar
Jun 12 2006
Deewiant <deewiant.doesnotlike.spam gmail.com> writes:
Oskar Linde wrote:
> This only works for a small subset of Unicode...

Thanks for correcting it, I was unsure myself.

> For a more general implementation, change the last 3 lines to:
>
> 		char[6] str;

6? Aren't 4 UTF-8 units enough for all of Unicode? I see that UTF8stride also has 5 or 6 as some of its elements; why is that?

> Yes. This is another proof that std.stream is lacking functionality. Because
> of this conversion, it is clear that std.stream isn't a binary stream, and as
> such, it ought to be either a utf-8, utf-16 or utf-32 encoded text stream, and
> in those cases std.stream.getc should have a function returning a dchar, just
> as the above code.

Yes, I agree wholeheartedly. It would appear that the std.stream classes are for textual input, but currently some of the methods choke on UTF-x input.

In addition to a getcd() method to complement getc() and getcw(), a getb() method returning a ubyte might also be handy, for when one really wants byte-by-byte input. Perhaps getc()'s signature should actually be changed into that, since after all that's all it currently seems to be doing.
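
On the 4-versus-6 question, which goes unanswered in the thread: the original UTF-8 definition (RFC 2279) allowed sequences of up to 6 bytes, while RFC 3629 later restricted UTF-8 to code points up to U+10FFFF, which need at most 4. A stride table with 5 and 6 entries presumably reflects the older scheme. The sketch below assumes current D2; leadLength is a hypothetical helper, and it returns 0 for bytes that cannot start a sequence, where the post above tests for 0xff.

```d
import std.stdio;
import std.utf : codeLength;

/// Sequence length implied by the bit pattern of a UTF-8 lead byte under
/// the original 1-to-6-byte scheme. Returns 0 for a continuation byte or
/// for FE/FF. It does not reject leads that are invalid for other
/// reasons (C0, C1, F5 to F7).
int leadLength(ubyte b)
{
    if (b < 0x80) return 1;   // 0xxxxxxx
    if (b < 0xC0) return 0;   // 10xxxxxx: continuation byte
    if (b < 0xE0) return 2;   // 110xxxxx
    if (b < 0xF0) return 3;   // 1110xxxx
    if (b < 0xF8) return 4;   // 11110xxx
    if (b < 0xFC) return 5;   // 111110xx: not valid in UTF-8 today
    if (b < 0xFE) return 6;   // 1111110x: not valid in UTF-8 today
    return 0;                 // FE, FF never appear in UTF-8
}

void main()
{
    writeln(leadLength(0xC3));                 // prints 2 (lead byte of 'ä')
    writeln(codeLength!char('\U0010FFFF'));    // prints 4 (highest code point)
}
```

So a 4-element buffer is enough for any valid code point; the char[6] buffer in the post only matters if the stride lookup can return 5 or 6 for malformed or legacy input.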
Jun 12 2006
Carlos Santander <csantander619 gmail.com> writes:
Deewiant wrote:
> Thanks for the explanation. Unfortunately, I'm not knowledgeable enough in
> these matters to correct the problem.
>
> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I
> combine the former two into a single "char"?
>
> Say I check if the char received from getc() is greater than 127 (outside
> ASCII) and if it is, I store it and the following char in two ubytes. Now
> what? How do I get a char?

Keep using readLine. The entire line should be made of valid UTF-8 characters.

Maybe one way to address this would be to add getUTF8char, getUTF16char and getUTF32char, which would return char[], wchar[] and dchar, respectively, the first one returning an array of 1 to 4 elements, and the second 1 or 2.

--
Carlos Santander Bernal
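
A rough idea of what the proposed getUTF8char could look like, as a sketch only. It assumes current D2 and works on an in-memory array instead of a std.stream Stream; the signature is invented for this example, since the post only names the function and its return type.

```d
import std.stdio;
import std.utf : stride;

/// Hypothetical getUTF8char: returns the 1 to 4 code units that make up
/// the code point starting at index i, and advances i past them.
/// Slicing fails with a range error if the input ends mid-sequence.
const(char)[] getUTF8char(const(char)[] s, ref size_t i)
{
    immutable n = stride(s, i);   // length of the sequence starting at i
    auto units = s[i .. i + n];
    i += n;
    return units;
}

void main()
{
    string s = "a\u00E4\u0394\U0001F600";
    size_t i = 0;
    while (i < s.length)
        writeln(getUTF8char(s, i).length);   // prints 1, 2, 2, 4 on separate lines
}
```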
Jun 12 2006
Deewiant <deewiant.doesnotlike.spam gmail.com> writes:
Carlos Santander wrote:
> Keep using readLine. The entire line should be made of valid UTF8 characters.

That would work, but I was originally using only getc() so it's easier for me to replace that than to change half of my input paradigm. <g>

> Maybe something to do about it would be to add getUTF8char, getUTF16char and
> getUTF32char, which would return char[], wchar[] and dchar, respectively, the
> first one returning an array of 1 to 4 elements, and the second 1 or 2.

Something like that would indeed be handy. It's too bad std.stream is lacking in some respects, such as this.
Jun 12 2006