digitalmars.D.learn - UTF-8 problems

Deewiant <deewiant.doesnotlike.spam gmail.com> writes:

import std.stream, std.cstream;

// åäöΔ

void main() {
	Stream file = new File(__FILE__, FileMode.In);
	// alternatively:
	//Stream file = din;

	while (!file.eof)
		dout.writef("%s", file.getc);
}
--

With the above UTF-8 code, I expect the program's source to be output, also in
UTF-8. However, I get ASCII output, and on line three appears everyone's
favourite "Error: 4invalid UTF-8 sequence".

Furthermore, unless I use the "alternative" where std.cstream.din is used, the
two line breaks after "std.cstream;" are not \r\n as they should be in the DOS
encoding I use, they are \r\r\n. Converting the line breaks to just \n causes
them to become \r\n in the output. Whence the extra \r?

What's strange is if I use e.g. readLine instead of getc, everything is fine.
Since readLine seems to use getc internally, I'm having trouble understanding
why this is the case.

A bug or two, or where am I going wrong?

Jun 12 2006

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

I had a quick look at the std.stream sources, and it seems std.stream
isn't really Unicode-aware. getc() assumes the stream is UTF-8 and
returns a char, which means it returns a single UTF-8 code unit, not a full
character. getcw(), on the other hand, assumes the stream is UTF-16 and
returns a UTF-16 code unit as a wchar.

You are printing individual UTF-8 code units as if each were a complete
character, which triggers your error.

If D claims to have full Unicode support, std.stream ought to either
have decoding routines that return a dchar, or have a UTF-decoding
wrapper stream, in which case std.stream.getc() ought to return a ubyte,
not a char...
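
The difference between code units and characters can be seen in a few lines. This is only a sketch in present-day D using Phobos `std.utf`, not the 2006 `std.stream` API discussed in this thread; the string holds the four sample characters from the original post.

```d
import std.stdio : writefln;
import std.utf : count;

void main()
{
    // The four sample characters from the original post, written as escapes
    // so the example does not depend on the source file's encoding.
    string s = "\u00E5\u00E4\u00F6\u0394"; // åäöΔ

    // .length counts UTF-8 code units (what getc() hands out one at a time).
    assert(s.length == 8);

    // std.utf.count counts decoded code points.
    assert(count(s) == 4);

    // Iterating as char yields single code units; none of these is a
    // complete character on its own.
    foreach (char unit; s)
        writefln("code unit 0x%02X", cast(ubyte) unit);

    // Iterating as dchar makes foreach decode whole code points.
    foreach (dchar cp; s)
        writefln("code point U+%04X", cast(uint) cp);
}
```

Each of the four characters takes two code units in UTF-8, so printing the units one by one produces eight fragments rather than four characters.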

/Oskar

Jun 12 2006

Deewiant <deewiant.doesnotlike.spam gmail.com> writes:

Thanks for the explanation. Unfortunately, I'm not knowledgeable enough in these
matters to correct the problem.

So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I
combine the former two into a single "char"?

Say I check if the char received from getc() is greater than 127 (outside ASCII)
and if it is, I store it and the following char in two ubytes. Now what? How do
I get a char?

Jun 12 2006

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

You can use

dchar std.utf.decode(char[],int)

even if it can be quite clumsy. A hint is to use:

std.utf.UTF8stride[c] to get the total number of bytes in the sequence
that starts with the lead byte c.
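
A short sketch of how decoding and stride fit together. It assumes present-day Phobos, where (as far as I know) the lookup is exposed as the function `std.utf.stride(str, index)` rather than the `UTF8stride` table mentioned above, and where `decode` takes the index by reference as a `size_t`.

```d
import std.utf : decode, stride;

void main()
{
    string s = "\u00E4\u0394z"; // ä (c3 a4), Δ (ce 94), z

    // stride reports how many code units the sequence at an index occupies.
    assert(stride(s, 0) == 2);
    assert(stride(s, 4) == 1);

    // decode returns the code point at `i` and advances `i` past it.
    size_t i = 0;
    assert(decode(s, i) == 0x00E4 && i == 2);
    assert(decode(s, i) == 0x0394 && i == 4);
    assert(decode(s, i) == 'z' && i == 5);
}
```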

/Oskar

Jun 12 2006

Deewiant <deewiant.doesnotlike.spam gmail.com> writes:

Oskar Linde wrote:
 Deewiant wrote:
 So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I 
 combine the former two into a single "char"?
 
 Say I check if the char received from getc() is greater than 127 (outside
 ASCII) and if it is, I store it and the following char in two ubytes. Now 
 what? How do I get a char?

 
 dchar std.utf.decode(char[],int)
 
 even if it can be quite clumsy. A hint is to use:
 
 std.utf.UTF8stride[c] to get the total number of bytes that are part of the
 starting token c.
 
 /Oskar

Thanks, that works. What I did was write a short function looking like this:

dchar myGetchar(Stream s) {
	char c = s.getc;

	// ASCII
	if (c <= 127)
		return c;
	else {
		// UTF-8
		char[] str = new char[2];
		str[0] = c;
		str[1] = s.getc;

		// dummy var, needed by decode
		size_t i = 0;
		return decode(str, i);
	}
}

Using that in place of getc() pretty much does the trick.

Unfortunately, when reading from files instead of stdin, I still run into the
problem of \r\n being converted to \r\r\n. I think I know why, too: '\n' is
being converted into \r\n because I'm on a Windows platform. I use the following
workaround:

if (c == '\r') {
	char d = s.getc;
	if (d == '\n')
		return '\n';
	else {
		s.ungetc(d);
		return c;
	}
}
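
The same workaround as a self-contained sketch. `CharSource` is a hypothetical in-memory stand-in for a Stream with one character of pushback, introduced only so the example runs without `std.stream`; it also checks for end of input before peeking, which the fragment above does not.

```d
/// Hypothetical in-memory stand-in for a Stream with one character of pushback.
struct CharSource
{
    const(char)[] data;
    size_t pos;

    bool eof() const { return pos >= data.length; }
    char getc() { return data[pos++]; }
    void ungetc() { assert(pos > 0); --pos; }
}

/// Returns the next code unit, folding "\r\n" into a single '\n'.
/// A lone '\r' is passed through unchanged.
char getcFolded(ref CharSource s)
{
    char c = s.getc();
    if (c == '\r' && !s.eof)
    {
        char d = s.getc();
        if (d == '\n')
            return '\n';
        s.ungetc(); // not a line break after all; put it back
    }
    return c;
}

void main()
{
    auto src = CharSource("a\r\nb\rc");
    char[] result;
    while (!src.eof)
        result ~= getcFolded(src);
    assert(result == "a\nb\rc");
}
```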

Jun 12 2006

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Deewiant wrote:
 Oskar Linde wrote:
 Deewiant wrote:
 So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I 
 combine the former two into a single "char"?

 Say I check if the char received from getc() is greater than 127 (outside
 ASCII) and if it is, I store it and the following char in two ubytes. Now 
 what? How do I get a char?

 dchar std.utf.decode(char[],int)

 even if it can be quite clumsy. A hint is to use:

 std.utf.UTF8stride[c] to get the total number of bytes that are part of the
 starting token c.

 /Oskar

 
 Thanks, that works. What I did was write a short function looking like this:

This only works for code points that encode to at most two bytes (up to
U+07FF), a small subset of Unicode...

 dchar myGetchar(Stream s) {
 	char c = s.getc;
 
 	// ASCII
 	if (c <= 127)
 		return c;
 	else {
 		// UTF-8
 		char[] str = new char[2];
 		str[0] = c;
 		str[1] = s.getc;

For a more general implementation, change the last 3 lines to:

		char[6] str;
		str[0] = c;
		int n = std.utf.UTF8stride[c];
		if (n == 0xff)
			return cast(dchar)-1; // corrupt string
		for (int i = 1; i < n; i++)
			str[i] = s.getc;

 
 		// dummy var, needed by decode
 		size_t i = 0;
 		return decode(str, i);
 	}
 }
 
 Using that in place of getc() pretty much does the trick.
 
 Unfortunately, when reading from files instead of stdin, I still run into the
 problem of \r\n being converted to \r\r\n. I think I know why, too: '\n' is
 being converted into \r\n because I'm on a Windows platform. I use the
 following workaround:

Yes. This is further evidence that std.stream is lacking functionality.
Because of this conversion, it is clear that std.stream isn't a binary
stream, and as such, it ought to be a UTF-8, UTF-16 or UTF-32
encoded text stream. In those cases std.stream should have a
function returning a dchar, just like the code above.
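
For reference, a sketch of such a dchar-returning read in present-day D. The `getc` delegate is a hypothetical stand-in for Stream.getc, and `sequenceLength` is a hand-written replacement for the stride table, both introduced only for this example. It uses a 4-byte buffer on the assumption that input follows the current UTF-8 definition, and returns U+FFFD instead of `cast(dchar)-1` for a bad lead byte.

```d
import std.utf : decode;

/// Length of the UTF-8 sequence introduced by lead byte `c`, or 0 if `c`
/// cannot start a sequence.
size_t sequenceLength(char c)
{
    if (c < 0x80) return 1; // 0xxxxxxx: ASCII
    if (c < 0xC0) return 0; // 10xxxxxx: continuation byte, not a lead byte
    if (c < 0xE0) return 2; // 110xxxxx
    if (c < 0xF0) return 3; // 1110xxxx
    if (c < 0xF8) return 4; // 11110xxx
    return 0;               // longer forms are not valid in current UTF-8
}

/// Reads one code point through `getc`, a hypothetical stand-in for
/// Stream.getc that yields one code unit per call.
/// Returns U+FFFD when the first byte cannot start a sequence.
dchar getDchar(scope char delegate() getc)
{
    char[4] buf;
    buf[0] = getc();
    immutable n = sequenceLength(buf[0]);
    if (n == 0)
        return '\uFFFD';
    foreach (i; 1 .. n)
        buf[i] = getc();

    const(char)[] seq = buf[0 .. n];
    size_t idx = 0;
    return decode(seq, idx); // throws UTFException on a malformed sequence
}

void main()
{
    string data = "a\u00E4\u0394\u20AC"; // 1-, 2-, 2- and 3-byte sequences
    size_t pos = 0;
    char next() { return data[pos++]; }

    assert(getDchar(&next) == 'a');
    assert(getDchar(&next) == '\u00E4');
    assert(getDchar(&next) == '\u0394');
    assert(getDchar(&next) == '\u20AC');
    assert(pos == data.length);
}
```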

/Oskar

Jun 12 2006

Deewiant <deewiant.doesnotlike.spam gmail.com> writes:

Oskar Linde wrote:
 Deewiant wrote:
 Oskar Linde wrote:
 Deewiant wrote:
 So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä".
 How do I combine the former two into a single "char"?

 Say I check if the char received from getc() is greater than 127
 (outside
 ASCII) and if it is, I store it and the following char in two
 ubytes. Now what? How do I get a char?

 dchar std.utf.decode(char[],int)

 even if it can be quite clumsy. A hint is to use:

 std.utf.UTF8stride[c] to get the total number of bytes that are part
 of the
 starting token c.

 /Oskar

 Thanks, that works. What I did was write a short function looking like
 this:

 
 This only works for a small subset of Unicode...

Thanks for correcting it, I was unsure myself.

 dchar myGetchar(Stream s) {
     char c = s.getc;

     // ASCII
     if (c <= 127)
         return c;
     else {
         // UTF-8
         char[] str = new char[2];
         str[0] = c;
         str[1] = s.getc;

 
 For a more general implementation, change the last 3 lines to:
 
         char[6] str;

6? Aren't 4 UTF-8 units enough for all of Unicode? I see that UTF8stride also
has 5 or 6 as some of its elements; why is that?

         str[0] = c;
         int n = std.utf.UTF8stride[c];
         if (n == 0xff)
             return cast(dchar)-1; // corrupt string
         for (int i = 1; i < n; i++)
             str[i] = s.getc;
 
         // dummy var, needed by decode
         size_t i = 0;
         return decode(str, i);
     }
 }

 Using that in place of getc() pretty much does the trick.

 Unfortunately, when reading from files instead of stdin, I still run
 into the
 problem of \r\n being converted to \r\r\n. I think I know why, too:
 '\n' is
 being converted into \r\n because I'm on a Windows platform. I use the
 following
 workaround:

 
 Yes. This is another proof that std.stream is lacking functionality.
 Because of this conversion, it is clear that std.stream isn't a binary
 stream, and as such, it ought to be either a utf-8, utf-16 or utf-32
 encoded text stream, and in those cases std.stream.getc should have a
 function returning a dchar, just as the above code.
 
 /Oskar

Yes, I agree wholeheartedly. It would appear that the std.stream classes are for
textual input, but currently some of the methods choke on UTF-x input.

In addition to a getcd() method to complement getc() and getcw(), a getb()
method returning a ubyte might also be handy, for when one really wants
byte-by-byte input. Perhaps getc()'s signature should actually be changed to
that, since that is all it currently seems to be doing.
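
On the earlier question of 6 versus 4 units: as far as I know, the original UTF-8 definition (RFC 2279) allowed sequences of up to 6 bytes to cover 31-bit values, while the current one (RFC 3629) stops at 4 bytes because Unicode ends at U+10FFFF. The 5 and 6 entries in the stride table presumably reflect the older form. A quick check with `std.utf.codeLength` from present-day Phobos, given only as a sketch:

```d
import std.utf : codeLength;

void main()
{
    // Number of UTF-8 code units needed for a given code point.
    assert(codeLength!char('\u007F') == 1);     // last 1-byte code point
    assert(codeLength!char('\u07FF') == 2);     // last 2-byte code point
    assert(codeLength!char('\u20AC') == 3);     // euro sign, a 3-byte code point
    assert(codeLength!char('\U0010FFFF') == 4); // highest Unicode code point
}
```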

Jun 12 2006

Carlos Santander <csantander619 gmail.com> writes:

Deewiant wrote:
 
 Thanks for the explanation. Unfortunately, I'm not knowledgeable enough in
 these matters to correct the problem.
 
 So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I
 combine the former two into a single "char"?
 
 Say I check if the char received from getc() is greater than 127 (outside
 ASCII) and if it is, I store it and the following char in two ubytes. Now
 what? How do I get a char?

Keep using readLine. The entire line should consist of complete, valid UTF-8
sequences.
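
A sketch of the line-based approach in present-day D. Here `stdin.byLine()` plays the role of readLine, and `codePoints` is a hypothetical helper written for this example. Because the newline byte never occurs inside a multi-byte sequence, a line of valid UTF-8 always ends on a sequence boundary, and a `foreach` with a dchar loop variable can decode it.

```d
import std.stdio : stdin, writeln;

/// Decodes one complete line into its code points.
dchar[] codePoints(const(char)[] line)
{
    dchar[] result;
    foreach (dchar cp; line) // a dchar loop variable makes foreach decode UTF-8
        result ~= cp;
    return result;
}

void main()
{
    // Each `line` holds whole UTF-8 sequences, so decoding it is safe
    // as long as the input itself is valid UTF-8.
    foreach (line; stdin.byLine())
        writeln(codePoints(line).length, " code points");
}
```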

One way to address this would be to add getUTF8char, getUTF16char and
getUTF32char, which would return char[], wchar[] and dchar, respectively. The
first would return an array of 1 to 4 elements, and the second an array of 1 or 2.

-- 
Carlos Santander Bernal

Jun 12 2006

Deewiant <deewiant.doesnotlike.spam gmail.com> writes:

Carlos Santander wrote:
 Deewiant wrote:
 So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I 
 combine the former two into a single "char"?
 
 Say I check if the char received from getc() is greater than 127 (outside
 ASCII) and if it is, I store it and the following char in two ubytes. Now 
 what? How do I get a char?

 
 Keep using readLine. The entire line should be made of valid UTF8 characters.

That would work, but I was originally using only getc() so it's easier for me to
replace that than to change half of my input paradigm. <g>

 Maybe something to do about it would be to add getUTF8char, getUTF16char and
 getUTF32char, which would return char[], wchar[] and dchar, respectively, the
 first one returning an array of 1 to 4 elements, and the second 1 or 2.
 

Something like that would indeed be handy. It's too bad std.stream is lacking in
some respects, such as this.

Jun 12 2006
