digitalmars.D.learn - Read non-UTF8 file

Nrgyzer <nrgyzer gmail.com> writes:

Hey guys,

I have the following source:

module filereader;

import std.file;
import std.stdio : writeln;

void main(string[] args) {
	File f = new File("myFile.ext", FileMode.In);
	while(!f.eof()) {
		writeln(convertToUTF8(f.readLine()));
	}
	f.close();
}

string convertToUTF8(char[] text) {
	string result;
	for (uint i=0; i<text.length; i++) {
		wchar ch = text[i];
		if (ch < 0x80) {
			result ~= ch;
		} else {
			result ~= 0xC0 | (ch >> 6);
			result ~= 0x80 | (ch & 0x3F);
		}
	}
	return result;
}

It compiles and works as long as the char array/string returned by
f.readLine() contains only valid UTF-8. If it contains non-UTF-8 characters,
writeln() doesn't write anything to the console. Is there any way to read
such files?

Thanks a lot!

Feb 13 2011

Stewart Gordon <smjg_1998 yahoo.com> writes:

On 13/02/2011 21:49, Nrgyzer wrote:
<snip>
 It compiles and works as long as the returned char-array/string of
 f.readLine() doesn't contain non-UTF8 character(s). If it contains such
 chars, writeln() doesn't write anything to the console. Is there any chance
 to read such files?

Please post sample input that shows the problem, and the output generated by
replacing the writeln call with

     writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine()));

so that we can see what it is actually reading in.

Stewart.

Feb 18 2011

Nrgyzer <nrgyzer gmail.com> writes:

== Quote from Stewart Gordon (smjg_1998 yahoo.com)'s article
 On 13/02/2011 21:49, Nrgyzer wrote:
 <snip>
 It compiles and works as long as the returned char-array/string of
 f.readLine() doesn't contain non-UTF8 character(s). If it contains such
 chars, writeln() doesn't write anything to the console. Is there any chance
 to read such files?

 Please post sample input that shows the problem, and the output generated by
 replacing the writeln call with
      writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine()));
 so that we can see what it is actually reading in.
 Stewart.

My file contains the following:

�
�
�

With writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the
following:

[195, 131, 164]
[195, 131, 182]
[195, 131, 188]

Feb 19 2011

spir <denis.spir gmail.com> writes:

On 02/19/2011 02:42 PM, Nrgyzer wrote:
 == Quote from Stewart Gordon (smjg_1998 yahoo.com)'s article
 On 13/02/2011 21:49, Nrgyzer wrote:
 <snip>
 It compiles and works as long as the returned char-array/string of
 f.readLine() doesn't contain non-UTF8 character(s). If it contains such
 chars, writeln() doesn't write anything to the console. Is there any chance
 to read such files?

 Please post sample input that shows the problem, and the output generated by
 replacing the writeln call with
       writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine()));
 so that we can see what it is actually reading in.
 Stewart.

 My file contains the following:

 �
 �
 �

 With writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the
 following:

 [195, 131, 164]
 [195, 131, 182]
 [195, 131, 188]

At first sight, I find your input strange. Actually, it looks like UTF-8 (195
is common when representing converted Latin text). But seeing (195, 131),
which is the code for 'Ã', three times is weird.
What is your source text, what is its encoding, and where does it come from?
Why don't you /start/ by telling us about that?
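
For reference, a small sketch of where the pair (195, 131) can come from. It appends a single code point to a D string, which stores the code point as UTF-8, and then prints the resulting bytes. `utf8BytesOf` is a hypothetical helper name used only for this illustration, not something from the original code.

```d
import std.stdio : writeln;

// Hypothetical helper: append one code point to an empty string and
// return the UTF-8 bytes that end up in it.
immutable(ubyte)[] utf8BytesOf(dchar codePoint)
{
    string s;
    s ~= codePoint; // appending a dchar to a string encodes it as UTF-8
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    writeln(utf8BytesOf('\u00C3')); // U+00C3 is 'Ã': prints [195, 131]
    writeln(utf8BytesOf('\u00E4')); // U+00E4 is 'ä': prints [195, 164]
}
```

So (195, 131) on its own is a complete encoding of 'Ã', and the third byte in each reported triple (164, 182, 188) is a continuation byte with no lead byte in front of it.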

Denis
-- 
_________________
vita es estrany
spir.wikidot.com

Feb 19 2011

Nrgyzer <nrgyzer gmail.com> writes:

== Quote from spir (denis.spir gmail.com)'s article
 On 02/19/2011 02:42 PM, Nrgyzer wrote:
 == Quote from Stewart Gordon (smjg_1998 yahoo.com)'s article
 On 13/02/2011 21:49, Nrgyzer wrote:
 <snip>
 It compiles and works as long as the returned char-array/string of
 f.readLine() doesn't contain non-UTF8 character(s). If it contains such
 chars, writeln() doesn't write anything to the console. Is there any chance
 to read such files?

 Please post sample input that shows the problem, and the output generated by
 replacing the writeln call with
       writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine()));
 so that we can see what it is actually reading in.
 Stewart.

 My file contains the following:

 �
 �
 �

 With writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the
 following:

 [195, 131, 164]
 [195, 131, 182]
 [195, 131, 188]

 At first sight, I find your input strange. Actually, it looks like UTF-8 (195
 is common when representing converted Latin text). But seeing (195, 131),
 which is the code for 'Ã', three times is weird.
 What is your source text, what is its encoding, and where does it come from?
 Why don't you /start/ by telling us about that?
 Denis

It seems that my input chars don't display correctly above. The file
contains the following bytes:

0xE4 (or 228), 0xF6 (or 246) and 0xFC (or 252)

I used Notepad to create the file and saved it with ANSI encoding. The
file is for testing purposes only.
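
As an alternative to converting by hand, here is a hedged sketch that uses `std.encoding.transcode` from Phobos. It assumes that "ANSI" in Notepad means Windows-1252 on this system, which the thread does not confirm. `ansiToUTF8` is a hypothetical helper name, and the three bytes are the ones listed above rather than the contents of a real file.

```d
import std.encoding : transcode, Windows1252String;
import std.stdio : writeln;

// Hypothetical helper: reinterpret raw bytes as Windows-1252 text
// (an assumption about what "ANSI" means here) and transcode to UTF-8.
string ansiToUTF8(immutable(ubyte)[] raw)
{
    auto ansi = cast(Windows1252String) raw;
    string utf8;
    transcode(ansi, utf8);
    return utf8;
}

void main()
{
    // The three bytes reported in the post: 0xE4, 0xF6, 0xFC.
    immutable(ubyte)[] raw = [0xE4, 0xF6, 0xFC];
    // Print the UTF-8 bytes so the result does not depend on the console code page.
    writeln(cast(immutable(ubyte)[]) ansiToUTF8(raw)); // [195, 164, 195, 182, 195, 188]
}
```

For a real file, the raw bytes could come from `std.file.read`, which returns a `void[]` that would need a cast to `immutable(ubyte)[]` first.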

Feb 20 2011

Stewart Gordon <smjg_1998 yahoo.com> writes:

What compiler version/platform are you using?  I had to fix some errors before
it would compile on mine (1.066/2.051 Windows).

On 19/02/2011 13:42, Nrgyzer wrote:
<snip>
 With writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the
 following:

 [195, 131, 164]
 [195, 131, 182]
 [195, 131, 188]

It took a while for me to make sense of what's going on!

The expressions (0xC0 | (ch >> 6)) and (0x80 | (ch & 0x3F)) both have type
int.  It appears that, in D2, if you append an int to a string then it treats
the int as a Unicode codepoint and automagically converts it to UTF-8.  But
why is it doing it on the first byte and not the second?  This looks like a
bug.

Casting each UTF-8 byte value to a char

     if (ch < 0x80) {
         result ~= cast(char) ch;
     } else {
         result ~= cast(char) (0xC0 | (ch >> 6));
         result ~= cast(char) (0x80 | (ch & 0x3F));
     }

gives the expected output

[195, 164]
[195, 182]
[195, 188]
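
A self-contained sketch of the cast-to-char version, for anyone who wants to run it without a file. `latin1ToUTF8` is a hypothetical name, and it takes `ubyte` input rather than the `char[]` of the original post. The two-byte arithmetic is only valid where each input byte equals its code point, which holds for ISO 8859-1 but not for the 0x80 to 0x9F range of Windows-1252.

```d
import std.stdio : writeln;

// Hypothetical helper: convert ISO 8859-1 (Latin-1) bytes to UTF-8.
// Each byte >= 0x80 becomes a two-byte sequence 110000xx 10xxxxxx.
string latin1ToUTF8(const(ubyte)[] bytes)
{
    string result;
    foreach (b; bytes)
    {
        if (b < 0x80)
        {
            result ~= cast(char) b; // ASCII passes through unchanged
        }
        else
        {
            // The explicit casts append single code units, not code points.
            result ~= cast(char) (0xC0 | (b >> 6));
            result ~= cast(char) (0x80 | (b & 0x3F));
        }
    }
    return result;
}

void main()
{
    immutable(ubyte)[] input = [0xE4, 0xF6, 0xFC];
    foreach (i; 0 .. input.length)
    {
        // Prints [195, 164], then [195, 182], then [195, 188].
        writeln(cast(immutable(ubyte)[]) latin1ToUTF8(input[i .. i + 1]));
    }
}
```

One possible explanation for the asymmetry, not confirmed in this thread: the value range of (0x80 | (ch & 0x3F)) always fits in a char, so the compiler may convert it implicitly, while (0xC0 | (ch >> 6)) with a wchar operand can exceed 0xFF and so is treated as a wider character and encoded.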

HTH

Stewart.

Feb 21 2011

Nrgyzer <nrgyzer gmail.com> writes:

== Quote from Stewart Gordon (smjg_1998 yahoo.com)'s article
 What compiler version/platform are you using?  I had to fix some errors
 before it would compile on mine (1.066/2.051 Windows).
 On 19/02/2011 13:42, Nrgyzer wrote:
 <snip>
 With writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the
 following:

 [195, 131, 164]
 [195, 131, 182]
 [195, 131, 188]

 It took a while for me to make sense of what's going on!
 The expressions (0xC0 | (ch >> 6)) and (0x80 | (ch & 0x3F)) both have type
 int.  It appears that, in D2, if you append an int to a string then it treats
 the int as a Unicode codepoint and automagically converts it to UTF-8.  But
 why is it doing it on the first byte and not the second?  This looks like a
 bug.
 Casting each UTF-8 byte value to a char
      if (ch < 0x80) {
          result ~= cast(char) ch;
      } else {
          result ~= cast(char) (0xC0 | (ch >> 6));
          result ~= cast(char) (0x80 | (ch & 0x3F));
      }
 gives the expected output
 [195, 164]
 [195, 182]
 [195, 188]
 HTH
 Stewart.

I wondered about that too, because I used the same code in D1 and it worked
without any problems. Anyway... thanks :)

Feb 22 2011
