
digitalmars.D.learn - UTF8 Encoding: Again

reply jicman <jicman_member pathlink.com> writes:
Greetings!

Sorry about this, but I have found a wall with UTF8, again.  Perhaps, some of
you may be able to help me jump it or break through it.

I have this code,
char[] GetMonthDigit(char[] mon)
{
    char[][char[]] sMon;
    sMon["août"] = "08";
    return sMon[mon];
}

if I call this from within the program

char[] mon = GetMonthDigit("août");

mon will have "08"; however, if it comes from another text file, say an ASCII
file, it fails.  So, if you take a look at this small piece of code,

import std.stream;

void main()
{
    char[] fn = "test.log";
    char[] txt = "août";
    File log = new File(fn, FileMode.Append);
    log.writeLine(txt);
    log.close();
}

this will create a file called test.log, if it does not exist, after compile
and run, and it will supposedly write the content of txt.  However, when you
open the test.log file, its content is,

aoÃ»t

Hmmmm... I tried setting the editor settings to UTF8, and others, however,
nothing has worked.  Any ideas how I can fix this?

thanks,

josé
Aug 09 2005
next sibling parent reply Stefan <Stefan_member pathlink.com> writes:
In article <ddb1bt$bcp$1 digitaldaemon.com>, jicman says...
 ...  However, when you
 open the test.log file, the content of it is,

 aoÃ»t

Seems correct to me: 'août' is 61 6f c3 bb 74 in UTF8, and those bytes are
displayed as 'aoÃ»t' when interpreted as ASCII/Latin-1. So I think the editor
that you use to view the file is the problem.

A lot of editors (e.g. Notepad) need a 'magic' byte order mark (BOM) at the
beginning of the file to be able to recognize a UTF8 file. The UTF8 BOM is
the byte sequence ef bb bf. Delete those bytes with a hex editor and such an
editor will give you the same garbled output.

Anyway, you have to make sure that you read/interpret your UTF8 file as UTF8,
not ASCII.

HTH,
Stefan
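Stefan's byte-level claim can be checked in any language with explicit
encodings; here is a quick sketch in Python (used here purely for
illustration, since it makes the byte values easy to print):

```python
# "août" encoded as UTF8 is five bytes: the two-byte sequence c3 bb
# encodes the single character 'û'.
data = "août".encode("utf-8")
print(data.hex(" "))                 # 61 6f c3 bb 74

# A viewer that assumes Latin-1 decodes each byte separately,
# so c3 bb shows up as two characters instead of one.
print(data.decode("latin-1"))        # aoÃ»t

# The UTF8 byte order mark that some editors look for:
print("\ufeff".encode("utf-8").hex(" "))   # ef bb bf
```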
Aug 09 2005
parent reply Stefan <Stefan_member pathlink.com> writes:
In article <ddb5u6$fmn$1 digitaldaemon.com>, Stefan says...
 ... A lot of editors (e.g. Notepad) need a 'magic' byte order mark (BOM)
 in the beginning of the file to be able to recognize an UTF8 file. For
 example, the (Notepad) BOM for UTF8 is ef bb bf.

Just realized that Notepad on XP works even without the BOM now. You can
reproduce it with WordPad, however. Any decent XML editor should be able to
display the file correctly without BOMs.

Best regards,
Stefan
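Whether a given viewer needs the BOM can be tested directly by writing the
same text twice, once with and once without the ef bb bf prefix. A small
Python sketch (file names are arbitrary; Python's "utf-8-sig" codec strips
the BOM when present):

```python
import os
import tempfile

text = "août\n"
bom = b"\xef\xbb\xbf"

d = tempfile.mkdtemp()
with open(os.path.join(d, "with_bom.log"), "wb") as f:
    f.write(bom + text.encode("utf-8"))
with open(os.path.join(d, "no_bom.log"), "wb") as f:
    f.write(text.encode("utf-8"))

# "utf-8-sig" reads both files identically, BOM or not:
for name in ("with_bom.log", "no_bom.log"):
    with open(os.path.join(d, name), encoding="utf-8-sig") as f:
        print(name, f.read().strip())    # prints "août" after each file name
```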
Aug 09 2005
parent reply jicman <jicman_member pathlink.com> writes:
Stefan says...
 Just realized that Notepad on XP works even without the BOM now. You can
 reproduce it with WordPad, however. Any decent XML editor should be able
 to display the file correctly without BOMs.

Thanks for the help, Stefan. The problem is much deeper than that. I get an
input string from a file which contains certain words, say "août", which I
need to test on. The problem is that even though I have a test in the
program, which displays correctly, the test never matches, because the input
data, "août", and the test data inside the program, "août", never match. I
guess I still don't understand UTF. :-o

I will have to rewrite this whole check function because of this.

Thanks for the help.

josé
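The failing comparison jicman describes can be reproduced at the byte level;
here is a sketch in Python (illustrative only, the original code is D): two
strings that both display as "août" never compare equal when one is raw
Latin-1 file data and the other is a UTF8 source literal.

```python
# A literal in a UTF8 source file is stored as UTF8 bytes.
program_bytes = "août".encode("utf-8")    # 61 6f c3 bb 74

# A legacy text file stores the same word as Latin-1 bytes.
file_bytes = "août".encode("latin-1")     # 61 6f fb 74

# Both render as "août" in a suitable viewer, but byte-wise they differ,
# so a naive comparison of the raw data always fails.
print(program_bytes == file_bytes)        # False

# Decoding each side with its real encoding makes the match succeed.
print(file_bytes.decode("latin-1") == program_bytes.decode("utf-8"))  # True
```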
Aug 09 2005
parent reply Stefan Zobel <Stefan_member pathlink.com> writes:
In article <ddb989$jkm$1 digitaldaemon.com>, jicman says...
 Thanks for the help, Stefan.  The problem is much deeper than that.  I get
 an input string from a file which contains certain words, say "août",
 which I need to test on.  The problem is that even though I have a test
 in the program, which displays correctly, the test never matches, because
 the input data, "août", and the test data inside the program, "août",
 never match.

 I guess I still don't understand UTF. :-o

 I will have to rewrite this whole check function because of this.

 Thanks for the help.

 josé

Are you sure you're reading/interpreting the file as UTF8 (and that it
actually is UTF8 encoded)?

Nevertheless, good luck! If there's something to learn from your
investigations, let us other D newbies know ;-)

Best regards,
Stefan

Aug 09 2005
parent reply jicman <jicman_member pathlink.com> writes:
Stefan Zobel says...
 Are you sure you're reading/interpreting the file as UTF8 (and that it
 actually is UTF8 encoded)?

Well, here is a question: do I have to change the data that I work with? For
example, I have 200+ files with text data which contain ASCII data with many
different accented characters. Do I need to change this input data to UTF8
to be able to work with it? I know that I have to save the source code files
as UTF8, but do I also have to change the other text files that I work with
to UTF8? That is my problem. I am saving the source files ok, but the input
that I read from text files is not matching the source code. Again, do I
need to change that input data to UTF8?

 Nevertheless, good luck! If there's something to learn from your
 investigations, let us other D newbies know ;-)

There is nothing to learn. ;-) All I am going to do is change any character
higher than 127 to +. :-) That's how I have been able to work with this UTF
stuff. :-)

thanks,
josé
Aug 09 2005
next sibling parent Derek Parnell <derek psych.ward> writes:
On Wed, 10 Aug 2005 02:57:12 +0000 (UTC), jicman wrote:

 Well, here is a question: do I have to change the data that I work with?
 For example, I have 200+ files with text data which contain ASCII data
 with many different accented characters. Do I need to change this input
 data to UTF8 to be able to work with it?

Technically, if it contains accented characters it is *not* ASCII. It is
some other form of character encoding. For example, my Windows XP has Code
Page 850 set for the DOS console.
( http://en.wikipedia.org/wiki/Code_page_850 )

You would need to find out which character encoding standard was used in
your file, then read the file in as a stream of *bytes*, not chars, and
convert each of the byte values into the equivalent Unicode character. You
could then use UTF8 "char[]", UTF16 "wchar[]", or UTF32 "dchar[]" as your
preferred encoding in your program.

Also have a look at http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
for further help.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
10/08/2005 1:03:43 PM
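Derek's recipe (read bytes, then map each byte value to its Unicode
equivalent) can be sketched in Python, which ships codecs for these legacy
code pages; cp850 is just one possible source encoding, as he notes:

```python
# "août" as code page 850 bytes: û is 0x96 in cp850.
raw = bytes([0x61, 0x6F, 0x96, 0x74])

# Convert each byte value to the equivalent Unicode character.
text = raw.decode("cp850")
print(text)                              # août

# Decoding the same bytes as UTF8 would fail: 0x96 is not valid UTF8 here,
# which is exactly why the encoding must be identified first.
```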
Aug 09 2005
prev sibling parent reply "Regan Heath" <regan netwin.co.nz> writes:

On Wed, 10 Aug 2005 02:57:12 +0000 (UTC), jicman  
<jicman_member pathlink.com> wrote:
 Stefan Zobel says...
 Are you sure you're reading/interpreting the file as UTF8 (and that it
 actually is UTF8 encoded)?

 Well, here is a question: do I have to change the data that I work with?
 For example, I have 200+ files with text data which contain ASCII data
 with many different accented characters. Do I need to change this input
 data to UTF8 to be able to work with it?

Yes and No.

As Derek said, if it has characters above 127 it's not ascii, see:
http://www.columbia.edu/kermit/csettables.html

I suspect your data is "Microsoft Windows Code Page 1252" or "ISO 8859-1
Latin Alphabet 1", which are very similar. To figure it out, open the text
file in a binary editor, check the value of an accented character, and
compare it to the tables in the links above.

You can read and write these non-UTF characters into a char[] etc., provided
you don't use writef or any other routine that actually checks whether the
characters are valid UTF; i.e. writeString works, but writef will give an
exception.

If you want to compare the data to a static string, you'll need to convert
the data to UTF. I have a small module which will convert Windows code page
1252 into UTF8, 16, and 32 and back again (tho the back again is totally
untested). I needed it for much the same thing as you do. This code is
public domain.

Regan

Attachment: cp1252.d

module cp1252;

import std.utf;

char[] cp1252toUTF8(ubyte[] raw)
{
    return toUTF8(cp1252toUTF16(raw));
}

wchar[] cp1252toUTF16(ubyte[] raw)
{
    wchar[] result;
    result.length = raw.length;
    foreach(int i, ubyte b; raw)
    {
        // Bytes below 0x80 are plain ASCII; 0x80 and above (including
        // the euro sign at 0x80) go through the table.
        if (b < 0x80)
            result[i] = b;
        else
            result[i] = table[b];
    }
    return result;
}

dchar[] cp1252toUTF32(ubyte[] raw)
{
    return toUTF32(cp1252toUTF16(raw));
}

ushort[] table =
[
    0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007,
    0x0008, 0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x000E, 0x000F,
    0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017,
    0x0018, 0x0019, 0x001A, 0x001B, 0x001C, 0x001D, 0x001E, 0x001F,
    0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027,
    0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
    0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
    0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
    0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047,
    0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
    0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
    0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
    0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067,
    0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
    0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077,
    0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x007F,
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178,
    0x00A0, 0x00A1, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7,
    0x00A8, 0x00A9, 0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
    0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7,
    0x00B8, 0x00B9, 0x00BA, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x00BF,
    0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x00C7,
    0x00C8, 0x00C9, 0x00CA, 0x00CB, 0x00CC, 0x00CD, 0x00CE, 0x00CF,
    0x00D0, 0x00D1, 0x00D2, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x00D7,
    0x00D8, 0x00D9, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
    0x00E0, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x00E7,
    0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x00EC, 0x00ED, 0x00EE, 0x00EF,
    0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x00F7,
    0x00F8, 0x00F9, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x00FF
];

/* UNTESTED! */
ubyte[] UTF16toCP1252(char[] raw)
{
    return UTF16toCP1252(toUTF16(raw));
}

ubyte[] UTF16toCP1252(wchar[] raw)
{
    ubyte[] result;
    result.length = raw.length;
    foreach(int i, wchar c; raw)
    {
        foreach(int j, ushort s; table)
        {
            if (c == s)
            {
                // The cp1252 byte is the table *index* j, not the
                // Unicode value stored at table[j].
                result[i] = cast(ubyte) j;
                goto found;
            }
        }
        throw new Exception("Data cannot be encoded");
    found: ;
    }
    return result;
}

ubyte[] UTF16toCP1252(dchar[] raw)
{
    return UTF16toCP1252(toUTF16(raw));
}
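The same table-driven idea can be sanity-checked in a few lines of Python,
whose built-in cp1252 codec plays the role of Regan's lookup table (this is
only a cross-check, not part of the D module):

```python
# Bytes below 0x80 are plain ASCII; 0x80-0x9F is where cp1252 and
# Latin-1 disagree (cp1252 puts printable characters like the euro
# sign and curly quotes there).
raw = bytes([0x80, 0x93, 0xFB])          # cp1252 for the euro sign, a
                                         # left curly quote, and û
print(raw.decode("cp1252"))              # €“û

# Round-trip back to the original bytes:
print(raw.decode("cp1252").encode("cp1252") == raw)   # True
```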
Aug 09 2005
parent jicman <jicman_member pathlink.com> writes:
WOW!  I thought this subject was all done with, and then I hit the
newsgroup again and BOOM, all of these nice responses. :-)  Thanks, folks.

jic

Aug 10 2005
prev sibling parent reply Carlos Santander <csantander619 gmail.com> writes:
jicman wrote:
 Hmmmm... I tried setting the editor settings to UTF8, and others,
 however, nothing has worked.  Any ideas how I can fix this?

It must be something with the editor or with the console, because I just
tried with gdc-0.13 on Mac and it worked ok.

-- 
Carlos Santander Bernal
Aug 09 2005
parent Carlos Santander <csantander619 gmail.com> writes:
Carlos Santander wrote:
 
 It must be something with the editor or with the console, because I just
 tried with gdc-0.13 on Mac and it worked ok.
 

That should've been 0.15

-- 
Carlos Santander Bernal
Aug 10 2005