
digitalmars.D.learn - UTF8 Encoding: Again

reply jicman <jicman_member pathlink.com> writes:
Greetings!

Sorry about this, but I have found a wall with UTF8, again.  Perhaps, some of
you may be able to help me jump it or break through it.

I have this code,
char[] GetMonthDigit(char[] mon)
{
    char[][char[]] sMon;
    sMon["août"] = "08";
    return sMon[mon];
}

if I call this from within the program

char[] mon = GetMonthDigit("août");

mon will have "08"; however, if it comes from another text file, say an ASCII
file, it fails.  So, if you take a look at this small piece of code,

import std.stream;

void main()
{
    char[] fn = "test.log";
    char[] txt = "août";
    File log = new File(fn, FileMode.Append);
    log.writeLine(txt);
    log.close();
}

this will create a file called test.log, if it does not exist, after compile
and run, and it will supposedly write the content of txt.  However, when you
open the test.log file, its content is,

aoÃ»t

Hmmmm... I tried setting the editor settings to UTF8, and others, however,
nothing has worked.  Any ideas how I can fix this?

thanks,

josé
Aug 09 2005
next sibling parent reply Stefan <Stefan_member pathlink.com> writes:
In article <ddb1bt$bcp$1 digitaldaemon.com>, jicman says...
 ...  However, when you
 open the test.log file, the content of it is,

 aoÃ»t

Seems correct to me: 'août' is 61 6f c3 bb 74 in UTF8, and those bytes are
displayed as 'aoÃ»t' when interpreted as ASCII/Latin-1. So I think the editor
that you use to view the file is the problem.

A lot of editors (e.g. Notepad) need a 'magic' byte order mark (BOM) at the
beginning of the file to be able to recognize a UTF8 file. The UTF8 BOM is
the byte sequence ef bb bf. Delete those bytes with a hex editor and such an
editor will give you the same garbled output.

Anyway, you have to make sure that you read/interpret your UTF8 file as UTF8,
not ASCII.

HTH,
Stefan
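Stefan's byte-level claim can be checked in any language with explicit
encodings; here is a quick sketch in Python (used here purely for
illustration, since it makes the byte values easy to print):

```python
# "août" encoded as UTF8 is five bytes: the two-byte sequence c3 bb
# encodes the single character 'û'.
data = "août".encode("utf-8")
print(data.hex(" "))                 # 61 6f c3 bb 74

# A viewer that assumes Latin-1 decodes each byte separately,
# so c3 bb shows up as two characters instead of one.
print(data.decode("latin-1"))        # aoÃ»t

# The UTF8 byte order mark that some editors look for:
print("\ufeff".encode("utf-8").hex(" "))   # ef bb bf
```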
Aug 09 2005
parent reply Stefan <Stefan_member pathlink.com> writes:
In article <ddb5u6$fmn$1 digitaldaemon.com>, Stefan says...
 ... A lot of editors (e.g. Notepad) need a 'magic' byte order mark (BOM)
 in the beginning of the file to be able to recognize an UTF8 file. For
 example, the (Notepad) BOM for UTF8 is ef bb bf.

Just realized that Notepad on XP works even without the BOM now. You can
reproduce it with WordPad, however. Any decent XML editor should be able to
display the file correctly without BOMs.

Best regards,
Stefan
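Whether a given viewer needs the BOM can be tested directly by writing the
same text twice, once with and once without the ef bb bf prefix. A small
Python sketch (file names are arbitrary; Python's "utf-8-sig" codec strips
the BOM when present):

```python
import os
import tempfile

text = "août\n"
bom = b"\xef\xbb\xbf"

d = tempfile.mkdtemp()
with open(os.path.join(d, "with_bom.log"), "wb") as f:
    f.write(bom + text.encode("utf-8"))
with open(os.path.join(d, "no_bom.log"), "wb") as f:
    f.write(text.encode("utf-8"))

# "utf-8-sig" reads both files identically, BOM or not:
for name in ("with_bom.log", "no_bom.log"):
    with open(os.path.join(d, name), encoding="utf-8-sig") as f:
        print(name, f.read().strip())    # prints "août" after each file name
```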
Aug 09 2005
parent reply jicman <jicman_member pathlink.com> writes:
Stefan says...
 Just realized that Notepad on XP works even without the BOM now. You can
 reproduce it with WordPad, however. Any decent XML editor should be able
 to display the file correctly without BOMs.

Thanks for the help, Stefan. The problem is much deeper than that. I get an
input string from a file which contains certain words, say "août", which I
need to test on. The problem is that even though I have a test in the
program, which displays correctly, the test never matches, because the input
data, "août", and the test data inside the program, "août", never match. I
guess I still don't understand UTF. :-o

I will have to rewrite this whole check function because of this.

Thanks for the help.

josé
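The failing comparison jicman describes can be reproduced at the byte level;
here is a sketch in Python (illustrative only, the original code is D): two
strings that both display as "août" never compare equal when one is raw
Latin-1 file data and the other is a UTF8 source literal.

```python
# A literal in a UTF8 source file is stored as UTF8 bytes.
program_bytes = "août".encode("utf-8")    # 61 6f c3 bb 74

# A legacy text file stores the same word as Latin-1 bytes.
file_bytes = "août".encode("latin-1")     # 61 6f fb 74

# Both render as "août" in a suitable viewer, but byte-wise they differ,
# so a naive comparison of the raw data always fails.
print(program_bytes == file_bytes)        # False

# Decoding each side with its real encoding makes the match succeed.
print(file_bytes.decode("latin-1") == program_bytes.decode("utf-8"))  # True
```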
Aug 09 2005
parent reply Stefan Zobel <Stefan_member pathlink.com> writes:
In article <ddb989$jkm$1 digitaldaemon.com>, jicman says...
 Thanks for the help, Stefan.  The problem is much deeper than that.  I get
 an input string from a file which contains certain words, say "août",
 which I need to test on.  The problem is that even though I have a test
 in the program, which displays correctly, the test never matches, because
 the input data, "août", and the test data inside the program, "août",
 never match.

 I guess I still don't understand UTF. :-o

 I will have to rewrite this whole check function because of this.

 Thanks for the help.

 josé

Are you sure you're reading/interpreting the file as UTF8 (and that it
actually is UTF8 encoded)?

Nevertheless, good luck! If there's something to learn from your
investigations, let us other D newbies know ;-)

Best regards,
Stefan

Aug 09 2005
parent reply jicman <jicman_member pathlink.com> writes:
Stefan Zobel says...
 Are you sure you're reading/interpreting the file as UTF8 (and that it
 actually is UTF8 encoded)?

Well, here is a question: do I have to change the data that I work with? For
example, I have 200+ files with text data which contain ASCII data with many
different accented characters. Do I need to change this input data to UTF8
to be able to work with it? I know that I have to save the source code files
as UTF8, but do I also have to change the other text files that I work with
to UTF8? That is my problem. I am saving the source files ok, but the input
that I read from text files is not matching the source code. Again, do I
need to change that input data to UTF8?

 Nevertheless, good luck! If there's something to learn from your
 investigations, let us other D newbies know ;-)

There is nothing to learn. ;-) All I am going to do is change any character
higher than 127 to +. :-) That's how I have been able to work with this UTF
stuff. :-)

thanks,
josé
Aug 09 2005
next sibling parent Derek Parnell <derek psych.ward> writes:
On Wed, 10 Aug 2005 02:57:12 +0000 (UTC), jicman wrote:

 Well, here is a question: do I have to change the data that I work with?
 For example, I have 200+ files with text data which contain ASCII data
 with many different accented characters. Do I need to change this input
 data to UTF8 to be able to work with it?

Technically, if it contains accented characters it is *not* ASCII. It is
some other form of character encoding. For example, my Windows XP has Code
Page 850 set for the DOS console.
( http://en.wikipedia.org/wiki/Code_page_850 )

You would need to find out which character encoding standard was used in
your file, then read the file in as a stream of *bytes*, not chars, and
convert each of the byte values into the equivalent Unicode character. You
could then use UTF8 "char[]", UTF16 "wchar[]", or UTF32 "dchar[]" as your
preferred encoding in your program.

Also have a look at http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
for further help.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
10/08/2005 1:03:43 PM
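Derek's recipe (read bytes, then map each byte value to its Unicode
equivalent) can be sketched in Python, which ships codecs for these legacy
code pages; cp850 is just one possible source encoding, as he notes:

```python
# "août" as code page 850 bytes: û is 0x96 in cp850.
raw = bytes([0x61, 0x6F, 0x96, 0x74])

# Convert each byte value to the equivalent Unicode character.
text = raw.decode("cp850")
print(text)                              # août

# Decoding the same bytes as UTF8 would fail: 0x96 is not valid UTF8 here,
# which is exactly why the encoding must be identified first.
```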
Aug 09 2005
prev sibling parent reply "Regan Heath" <regan netwin.co.nz> writes:

On Wed, 10 Aug 2005 02:57:12 +0000 (UTC), jicman  
<jicman_member pathlink.com> wrote:
 Stefan Zobel says...
 Are you sure you're reading/interpreting the file as UTF8 (and that it
 actually is UTF8 encoded)?

 Well, here is a question: do I have to change the data that I work with?
 For example, I have 200+ files with text data which contain ASCII data
 with many different accented characters. Do I need to change this input
 data to UTF8 to be able to work with it?

Yes and No.

As Derek said, if it has characters above 127 it's not ascii, see:
http://www.columbia.edu/kermit/csettables.html

I suspect your data is "Microsoft Windows Code Page 1252" or "ISO 8859-1
Latin Alphabet 1", which are very similar. To figure it out, open the text
file in a binary editor, check the value of an accented character, and
compare it to the tables in the links above.

You can read and write these non-UTF characters into a char[] etc., provided
you don't use writef or any other routine that actually checks whether the
characters are valid UTF; i.e. writeString works, but writef will give an
exception.

If you want to compare the data to a static string, you'll need to convert
the data to UTF. I have a small module which will convert Windows code page
1252 into UTF8, 16, and 32 and back again (tho the back again is totally
untested). I needed it for much the same thing as you do. This code is
public domain.

Regan

Attachment: cp1252.d

module cp1252;

import std.utf;

char[] cp1252toUTF8(ubyte[] raw)
{
    return toUTF8(cp1252toUTF16(raw));
}

wchar[] cp1252toUTF16(ubyte[] raw)
{
    wchar[] result;
    result.length = raw.length;
    foreach(int i, ubyte b; raw)
    {
        // Bytes below 0x80 are plain ASCII; 0x80 and above (including
        // the euro sign at 0x80) go through the table.
        if (b < 0x80)
            result[i] = b;
        else
            result[i] = table[b];
    }
    return result;
}

dchar[] cp1252toUTF32(ubyte[] raw)
{
    return toUTF32(cp1252toUTF16(raw));
}

ushort[] table =
[
    0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007,
    0x0008, 0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x000E, 0x000F,
    0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017,
    0x0018, 0x0019, 0x001A, 0x001B, 0x001C, 0x001D, 0x001E, 0x001F,
    0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027,
    0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
    0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
    0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
    0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047,
    0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
    0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
    0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
    0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067,
    0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
    0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077,
    0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x007F,
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178,
    0x00A0, 0x00A1, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7,
    0x00A8, 0x00A9, 0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
    0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7,
    0x00B8, 0x00B9, 0x00BA, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x00BF,
    0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x00C7,
    0x00C8, 0x00C9, 0x00CA, 0x00CB, 0x00CC, 0x00CD, 0x00CE, 0x00CF,
    0x00D0, 0x00D1, 0x00D2, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x00D7,
    0x00D8, 0x00D9, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
    0x00E0, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x00E7,
    0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x00EC, 0x00ED, 0x00EE, 0x00EF,
    0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x00F7,
    0x00F8, 0x00F9, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x00FF
];

/* UNTESTED! */
ubyte[] UTF16toCP1252(char[] raw)
{
    return UTF16toCP1252(toUTF16(raw));
}

ubyte[] UTF16toCP1252(wchar[] raw)
{
    ubyte[] result;
    result.length = raw.length;
    foreach(int i, wchar c; raw)
    {
        foreach(int j, ushort s; table)
        {
            if (c == s)
            {
                // The cp1252 byte is the table *index* j, not the
                // Unicode value stored at table[j].
                result[i] = cast(ubyte) j;
                goto found;
            }
        }
        throw new Exception("Data cannot be encoded");
    found: ;
    }
    return result;
}

ubyte[] UTF16toCP1252(dchar[] raw)
{
    return UTF16toCP1252(toUTF16(raw));
}
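The same table-driven idea can be sanity-checked in a few lines of Python,
whose built-in cp1252 codec plays the role of Regan's lookup table (this is
only a cross-check, not part of the D module):

```python
# Bytes below 0x80 are plain ASCII; 0x80-0x9F is where cp1252 and
# Latin-1 disagree (cp1252 puts printable characters like the euro
# sign and curly quotes there).
raw = bytes([0x80, 0x93, 0xFB])          # cp1252 for the euro sign, a
                                         # left curly quote, and û
print(raw.decode("cp1252"))              # €“û

# Round-trip back to the original bytes:
print(raw.decode("cp1252").encode("cp1252") == raw)   # True
```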
Aug 09 2005
parent jicman <jicman_member pathlink.com> writes:
WOW!  I thought this subject was all done with, and then I hit the
newsgroup again and BOOM, all of these nice responses. :-)  Thanks, folks.

jic

Aug 10 2005
prev sibling parent reply Carlos Santander <csantander619 gmail.com> writes:
jicman wrote:
 Hmmmm... I tried setting the editor settings to UTF8, and others,
 however, nothing has worked.  Any ideas how I can fix this?

It must be something with the editor or with the console, because I just
tried with gdc-0.13 on Mac and it worked ok.

-- 
Carlos Santander Bernal
Aug 09 2005
parent Carlos Santander <csantander619 gmail.com> writes:
Carlos Santander wrote:
 
 It must be something with the editor or with the console, because I just
 tried with gdc-0.13 on Mac and it worked ok.
 

That should've been 0.15

-- 
Carlos Santander Bernal
Aug 10 2005