digitalmars.D.learn - ASCII to UTF8 Conversion - is this right?

Pragma <ericanderton yahoo.removeme.com> writes:

Here's something that came up recently.  As some of you may already 
know, I've been doing some work with forum data recently.

I wanted to move some old forum data, which was stored in ASCII, over to 
UTF-8 via D.  The problem is that some of the data has characters in the 
0x80-0xff range, which causes UTF-BOM detection to fail.

So I rolled the following function to 'transcode' these characters:

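import std.stdio;

char[] ASCII2UTF8(char[] value){
	char[] result;
	for(uint i=0; i<value.length; i++){
		char ch = value[i];
		if(ch < 0x80){
			// 7-bit characters are copied through unchanged
			result ~= ch;
		}
		else{
			// bytes >= 0x80 become a two-byte sequence: 110000xx 10xxxxxx
			result ~= cast(char)(0xC0 | (ch >> 6));
			result ~= cast(char)(0x80 | (ch & 0x3F));

			debug writefln("converted: %0.2X to %0.2X %0.2X", ch, result[$-2], result[$-1]);
		}
	}
	return result;
}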

So my question is, while this conversion is done against a literal 
interpretation of the UTF-8 spec: is this the correct way to treat these 
characters?

Should I be taking user locale into account?  Are high-ASCII chars 
considered to be universal?

-- 
- EricAnderton at yahoo

Dec 18 2006

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Pragma wrote:
 Here's something that came up recently.  As some of you may already 
 know, I've been doing some work with forum data recently.
 
 I wanted to move some old forum data, which was stored in ASCII over to 
 UTF8 via D.  The problem is that some of the data has characters in the 
 0x80-0xff range, which causes UTF-BOM detection to fail.
 
 So I rolled the following function to 'transcode' these characters:
 
 char[] ASCII2UTF8(char[] value){
     char[] result;
     for(uint i=0; i<value.length; i++){
         char ch = value[i];
         if(ch < 0x80){
             result ~= ch;
         }
         else{
             result ~= 0xC0  | (ch >> 6);
             result ~= 0x80  | (ch & 0x3F);
            
             debug writefln("converted: %0.2X to %0.2X %0.2X",ch, result[$-2], result[$-1]);
         }
     }
     return result;
 }
 
 So my question is, while this conversion is done against a literal 
 interpretation of the UTF-8 spec: is this the correct way to treat these 
 characters?

First, ASCII is a 7-bit encoding that only defines characters <= 0x7f. 
The encoding of the upper 128 byte values is locale dependent and cannot be 
called "ASCII". There are numerous different encodings used for the 
upper 128 code points.

The above is correct if the source text is in Latin1 (ISO-8859-1) 
coding. This is probably the most common single byte encoding for 
Western Europe and the US. The Windows English standard charset 1252 
matches Latin1 except in the range 0x80-0x9f, where it defines printable 
characters in place of Latin1's control codes.
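
To make the 0x80-0x9f difference concrete, here is a rough sketch of a Windows-1252 decoder in D. The names windows1252ToUTF8 and cp1252High are made up for illustration, and replacing the five undefined byte values with U+FFFD is an assumption rather than anything the code page requires. It takes ubyte[] input because raw single-byte data is not valid UTF-8 and so does not really belong in a char[].

```d
// Windows-1252 code points for bytes 0x80-0x9F. A value of 0 marks the five
// positions (0x81, 0x8D, 0x8F, 0x90, 0x9D) that the code page leaves undefined.
immutable dchar[32] cp1252High = [
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178,
];

// Hypothetical helper: decode Windows-1252 bytes into a UTF-8 string.
string windows1252ToUTF8(const(ubyte)[] input)
{
    char[] result;
    foreach (b; input)
    {
        // Outside 0x80-0x9F the byte value is the code point, as in Latin1.
        dchar c = cast(dchar) b;
        if (b >= 0x80 && b <= 0x9F)
        {
            c = cp1252High[b - 0x80];
            if (c == 0)
                c = 0xFFFD; // assumption: substitute U+FFFD for undefined bytes
        }
        result ~= c; // appending a dchar to a char[] encodes it as UTF-8
    }
    return result.idup;
}
```

With this, the byte 0x80 comes out as the euro sign, whereas the Latin1 treatment would turn it into the control character U+0080.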

 Should I be taking user locale into account?  Are high-ASCII chars 
 considered to be universal?

Rename the function Latin12UTF8 and you have something that behaves 
correctly according to spec. :)

Best regards,

/Oskar

Dec 18 2006

Jari-Matti Mäkelä <jmjmak utu.fi.invalid> writes:

Oskar Linde wrote:
 The above is correct if the source text is in Latin1 (ISO-8859-1)
 coding. This is probably the most common single byte encoding for
 Western Europe and the US. The windows english standard charset 1252 is
 a superset of latin1 and defines the range 0x80-0x9f differently.

Some European sites/users also use ISO-8859-15. I think it might have
the euro (€) sign and some minor other differences too.
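
For reference, ISO-8859-15 differs from ISO-8859-1 in eight byte positions, so a decoder only needs to special-case those. A minimal sketch in D, where latin9ToUTF8 is a made-up name and not a library function:

```d
// Hypothetical helper: decode ISO-8859-15 (Latin-9) bytes into UTF-8.
string latin9ToUTF8(const(ubyte)[] input)
{
    char[] result;
    foreach (b; input)
    {
        dchar c;
        switch (b)
        {
            // The eight positions where ISO-8859-15 departs from ISO-8859-1.
            case 0xA4: c = 0x20AC; break; // euro sign
            case 0xA6: c = 0x0160; break; // S with caron
            case 0xA8: c = 0x0161; break; // s with caron
            case 0xB4: c = 0x017D; break; // Z with caron
            case 0xB8: c = 0x017E; break; // z with caron
            case 0xBC: c = 0x0152; break; // OE ligature
            case 0xBD: c = 0x0153; break; // oe ligature
            case 0xBE: c = 0x0178; break; // Y with diaeresis
            // Everywhere else the byte value is the code point.
            default:   c = cast(dchar) b; break;
        }
        result ~= c; // appending a dchar to a char[] encodes it as UTF-8
    }
    return result.idup;
}
```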

Dec 18 2006

Pragma <ericanderton yahoo.removeme.com> writes:

Jari-Matti Mäkelä wrote:
 Oskar Linde wrote:
 The above is correct if the source text is in Latin1 (ISO-8859-1)
 coding. This is probably the most common single byte encoding for
 Western Europe and the US. The windows english standard charset 1252 is
 a superset of latin1 and defines the range 0x80-0x9f differently.

 
 Some European sites/users also use ISO-8859-15. I think it might have
 the euro (€) sign and some minor other differences too.

Ah.  Good to know.  I'll take that into consideration as well.

Thanks!

-- 
- EricAnderton at yahoo

Dec 18 2006

Pragma <ericanderton yahoo.removeme.com> writes:

Oskar Linde wrote:
 Pragma wrote:
 Here's something that came up recently.  As some of you may already 
 know, I've been doing some work with forum data recently.

 I wanted to move some old forum data, which was stored in ASCII over 
 to UTF8 via D.  The problem is that some of the data has characters in 
 the 0x80-0xff range, which causes UTF-BOM detection to fail.

 So I rolled the following function to 'transcode' these characters:

 char[] ASCII2UTF8(char[] value){
     char[] result;
     for(uint i=0; i<value.length; i++){
         char ch = value[i];
         if(ch < 0x80){
             result ~= ch;
         }
         else{
             result ~= 0xC0  | (ch >> 6);
             result ~= 0x80  | (ch & 0x3F);
             debug writefln("converted: %0.2X to %0.2X %0.2X",ch, result[$-2], result[$-1]);
         }
     }
     return result;
 }

 So my question is, while this conversion is done against a literal 
 interpretation of the UTF-8 spec: is this the correct way to treat 
 these characters?

 
 First, ASCII is a 7 bit encoding that only defines characters <= 0x7f. 
 The encoding of the upper 128 bytes are locale dependent and can not be 
 called "ASCII". There are numerous different encodings used for the 
 upper 128 code points.

Precisely the reason why I posted this. :) The 'ASCII2UTF8' name was 
taken for lack of a better title.  Admittedly, it's a misnomer. Same 
goes for my use of "high-ASCII".

 
 The above is correct if the source text is in Latin1 (ISO-8859-1) 
 coding. This is probably the most common single byte encoding for 
 Western Europe and the US. The windows english standard charset 1252 is 
 a superset of latin1 and defines the range 0x80-0x9f differently.
 
 Should I be taking user locale into account?  Are high-ASCII chars 
 considered to be universal?

 
 Rename the function Latin12UTF8 and you have something that behaves 
 correctly according to spec. :)

Makes sense to me.  If I can't find a way to determine what codepage 
users are using in the forum for non-Latin1 posts, I'll just try Latin-1 
and see what happens.
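
One possible shape for that fallback, as a sketch only: keep the data as-is when it already validates as UTF-8, and otherwise treat every byte as a Latin-1 code point. The name toUTF8OrLatin1 is hypothetical, and the "valid UTF-8 means it really is UTF-8" heuristic is an assumption that can misfire on short inputs. std.utf.validate and UTFException are from Phobos.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: return the input unchanged if it is already valid
// UTF-8, otherwise reinterpret each byte as a Latin-1 code point.
string toUTF8OrLatin1(const(ubyte)[] raw)
{
    auto text = cast(const(char)[]) raw;
    try
    {
        validate(text); // throws UTFException on malformed UTF-8
        return text.idup;
    }
    catch (UTFException)
    {
        char[] result;
        foreach (b; raw)
            result ~= cast(dchar) b; // encodes U+0000..U+00FF as UTF-8
        return result.idup;
    }
}
```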

Thanks!

-- 
- EricAnderton at yahoo

Dec 18 2006

Georg Wrede <georg nospam.org> writes:

Pragma wrote:
 Oskar Linde wrote:
 
 Pragma wrote:

 Here's something that came up recently.  As some of you may already 
 know, I've been doing some work with forum data recently.

 I wanted to move some old forum data, which was stored in ASCII over 
 to UTF8 via D.  The problem is that some of the data has characters 
 in the 0x80-0xff range, which causes UTF-BOM detection to fail.

 So I rolled the following function to 'transcode' these characters:

 char[] ASCII2UTF8(char[] value){
     char[] result;
     for(uint i=0; i<value.length; i++){
         char ch = value[i];
         if(ch < 0x80){
             result ~= ch;
         }
         else{
             result ~= 0xC0  | (ch >> 6);
             result ~= 0x80  | (ch & 0x3F);
             debug writefln("converted: %0.2X to %0.2X %0.2X",ch, result[$-2], result[$-1]);
         }
     }
     return result;
 }

 So my question is, while this conversion is done against a literal 
 interpretation of the UTF-8 spec: is this the correct way to treat 
 these characters?


 First, ASCII is a 7 bit encoding that only defines characters <= 0x7f. 
 The encoding of the upper 128 bytes are locale dependent and can not 
 be called "ASCII". There are numerous different encodings used for the 
 upper 128 code points.

 
 
 Precisely the reason why I posted this. :) The 'ASCII2UTF8' name was 
 taken for lack of a better title.  Admittedly, it's a misnomer. Same 
 goes for my use of "high-ASCII".
 
 The above is correct if the source text is in Latin1 (ISO-8859-1) 
 coding. This is probably the most common single byte encoding for 
 Western Europe and the US. The windows english standard charset 1252 
 is a superset of latin1 and defines the range 0x80-0x9f differently.

 Should I be taking user locale into account?  Are high-ASCII chars 
 considered to be universal?


 Rename the function Latin12UTF8 and you have something that behaves 
 correctly according to spec. :)

 
 
 Makes sense to me.  If I can't find a way to determine what codepage 
 users are using in the forum for non-Latin1 posts, I'll just try Latin-1 
 and see what happens.

You might also want to look at the message headers:

Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit

The Content-Type header in particular often tells you directly what the 
encoding is.

Dec 19 2006
