digitalmars.D - ASCII to UTF conversion?

"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
Maybe I missed something in the D Docs, but is there a way to convert from 
ASCII to UTF?  Sometimes problems arise when dealing with non-UTF-aware 
functions (like those in some libraries), when they return ASCII strings 
that have characters above 0x7F.  All it ends me up with is heartache and 
"4Invalid UTF-8 Sequence" exceptions.

So is there a standard function for doing this, or would I just be better 
off looping through the string and replacing any above-0x7F characters with 
underscores or something? 
Nov 28 2005
Anders F Björklund <afb algonet.se> writes:
Jarrett Billingsley wrote:

 Maybe I missed something in the D Docs, but is there a way to convert from 
 ASCII to UTF?  Sometimes problems arise when dealing with non-UTF-aware 
 functions (like those in some libraries), when they return ASCII strings 
 that have characters above 0x7F.  All it ends me up with is heartache and 
 "4Invalid UTF-8 Sequence" exceptions.

You need to find out which encoding your non-UTF functions return. Hint: it's not ASCII, as that is a 7-bit encoding compatible with UTF-8.

 So is there a standard function for doing this, or would I just be better 
 off looping through the string and replacing any above-0x7F characters with 
 underscores or something? 

There are no functions in Phobos (as far as I know), but libiconv works.
See: http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs ("8 bit enc.")

--anders
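
To make the ASCII point above concrete, here is a minimal sketch in present-day D (not the 2005 compiler this thread was written against). The helper names `isPureAscii` and `isValidUtf8` are made up for illustration; `std.utf.validate` and `UTFException` are assumed to be available from current Phobos. It shows that pure 7-bit data passes as UTF-8 unchanged, while a lone byte above 0x7F (as in Latin-1 text) generally does not.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true if every byte is in the 7-bit ASCII range.
bool isPureAscii(const(ubyte)[] data)
{
    foreach (b; data)
        if (b > 0x7F)
            return false;
    return true;
}

/// Hypothetical helper: true if the bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] data)
{
    try
    {
        // validate throws UTFException on malformed input.
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

If `isPureAscii` returns true, the buffer can be cast to a string as-is. If it returns false and `isValidUtf8` also returns false, the data is in some other 8-bit encoding and needs a real conversion.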
Nov 29 2005
Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Jarrett Billingsley wrote:
 Maybe I missed something in the D Docs, but is there a way to convert from 
 ASCII to UTF?  Sometimes problems arise when dealing with non-UTF-aware 
 functions (like those in some libraries), when they return ASCII strings 
 that have characters above 0x7F.  All it ends me up with is heartache and 
 "4Invalid UTF-8 Sequence" exceptions.
 
 So is there a standard function for doing this, or would I just be better 
 off looping through the string and replacing any above-0x7F characters with 
 underscores or something? 

ASCII to UTF-8 is simple:

# char[] ascii2utf(ubyte[] ascii) { return cast(char[]) ascii; }

But by mentioning characters above 0x7F, I assume you mean something other than ASCII... Here is a simple Latin-1 to UTF-16 converter:

# wchar[] latin12utf16(ubyte[] latin1) {
#     wchar[] ret;
#     ret.length = latin1.length;
#     foreach(int i, ubyte b; latin1)
#         ret[i] = cast(wchar) b;
#     return ret;
# }

(Disclaimer: no code is tested.)

For 8-bit character sets other than Latin-1 (ISO 8859-1) you will need a library to supply the mapping. (Unicode's lower 256 code points map 1:1 to Latin-1.)

/Oskar
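
Since the original question was about UTF-8 rather than UTF-16, here is a sketch of the same idea targeting UTF-8, written in present-day D rather than the 2005 dialect used above. The function name `latin1ToUtf8` is hypothetical. It relies only on the 1:1 mapping between Latin-1 and the first 256 code points: bytes below 0x80 are copied unchanged, and each byte from 0x80 to 0xFF becomes a two-byte UTF-8 sequence.

```d
/// Hypothetical helper: convert Latin-1 (ISO 8859-1) bytes to UTF-8.
string latin1ToUtf8(const(ubyte)[] latin1)
{
    char[] ret;
    ret.reserve(latin1.length); // at least this many bytes are needed

    foreach (b; latin1)
    {
        if (b < 0x80)
        {
            // 7-bit ASCII is already valid UTF-8.
            ret ~= cast(char) b;
        }
        else
        {
            // Code points 0x80..0xFF encode as 110000xx 10xxxxxx.
            ret ~= cast(char)(0xC0 | (b >> 6));
            ret ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return ret.idup;
}
```

For example, the Latin-1 byte 0xE9 (é) comes out as the two bytes 0xC3 0xA9. This only holds for Latin-1; any other 8-bit code page still needs a mapping table or a library such as libiconv.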
Nov 29 2005
"Walter Bright" <newshound digitalmars.com> writes:
"Jarrett Billingsley" <kb3ctd2 yahoo.com> wrote in message
news:dmgmc4$hed$1 digitaldaemon.com...
 Maybe I missed something in the D Docs, but is there a way to convert from
 ASCII to UTF?  Sometimes problems arise when dealing with non-UTF-aware
 functions (like those in some libraries), when they return ASCII strings
 that have characters above 0x7F.  All it ends me up with is heartache and
 "4Invalid UTF-8 Sequence" exceptions.

 So is there a standard function for doing this, or would I just be better
 off looping through the string and replacing any above-0x7F characters with
 underscores or something?

You can try the functions in std.charset.
Nov 29 2005
"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Jarrett Billingsley" <kb3ctd2 yahoo.com> wrote in message 
news:dmgmc4$hed$1 digitaldaemon.com...

Thanks for the replies!  Walter's suggestion is what I was looking for - 
totally missed those functions.

And yes, I suppose I meant "Latin 1."  I didn't realize that the formal 
definition of ASCII was still so strict as to mean just the characters 
between 0x0 and 0x7F; for me, characters between 0x0 and 0xFF have always 
been "ASCII."  I guess that's what happens when you only have five years of 
programming experience. 
Nov 29 2005