digitalmars.D.learn - Char & the Extended ascii set

reply Era Scarecrow <rtcvb32 yahoo.com> writes:
 Is there any support for the extended ASCII characters (128-255)? I understand Unicode is important, but I am working with some data and programs that don't support it, and my program throws an exception because the data isn't valid UTF-8. Do I have to handle it all as bytes/ubytes? If I do, I lose out on many char-specific functions. Alternatively, I can rely on the C functions, but I want to avoid using them if I can.

Example: note the raw data below, being 39 vs -110

this._ID = "SPEL_wulfharth's cups"
rhs._ID  = "SPEL_wulfharth▒s cups"

this._ID = [83, 80, 69, 76, 95, 119, 117, 108, 102, 104, 97, 114, 116, 104, 39, 115, 32, 99, 117, 112, 115, 0]
rhs._ID  = [83, 80, 69, 76, 95, 119, 117, 108, 102, 104, 97, 114, 116, 104, -110, 115, 32, 99, 117, 112, 115, 0]

I have compiled a table of the appropriate conversions to proper Unicode, which can also be used in reverse to get the text back to its previous state. However, I'm not sure.

//referenced from http://ascii-table.com/ascii-extended-pc-list.php
wchar[128] convertAsciiExtended = [
    0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,
    0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,
    0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,
    0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,
    0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,
    0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB,
    0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
    0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
    0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
    0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
    0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
    0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
    0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4,
    0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229,
    0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248,
    0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0];
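
A lookup table like this one (its entries appear to match IBM code page 437) could be applied roughly as in the sketch below. The helper name `decodeExtended` is hypothetical and not part of Phobos. It assumes the table holds the code points for bytes 128..255 in order, and that bytes below 128 are plain ASCII.

```d
// Hypothetical helper, not part of Phobos: decode single-byte text to UTF-8.
// `table` holds the code points for bytes 128..255, in order.
string decodeExtended(const(ubyte)[] raw, const(wchar)[] table)
{
    assert(table.length == 128, "table must cover bytes 128..255");
    string result;
    foreach (b; raw)
    {
        if (b < 128)
            result ~= cast(char) b;                 // ASCII maps to itself
        else
            result ~= cast(dchar) table[b - 128];   // appending a dchar encodes it as UTF-8
    }
    return result;
}

unittest
{
    // Demo table: every slot is '?' except the one entry exercised below.
    wchar[128] table;
    table[] = '?';
    table[146 - 128] = 0x00C6;   // byte 146 (-110 as a signed byte) is Æ in the table above

    const(ubyte)[] raw = [119, 117, 108, 102, 146, 115];   // "wulf", byte 146, "s"
    assert(decodeExtended(raw, table[]) == "wulf\u00C6s");
}
```

With the full table, byte 146 in the ID above would decode to Æ rather than an apostrophe, which may be a hint that the source data is not actually in this code page.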
Jan 28 2012
parent Era Scarecrow <rtcvb32 yahoo.com> writes:
 char is UTF-8 by definition, and D code is free to assume that that's the case. A lot of the string processing code in Phobos will throw if you give it ill-formed Unicode.
 Now, you can put whatever you want in a char, but don't expect other D code to handle it correctly.
 The only support in Phobos for dealing with alternate encodings is std.encoding. It currently supports "UTF-8, UTF-16, UTF-32, ASCII, ISO-8859-1 (also known as LATIN-1), and WINDOWS-1252." So, if you can get that to do the conversions that you want, then there you go, but otherwise you're on your own.
 Regardless, you need to convert your chars to proper UTF-8 if you want other D code (and especially Phobos) to handle them correctly.
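
For the std.encoding route mentioned above, a minimal sketch might look like the following. It assumes the raw bytes are WINDOWS-1252, which is a guess: byte 146 (the -110 in the dump earlier in the thread) is U+2019, the right single quotation mark, in that encoding, and that would fit the apostrophe position in the ID. The helper name `fromWindows1252` is hypothetical.

```d
import std.encoding : transcode, Windows1252String;

// Hypothetical helper: treat `raw` as Windows-1252 bytes and return UTF-8.
// The choice of Windows-1252 is an assumption about the source data.
string fromWindows1252(immutable(ubyte)[] raw)
{
    string utf8;
    // Reinterpret the bytes as Windows-1252 text, then let Phobos transcode.
    transcode(cast(Windows1252String) raw, utf8);
    return utf8;
}

unittest
{
    // "wulfharth", byte 146, "s", taken from the rhs._ID dump.
    immutable(ubyte)[] raw = [119, 117, 108, 102, 104, 97, 114, 116, 104, 146, 115];
    assert(fromWindows1252(raw) == "wulfharth\u2019s");
}
```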
Yeah. More often than not, what breaks the Unicode turns out to be duplicates and errors in the source file (which is at least 10 years old, too). Because the bad formatting is sparse and rare, I've tried writing a custom compare function that uses the Phobos code but catches the exception thrown when the UTF is malformed, then converts the text and tries the compare again. The source format doesn't mark which fields are text; that has to be inferred from context when a field is needed, so converting everything to proper Unicode up front would be wasted work 75%-95% of the time.
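
A convert-only-on-failure compare along these lines could be sketched as below. This is not the actual function from the post. The names `toValidUtf8` and `compareIds` are hypothetical, the fallback encoding (WINDOWS-1252) is an assumption, and the sketch checks each string with `std.utf.validate` up front instead of catching an exception raised by the comparison itself.

```d
import std.algorithm : cmp;
import std.encoding : transcode, Windows1252String;
import std.utf : UTFException, validate;

// Hypothetical helper: return `s` unchanged when it is valid UTF-8;
// otherwise reinterpret its bytes as Windows-1252 (an assumption) and transcode.
string toValidUtf8(string s)
{
    try
    {
        validate(s);   // throws UTFException on ill-formed UTF-8
        return s;      // common case: no conversion and no copy
    }
    catch (UTFException)
    {
        string fixed;
        transcode(cast(Windows1252String) s, fixed);
        return fixed;
    }
}

// Hypothetical compare: only malformed inputs pay the conversion cost.
int compareIds(string a, string b)
{
    return cmp(toValidUtf8(a), toValidUtf8(b));
}

unittest
{
    immutable(ubyte)[] raw = [97, 146, 115];   // 'a', byte 146, 's'
    string bad = cast(string) raw;             // not valid UTF-8
    assert(toValidUtf8("cups") == "cups");
    assert(toValidUtf8(bad) == "a\u2019s");
    assert(compareIds(bad, "a\u2019s") == 0);
}
```

Valid strings are returned as-is, so the conversion only runs on the rare malformed entries.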
Jan 28 2012