
digitalmars.D.learn - Re: Char & the Extended ascii set

 char is UTF-8 by definition, and D code is free to assume that that's
 the case. A lot of the string processing code in Phobos will throw if
 you give it ill-formed Unicode.
 
 Now, you can put whatever you want in a char, but don't expect other D
 code to handle it correctly.
 
 The only support in Phobos for dealing with alternate encodings is
 std.encoding. It currently supports "UTF-8, UTF-16, UTF-32, ASCII,
 ISO-8859-1 (also known as LATIN-1), and WINDOWS-1252." So, if you can
 get that to do the conversions that you want, then there you go, but
 otherwise you're on your own.
 
 Regardless, you need to convert your chars to proper UTF-8 if you want
 other D code (and especially Phobos) to handle them correctly.
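
To make the quoted advice concrete, here is a minimal sketch of converting with std.encoding. It assumes the non-UTF-8 bytes are ISO-8859-1, which the thread does not actually establish; the helper name latin1ToUtf8 is made up for illustration.

```d
import std.encoding : Latin1String, transcode;
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical helper: treat the raw bytes as ISO-8859-1 (an assumption
// about the input) and return the equivalent UTF-8 string.
string latin1ToUtf8(immutable(ubyte)[] raw)
{
    string result;
    transcode(cast(Latin1String) raw, result);
    return result;
}

unittest
{
    // "café" in Latin-1: 0xE9 is 'é' there, but is ill-formed as UTF-8.
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];

    // Reinterpreting the bytes as a string compiles, but validation throws.
    assertThrown!UTFException(validate(cast(string) raw));

    // After transcoding, the text is well-formed UTF-8.
    string fixed = latin1ToUtf8(raw);
    validate(fixed);
    assert(fixed == "caf\u00E9");
}
```

Something like `dmd -unittest -main -run file.d` should run the check. If the data turns out to be WINDOWS-1252 instead, std.encoding's Windows1252String would be the type to cast to.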

Yeah. More often than not, I'm finding that what breaks the Unicode is likely duplicates and errors in the source file (which is at least 10 years old, too). Because the bad formatting is sparse and rare, I've tried making a custom compare function that uses the Phobos code but catches the exception thrown when the UTF is malformed, converts the text, and then tries the compare again. The source format doesn't mark everything that is text; it has to be interpreted in context when it is needed, so converting everything to proper Unicode up front would be wasted work 75%-95% of the time.
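
A rough sketch of that catch-and-retry compare is below. It is not the code I'm actually using: the names repairLatin1 and tolerantCmp are hypothetical, the malformed bytes are assumed to be ISO-8859-1, and an explicit std.utf.validate call stands in for whichever Phobos routine throws, since not every Phobos function decodes its input.

```d
import std.algorithm : cmp;
import std.encoding : Latin1String, transcode;
import std.utf : UTFException, validate;

// Hypothetical helper: return s unchanged when it is valid UTF-8,
// otherwise reinterpret its bytes as Latin-1 (an assumption) and convert.
string repairLatin1(string s)
{
    try
    {
        validate(s);
        return s;
    }
    catch (UTFException)
    {
        string fixed;
        transcode(cast(Latin1String) s, fixed);
        return fixed;
    }
}

// Hypothetical compare: no conversion on the common path, and a
// convert-and-retry only when a UTFException is thrown.
int tolerantCmp(string a, string b)
{
    try
    {
        // validate stands in for the Phobos call that rejects bad input.
        validate(a);
        validate(b);
        return cmp(a, b);
    }
    catch (UTFException)
    {
        return cmp(repairLatin1(a), repairLatin1(b));
    }
}

unittest
{
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9]; // "café" in Latin-1
    assert(tolerantCmp(cast(string) raw, "caf\u00E9") == 0);
    assert(tolerantCmp("abc", "abd") < 0);
}
```

The conversion only allocates when a string is actually malformed, which is the point given how rarely that happens in this data.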
Jan 28 2012