digitalmars.D.learn - Re: Char & the Extended ascii set

Era Scarecrow <rtcvb32 yahoo.com> Jan 28 2012

Era Scarecrow <rtcvb32 yahoo.com> writes:

 char is UTF-8 by definition, and D code is free to assume that that's the
 case. A lot of the string processing code in Phobos will throw if you give
 it ill-formed Unicode.

 Now, you can put whatever you want in a char, but don't expect other D code
 to handle it correctly.

 The only support in Phobos for dealing with alternate encodings is
 std.encoding. It currently supports "UTF-8, UTF-16, UTF-32, ASCII,
 ISO-8859-1 (also known as LATIN-1), and WINDOWS-1252." So, if you can get
 that to do the conversions that you want, then there you go, but otherwise
 you're on your own.

 Regardless, you need to convert your chars to proper UTF-8 if you want
 other D code (and especially Phobos) to handle them correctly.
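
As a minimal sketch of the std.encoding route described above, assuming the stray bytes are really Windows-1252 (the thread does not say which legacy encoding the file uses), the conversion could look roughly like this. The helper name `windows1252ToUtf8` is hypothetical; `transcode` and `Windows1252String` are the std.encoding pieces it relies on.

```d
import std.encoding : transcode, Windows1252String;

/// Hypothetical helper: treat raw bytes as Windows-1252 text and
/// return the equivalent UTF-8 string.
string windows1252ToUtf8(const(ubyte)[] raw)
{
    // Copy the bytes, then view the copy as Windows-1252 code units.
    // Assumption: the input really is Windows-1252.
    auto legacy = cast(Windows1252String) raw.idup;

    string utf8;
    transcode(legacy, utf8); // std.encoding performs the re-encoding
    return utf8;
}
```

If the data turns out to be ISO-8859-1 instead, swapping in `Latin1String` should work the same way.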


 Yeah. More often than not, I'm finding that what breaks the Unicode handling
is likely duplicates and errors in the source file (which is at least 10 years
old, too).

 Because the badly formatted text is sparse and rarely gets in the way, I've
tried writing a custom compare function that uses the Phobos code but catches
the exception thrown when the UTF-8 is ill-formed, converts the text, and then
retries the compare. The source format doesn't mark which fields are text;
that has to be worked out from context when a field is needed, so converting
everything to proper Unicode up front would be wasted work 75%-95% of the time.
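
The catch-and-retry compare described above might be sketched as follows. This is only an illustration under stated assumptions, not the actual code from the post: the names `tolerantCmp` and `repairUtf8` are hypothetical, the fallback encoding is assumed to be ISO-8859-1, and an explicit `std.utf.validate` call stands in for whichever Phobos routine throws on ill-formed input.

```d
import std.algorithm : cmp;
import std.encoding : Latin1String, transcode;
import std.utf : UTFException, validate;

/// Hypothetical helper: return `s` untouched if it is valid UTF-8,
/// otherwise reinterpret its bytes as ISO-8859-1 (an assumption).
const(char)[] repairUtf8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on ill-formed UTF-8
        return s;
    }
    catch (UTFException)
    {
        auto legacy = cast(Latin1String) s.idup;
        string fixed;
        transcode(legacy, fixed);
        return fixed;
    }
}

/// Hypothetical compare: take the fast path when both inputs are
/// well-formed, and convert only after a failure.
int tolerantCmp(const(char)[] a, const(char)[] b)
{
    try
    {
        validate(a);
        validate(b);
        return cmp(a, b);
    }
    catch (UTFException)
    {
        // Repair each side independently so an already valid
        // string is not converted a second time.
        return cmp(repairUtf8(a), repairUtf8(b));
    }
}
```

The conversion cost is only paid for strings that actually fail validation, which matches the goal of not converting everything up front.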
