
digitalmars.D - Encodings

reply "Nathan M. Swan" <nathanmswan gmail.com> writes:
For most of the string processing I do, I read/write text in 
UTF-8 and convert it to UTF-32 for processing (with std.utf), so 
I don't have to worry about encoding. Is this a good or bad 
paradigm? Is there a better way to do this? What method do all of 
you use?

Just curious, NMS
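
[Editor's note: a minimal sketch of the paradigm described above, assuming the input file holds UTF-8 text and using std.utf.toUTF32/toUTF8 for the conversions; the file names are just placeholders.]

import std.file : readText, write;
import std.stdio : writeln;
import std.utf : toUTF32, toUTF8;

void main()
{
    string utf8 = readText("input.txt");    // file assumed to hold UTF-8 text
    dstring text = toUTF32(utf8);           // one dchar per code point

    // With UTF-32, length and indexing count code points, not UTF-8 code units:
    writeln("code points: ", text.length);
    if (text.length)
        writeln("first/last: ", text[0], " / ", text[$ - 1]);

    write("output.txt", toUTF8(text));      // back to UTF-8 on disk
}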
Apr 08 2012
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, April 08, 2012 23:36:23 Nathan M. Swan wrote:
> For most of the string processing I do, I read/write text in
> UTF-8 and convert it to UTF-32 for processing (with std.utf), so
> I don't have to worry about encoding. Is this a good or bad
> paradigm? Is there a better way to do this? What method do all of
> you use?
>
> Just curious, NMS
It depends on what you're doing. Depending on the functions that you use and 
your memory requirements, UTF-8 may be faster or UTF-32 may be faster.

UTF-32 has the advantage of being a random-access range, which will make it 
work with a number of functions that UTF-8 won't work with. But UTF-32 also 
takes considerably more memory (especially if most of your characters are 
ASCII characters), which can be a problem.

I think that the most common thing is to just operate on UTF-8 unless another 
encoding is needed (e.g. UTF-32 is required because random access is needed), 
and in plenty of cases, you end up operating on generic ranges anyway if you 
use range-based functions on strings and don't use std.array.array on them.

You're going to have to profile your code to see whether using UTF-8 or 
UTF-32 primarily in your string processing is more efficient.

- Jonathan M Davis
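
[Editor's note: a small sketch of the trade-offs described above, using only Phobos (isRandomAccessRange, filter, std.uni.isAlpha, std.array.array); the string literals are placeholders and the byte counts in the comments refer to the character data only.]

import std.algorithm : filter;
import std.array : array;
import std.range : isRandomAccessRange;
import std.stdio : writeln;
import std.uni : isAlpha;

void main()
{
    // A string (UTF-8) is iterated by decoded dchar, so the range primitives
    // see it as a bidirectional range, not a random-access one; a dstring
    // (UTF-32) really is random access.
    static assert(!isRandomAccessRange!string);
    static assert( isRandomAccessRange!dstring);

    // The memory trade-off: every code point costs 4 bytes in UTF-32, even ASCII.
    string  ascii8  = "hello";   //  5 bytes of character data
    dstring ascii32 = "hello"d;  // 20 bytes of character data

    // Range-based functions accept either encoding and hand back lazy, generic
    // ranges of dchar; std.array.array is what turns the result into an array.
    dstring text = "héllo, wörld!"d;
    auto lazyLetters = text.filter!isAlpha;       // lazy, nothing allocated yet
    auto letters     = text.filter!isAlpha.array; // dchar[] holding "héllowörld"

    writeln(letters);
}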
Apr 08 2012