
digitalmars.D - To wchar or not to wchar?

reply "John C" <johnch_atms hotmail.com> writes:
Which would be the best string type to use - char[], wchar[] or dchar[]? I 
want to choose one of them and stick with it throughout my code for the sake 
of consistency. My preference would be for wchar[] but using it is not as 
smooth as I'd hoped. For example, Object.toString() returns char[], Phobos 
seems not to have wchar versions for integer-to-string conversions, and 
concatenating sometimes requires casts. It's not too bad, I suppose: I can 
use free functions to encode/decode strings and write my own integer 
conversion routines. But I am puzzled as to why I need to cast when 
concatenating, e.g.:

    wchar[] text = cast(wchar[])"The quick brown " ~ quickAnimalStr ~ 
cast(wchar[])" jumped over the lazy " ~ lazyAnimalStr;

Anyway, I'm doing a lot of text processing on Windows XP, which uses UTF16 
natively, so it seemed sane to choose the equivalent string type in D. Plus 
I read here http://www.digitalmars.com/techtips/windows_utf.html that char[] 
is not directly compatible with the ANSI versions of the Windows API (again, 
I'm using this a lot).
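
The tech-tip's suggestion, as I read it, is to call the "W" entry points with UTF-16 text, roughly like this (just a sketch - MessageBoxW is declared by hand here, and toUTF16z is the std.utf helper that returns a null-terminated wchar*):

    import std.utf;

    // Hand-written declaration of a "W" API; normally this would come
    // from a Windows API binding.
    extern (Windows) int MessageBoxW(void* hWnd, wchar* text, wchar* caption, uint type);

    void show(char[] msg)
    {
        // toUTF16z converts UTF-8 to a null-terminated UTF-16 string.
        MessageBoxW(null, toUTF16z(msg), toUTF16z("Example"), 0);
    }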

Given the above considerations, which do you advise I go with?

Cheers,
John.

P.S. Here's an idea: perhaps Walter could add UTF8, UTF16 and UTF32 version 
identifiers which we could use on the command line to tell the compiler to 
expect that string type as the default (e.g., "version=UTF16"). It would 
then mean that char[] becomes an alias for the specified type. When the type 
is not specified, char[] goes back to being UTF8. 
Mar 09 2005
parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
 Which would be the best string type to use - char[], wchar[] or dchar[]? I 
 want to choose one of them and stick with it throughout my code for the 
 sake of consistency. My preference would be for wchar[] but using it is 
 not as smooth as I'd hoped. For example, Object.toString() returns char[], 
 Phobos seems not to have wchar versions for integer-to-string conversions, 
 and concatenating sometimes requires casts. It's not too bad, I suppose: I 
 can use free functions to encode/decode strings and write my own integer 
 conversion routines. But I am puzzled as to why I need to cast when 
 concatenating, e.g.:

    wchar[] text = cast(wchar[])"The quick brown " ~ quickAnimalStr ~ 
 cast(wchar[])" jumped over the lazy " ~ lazyAnimalStr;

 Anyway, I'm doing a lot of text processing on Windows XP, which uses UTF16 
 natively, so it seemed sane to choose the equivalent string type in D. 
 Plus I read here http://www.digitalmars.com/techtips/windows_utf.html that 
 char[] is not directly compatible with the ANSI versions of the Windows 
 API (again, I'm using this a lot).
You've pretty much summed up all the pros and cons. XP uses wchars natively, but Phobos is not too kind to them. I just use char[] as I'm not planning on translating my programs into languages which use non-roman alphabets any time soon ;)
 P.S. Here's an idea: perhaps Walter could add UTF8, UTF16 and UTF32 
 version identifiers which we could use on the command line to tell the 
 compiler to expect that string type as the default (e.g., 
 "version=UTF16"). It would then mean that char[] becomes an alias for the 
 specified type. When the type is not specified, char[] goes back to being 
 UTF8.
Don't know about it being a language feature, but perhaps something that could be added to the runtime. Something like a conditional alias that would define a type like "nchar" to mean "native char".
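Roughly like this (just a sketch - the UTF8/UTF16/UTF32 identifiers are the ones proposed above, not anything that exists today):

    // Pick the "native char" type with a version identifier passed on
    // the command line, e.g. dmd -version=UTF16 ...
    version (UTF16)
    {
        alias wchar nchar;   // native char is UTF-16
    }
    else version (UTF32)
    {
        alias dchar nchar;   // native char is UTF-32
    }
    else
    {
        alias char nchar;    // default: UTF-8
    }

    // Code can then be written against nchar[]:
    // nchar[] greeting = "hello";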
Mar 09 2005
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
 I just use char[] as I'm not planning on translating my programs into 
 languages which use non-roman alphabets any time soon ;)
You'll be surprised, but even the Latin-1 set does not fit into a single char: http://www.bbsinc.com/symbol.html

For example, if you don't use wchar you will not be able to see e.g. the Euro sign as one char - two UTF-8 bytes. :) "Anders F Bjorklund"'s name will also be represented with one byte more, etc.
Mar 09 2005
parent reply "Anders F Björklund" <afb algonet.se> writes:
Andrew Fedoniouk wrote:

I just use char[] as I'm not planning on translating my programs into 
languages which use non-roman alphabets any time soon ;)
You'll be surprised, but even the Latin-1 set does not fit into a single char.
He probably meant "non-US"? (a lone char holds only US-ASCII characters)
 For example, if you don't use wchar you will not be able to see e.g.
 the Euro sign as one char - two UTF-8 bytes.
Three, actually:

    char[1] euro = "\u20AC";

and the compiler complains:

    cannot implicitly convert expression "\u20ac" of type char[3] to char[1]
http://www.fileformat.info/info/unicode/char/20ac/index.htm

Some characters even take 4 bytes.
 :) "Anders F Bjorklund" name will also be represented with one byte more.
It actually messed up GDC: my name was added in Latin-1 in a comment... (when I added the patch to DMD that actually made it check comments too)

It's even more fun when using .length, as it returns bytes (code units).

I use char[] and dchar, myself (and not wchar[] and wchar, like Java).

--anders
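
PS. A tiny example of the .length pitfall (just a sketch, but it should hold for any of the three array types):

    // .length counts code units, not characters, so the same text
    // has a different length in each encoding:
    char[]  u8  = "\u20AC";  // Euro sign
    wchar[] u16 = "\u20AC";
    dchar[] u32 = "\u20AC";

    assert(u8.length  == 3); // three UTF-8 code units
    assert(u16.length == 1); // one UTF-16 code unit
    assert(u32.length == 1); // one UTF-32 code unit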
Mar 10 2005
parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
 He probably meant "non-US"? (a lone char holds only US-ASCII characters)
That's it. It seems weird though that such a (relatively) common letter as umlaut-o would be represented as 2 bytes in UTF8. Maybe I'm thinking of ASCII (and not the old kind, where chars 128-255 are lines and stuff).
Mar 10 2005