
digitalmars.D - Re: Wide characters support in D

reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Just one more addition: it is possible to have built-in function that converts multibyte (or multiword) char sequence (even though in my proposal it can be of different size) to dchar (UTF-32) character. Again, my only point is that it would be nice to have something similar to TCHAR so that all libraries can use it if they choose not to provide functions for all 3 types.

2Walter:
Yes, programmers do often ignore surrogate pairs in case of UTF-16. But in case of undetermined char size (1 or 2 bytes) they will have to use special builtin conversion functions to dchar unless they want their code to be completely broken.

Thanks,
Ruslan.

--- On Tue, 6/8/10, Ruslan Nikolaev <nruslan_devel yahoo.com> wrote:

> From: Ruslan Nikolaev <nruslan_devel yahoo.com>
> Subject: Re: Wide characters support in D
> To: "digitalmars.D" <digitalmars-d puremagic.com>
> Date: Tuesday, June 8, 2010, 3:16 AM
> Ok, ok... that was just a
> suggestion... Thanks, for reply about "Hello world"
> representation. Was postfix "w" and "d" added initially or
> just recently? I did not know about it. I thought D does
> automatic conversion for string literals.
>
> Yes, templates may help. However, that unnecessary make
> code bigger (since we have to compile it for every char
> type). The other problem is that it allows programmer to
> choose which one to use. He or she may just prefer char[] as
> UTF-8 (or wchar[] as UTF-16). That will be fine on platform
> that supports this encoding natively (e.g. for file system
> operations, screen output, etc.), whereas it will cause
> conversion overhead on the other. Not to say that it's a big
> overhead, but unnecessary one. Having said this, I do agree
> that there must be some flexibility (e.g. in Java char[] is
> always 2 bytes), however, I don't believe that this
> flexibility should be available for application programmer.
>
> I do [...]act, that would make programs better
> (since application programmers will have to think in terms
> of characters as opposed to bytes). System programmers (i.e.
> OS programmers) may choose to think as they expect it to be
> (since char width option can be added to compiler). TCHAR in
> Windows is a good example of it. Whenever you need to
> determine size of element (e.g. for allocation), you can use
> 'sizeof'. Again, it does not mean that you're deprived of
> char/wchar/dchar capability. It still can be supported (e.g.
> via ubyte/ushort/uint) for the sake of interoperability or
> some special cases. Special string constants (e.g. ""b, ""w,
> ""d) can be supported, too. My only point is that it would
> be good to have universal char type that depends on
> platform. That, in turns, allows to have unified char for
> all libraries on this platform.
>
> In addition, commonly used constants '\n', '\r', '\t' will
> be the same regardless of char width.
>
> Anyway, that was just a suggestion. You may disagree with
> this if you wish.
>
> Ruslan.
Jun 07 2010
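
[Editorial sketch, not code from the thread.] The TCHAR-like type proposed above can be roughly approximated in library code today, assuming a `version` block picks the code unit width per platform and `std.utf.decode` serves as the conversion to dchar. The names `tchar`, `tstring`, and `firstCodePoint` are hypothetical, chosen only for illustration; the post asks for a built-in type, which this is not.

```d
import std.utf : decode;

// Hypothetical platform-dependent character type, in the spirit of TCHAR.
version (Windows)
    alias tchar = wchar;   // UTF-16 code units
else
    alias tchar = char;    // UTF-8 code units

// Hypothetical string type built on the alias above.
alias tstring = immutable(tchar)[];

/// Decodes the first code point of a non-empty `s`,
/// regardless of how many code units it occupies.
dchar firstCodePoint(tstring s)
{
    size_t index = 0;
    return decode(s, index);   // advances index past one complete sequence
}
```

An unsuffixed literal such as `"\U0001D11E"` converts implicitly to whichever string type the parameter has, so a call like `firstCodePoint("\U0001D11E")` should compile unchanged under either definition of `tchar`.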
parent Walter Bright <newshound1 digitalmars.com> writes:
Ruslan Nikolaev wrote:
 Just one more addition: it is possible to have built-in function that
 converts multibyte (or multiword) char sequence (even though in my proposal
 it can be of different size) to dchar (UTF-32) character. Again, my only
 point is that it would be nice to have something similar to TCHAR so that all
 libraries can use it if they choose not to provide functions for all 3 types.
 
 
 2Walter: Yes, programmers do often ignore surrogate pairs in case of UTF-16.
 But in case of undetermined char size (1 or 2 bytes) they will have to use
 special builtin conversion functions to dchar unless they want their code to
 be completely broken.

The nice thing about char[] is that you'll find out real fast if your multibyte code is broken. With surrogate pairs in wchar[], the bug may lurk undetected for a decade.
Jun 07 2010
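
[Editorial sketch, not code from the thread.] One way to see the point about bugs lurking in wchar[] is to compare the number of code units with the number of code points. The helper name `codeUnitsMatchCodePoints` is hypothetical, and the sketch assumes Phobos's `walkLength` decodes narrow strings into code points as it iterates.

```d
import std.range : walkLength;

/// Returns true when every code point in `s` fits in a single code unit,
/// i.e. when code that indexes by code unit happens to behave correctly.
bool codeUnitsMatchCodePoints(S)(S s)
{
    // s.length counts code units; walkLength counts decoded code points.
    return s.length == s.walkLength;
}

unittest
{
    // UTF-8: the first non-ASCII character already exposes the mismatch.
    assert(!codeUnitsMatchCodePoints("h\u00E9llo"));

    // UTF-16: the same text looks fine, since U+00E9 is one code unit.
    assert(codeUnitsMatchCodePoints("h\u00E9llo"w));

    // UTF-16: the mismatch appears only outside the Basic Multilingual
    // Plane, where a code point needs a surrogate pair.
    assert(!codeUnitsMatchCodePoints("\U0001D11E"w));
}
```

With char[], ordinary accented text is enough to trip code that treats code units as characters; with wchar[], it takes a character outside the BMP, which many test inputs never contain.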