digitalmars.D - Re: Wide characters support in D
- Ruslan Nikolaev <nruslan_devel yahoo.com> Jun 07 2010
- Walter Bright <newshound1 digitalmars.com> Jun 07 2010
Just one more addition: it is possible to have built-in function that converts multibyte (or multiword) char sequence (even though in my proposal it can be of different size) to dchar (UTF-32) character. Again, my only point is that it would be nice to have something similar to TCHAR so that all libraries can use it if they choose not to provide functions for all 3 types.

2Walter:
Yes, programmers do often ignore surrogate pairs in case of UTF-16. But in case of undetermined char size (1 or 2 bytes) they will have to use special builtin conversion functions to dchar unless they want their code to be completely broken.

Thanks,
Ruslan.

--- On Tue, 6/8/10, Ruslan Nikolaev <nruslan_devel yahoo.com> wrote:

> From: Ruslan Nikolaev <nruslan_devel yahoo.com>
> Subject: Re: Wide characters support in D
> To: "digitalmars.D" <digitalmars-d puremagic.com>
> Date: Tuesday, June 8, 2010, 3:16 AM
>
> Ok, ok... that was just a suggestion... Thanks, for reply about "Hello world" representation. Was postfix "w" and "d" added initially or just recently? I did not know about it. I thought D does automatic conversion for string literals.
>
> Yes, templates may help. However, that unnecessary make code bigger (since we have to compile it for every char type). The other problem is that it allows programmer to choose which one to use. He or she may just prefer char[] as UTF-8 (or wchar[] as UTF-16). That will be fine on platform that supports this encoding natively (e.g. for file system operations, screen output, etc.), whereas it will cause conversion overhead on the other. Not to say that it's a big overhead, but unnecessary one. Having said this, I do agree that there must be some flexibility (e.g. in Java char[] is always 2 bytes), however, I don't believe that this flexibility should be available for application programmer.
>
> I do [...] act, that would make programs better (since application programmers will have to think in terms of characters as opposed to bytes). System programmers (i.e. OS programmers) may choose to think as they expect it to be (since char width option can be added to compiler). TCHAR in Windows is a good example of it. Whenever you need to determine size of element (e.g. for allocation), you can use 'sizeof'. Again, it does not mean that you're deprived of char/wchar/dchar capability. It still can be supported (e.g. via ubyte/ushort/uint) for the sake of interoperability or some special cases. Special string constants (e.g. ""b, ""w, ""d) can be supported, too. My only point is that it would be good to have universal char type that depends on platform. That, in turns, allows to have unified char for all libraries on this platform.
>
> In addition, commonly used constants '\n', '\r', '\t' will be the same regardless of char width.
>
> Anyway, that was just a suggestion. You may disagree with this if you wish.
>
> Ruslan.
Jun 07 2010
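For readers following the proposal, here is a minimal sketch of what a TCHAR-like type could look like in today's D. The alias `tchar` and the helper `toCodePoints` are hypothetical names made up for illustration; they are not part of the language or of Phobos, and the choice of `wchar` on Windows is an assumption about what the platform prefers. Only `std.utf.decode` is an existing library function.

```d
import std.stdio : writeln;
import std.utf : decode;

// Hypothetical alias (not in D or Phobos): the code unit type assumed
// to match the platform's native string APIs.
version (Windows)
    alias tchar = wchar;   // UTF-16 code units
else
    alias tchar = char;    // UTF-8 code units

// Hypothetical helper: decode a tchar string into UTF-32 code points,
// whichever width tchar happens to have on this platform.
dchar[] toCodePoints(const(tchar)[] s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode(s, i);   // decode advances i past one code point
    return result;
}

void main()
{
    // An unsuffixed string literal converts implicitly to either width.
    immutable(tchar)[] text = "na\u00EFve \U0001D11E";

    auto cps = toCodePoints(text);
    assert(cps.length == 7);      // 7 code points regardless of tchar

    writeln(tchar.sizeof, " byte(s) per code unit, ",
            cps.length, " code points");
}
```

The number of code units in `text` differs between the two configurations, which is why any code written against such an alias would have to go through a decoding step like this one and could not index the array directly.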
Ruslan Nikolaev wrote:
> Just one more addition: it is possible to have built-in function that converts multibyte (or multiword) char sequence (even though in my proposal it can be of different size) to dchar (UTF-32) character. Again, my only point is that it would be nice to have something similar to TCHAR so that all libraries can use it if they choose not to provide functions for all 3 types.
>
> 2Walter:
> Yes, programmers do often ignore surrogate pairs in case of UTF-16. But in case of undetermined char size (1 or 2 bytes) they will have to use special builtin conversion functions to dchar unless they want their code to be completely broken.
The nice thing about char[] is that you'll find out real fast if your multibyte code is broken. With surrogate pairs in wchar[], the bug may lurk undetected for a decade.
Jun 07 2010
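The point about bugs surfacing early in char[] and late in wchar[] can be illustrated with a small sketch. `naiveCharCount` is a hypothetical helper written only for this example; it stands in for any code that assumes one code unit per character. `std.utf.count`, which returns the number of code points, is the existing Phobos function used as the reference.

```d
import std.stdio : writeln;
import std.utf : count;

// Hypothetical "buggy" helper: assumes one code unit per character.
size_t naiveCharCount(C)(const(C)[] s)
{
    return s.length;
}

void main()
{
    string  a8  = "\u00E9";        // e with acute accent, inside the BMP
    wstring a16 = "\u00E9"w;
    string  b8  = "\U0001D11E";    // musical G clef, outside the BMP
    wstring b16 = "\U0001D11E"w;

    // UTF-8: the assumption fails on the first non-ASCII character.
    assert(naiveCharCount(a8) == 2 && count(a8) == 1);

    // UTF-16: the assumption holds for every BMP character ...
    assert(naiveCharCount(a16) == 1 && count(a16) == 1);

    // ... and fails only once a surrogate pair shows up.
    assert(naiveCharCount(b16) == 2 && count(b16) == 1);
    assert(naiveCharCount(b8) == 4 && count(b8) == 1);

    writeln("all checks passed");
}
```

With UTF-8 the mismatch appears as soon as test data contains any accented letter, while with UTF-16 it needs a character outside the Basic Multilingual Plane, which much test data never contains.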