digitalmars.D - Re: Wide characters support in D
Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Just one more addition: it is possible to have built-in function that converts multibyte (or multiword) char sequence (even though in my proposal it can be of different size) to dchar (UTF-32) character. Again, my only point is that it would be nice to have something similar to TCHAR so that all libraries can use it if they choose not to provide functions for all 3 types.

2Walter:
Yes, programmers do often ignore surrogate pairs in case of UTF-16. But in case of undetermined char size (1 or 2 bytes) they will have to use special builtin conversion functions to dchar unless they want their code to be completely broken.

Thanks,
Ruslan.

--- On Tue, 6/8/10, Ruslan Nikolaev <nruslan_devel yahoo.com> wrote:

> From: Ruslan Nikolaev <nruslan_devel yahoo.com>
> Subject: Re: Wide characters support in D
> To: "digitalmars.D" <digitalmars-d puremagic.com>
> Date: Tuesday, June 8, 2010, 3:16 AM
>
> Ok, ok... that was just a suggestion... Thanks for the reply about the
> "Hello world" representation. Were the postfixes "w" and "d" there from the
> start or added just recently? I did not know about them. I thought D does
> automatic conversion for string literals.
>
> Yes, templates may help. However, that unnecessarily makes the code bigger
> (since we have to compile it for every char type). The other problem is
> that it allows the programmer to choose which one to use. He or she may
> just prefer char as UTF-8 (or wchar as UTF-16). That will be fine on a
> platform that supports this encoding natively (e.g. for file system
> operations, screen output, etc.), whereas it will cause conversion overhead
> on the other. Not to say that it's a big overhead, but an unnecessary one.
> Having said this, I do agree that there must be some flexibility (e.g. in
> Java char is always 2 bytes); however, I don't believe that this
> flexibility should be available to the application programmer.
>
> I do [...] act, that would make programs better (since application
> programmers will have to think in terms of characters as opposed to bytes).
> System programmers (i.e. OS programmers) may choose to think as they expect
> it to be (since a char width option can be added to the compiler). TCHAR in
> Windows is a good example of it. Whenever you need to determine the size of
> an element (e.g. for allocation), you can use 'sizeof'. Again, it does not
> mean that you're deprived of char/wchar/dchar capability. It still can be
> supported (e.g. via ubyte/ushort/uint) for the sake of interoperability or
> some special cases. Special string constants (e.g. ""b, ""w, ""d) can be
> supported, too. My only point is that it would be good to have a universal
> char type that depends on the platform. That, in turn, allows a unified
> char for all libraries on this platform.
>
> In addition, commonly used constants '\n', '\r', '\t' will be the same
> regardless of char width.
>
> Anyway, that was just a suggestion. You may disagree with this if you wish.
>
> Ruslan.
Jun 07 2010
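For readers following along, here is a rough sketch of what the post above seems to describe: a TCHAR-like character type whose width depends on the platform, with a conversion function used to reach whole dchar code points. The alias name `tchar` and the helper `countCodePoints` are made up for illustration and are not part of D or Phobos; `std.utf.decode` is the existing Phobos function that plays the role of the "conversion function to dchar".

```d
import std.utf : decode;

// Hypothetical platform-dependent character type, in the spirit of TCHAR.
// The name `tchar` is an assumption for this sketch; D has no such type.
version (Windows)
    alias tchar = wchar;   // UTF-16 code units
else
    alias tchar = char;    // UTF-8 code units

// Counts code points by decoding one dchar at a time, so the same
// source works whichever width `tchar` happens to have.
size_t countCodePoints(const(tchar)[] s)
{
    size_t i = 0;   // index in code units
    size_t n = 0;   // number of code points seen
    while (i < s.length)
    {
        dchar c = decode(s, i);   // advances i past one full code point
        ++n;
    }
    return n;
}

void main()
{
    // 'a', e with acute accent (U+00E9), musical G clef (U+1D11E)
    immutable(tchar)[] s = "a\u00E9\U0001D11E";

    // Three code points regardless of the width of tchar ...
    assert(countCodePoints(s) == 3);

    // ... but s.length differs: 7 code units as UTF-8, 4 as UTF-16.
    assert(s.length == (tchar.sizeof == 1 ? 7 : 4));
}
```

Code that indexes `s` directly would behave differently on the two platforms, which is why such a type would push application code toward the decoding function.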
Walter Bright <newshound1 digitalmars.com> writes:
Ruslan Nikolaev wrote:
> Just one more addition: it is possible to have built-in function that converts multibyte (or multiword) char sequence (even though in my proposal it can be of different size) to dchar (UTF-32) character. Again, my only point is that it would be nice to have something similar to TCHAR so that all libraries can use it if they choose not to provide functions for all 3 types.
>
> 2Walter:
> Yes, programmers do often ignore surrogate pairs in case of UTF-16. But in case of undetermined char size (1 or 2 bytes) they will have to use special builtin conversion functions to dchar unless they want their code to be completely broken.
The nice thing about char is that you'll find out real fast if your multibyte code is broken. With surrogate pairs in wchar, the bug may lurk undetected for a decade.
Jun 07 2010
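A small sketch of the point in the reply above, assuming the usual mistake of treating `.length` (code units) as the number of characters. The helper `codePoints` is a made-up name for this example; it relies only on the language feature that `foreach` with a `dchar` loop variable decodes narrow strings.

```d
// Hypothetical helper: counts code points in any D string type.
size_t codePoints(S)(S s)
{
    size_t n = 0;
    foreach (dchar c; s)   // foreach decodes code units into dchar
        ++n;
    return n;
}

void main()
{
    // Ordinary accented text already exposes the mismatch with char:
    string s8 = "caf\u00E9";             // U+00E9 takes 2 UTF-8 code units
    assert(s8.length == 5);
    assert(codePoints(s8) == 4);

    // The same text in wchar looks fine, so the bug stays hidden:
    wstring s16 = "caf\u00E9";
    assert(s16.length == 4);
    assert(codePoints(s16) == 4);

    // It only surfaces for characters outside the BMP (surrogate pairs):
    wstring rare = "\U0001D11E";         // musical G clef
    assert(rare.length == 2);
    assert(codePoints(rare) == 1);
}
```

With char, the first non-ASCII test string trips the faulty assumption; with wchar, it takes a comparatively rare character to do so.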