digitalmars.D - C strings - byte, ubyte or char? (Discussion from Bugzilla)
- Stewart Gordon (45/61) Oct 04 2007 This bug report
- Matti Niemenmaa (17/36) Oct 04 2007 Good idea. But note that I'm not talking only about C string-processing
- Stewart Gordon (14/35) Oct 04 2007 "Matti Niemenmaa" wrote in message
- Matti Niemenmaa (7/19) Oct 04 2007 You're right. I was going to write a long answer involving accidentally ...
- Regan Heath (9/17) Oct 04 2007 ...
- Frits van Bommel (2/5) Oct 05 2007 IIRC MessageBoxW takes its string in UTF-16, so a wchar* should be fine.
- Regan Heath (2/8) Oct 05 2007 Ahh, of course, silly me :)
This bug report http://d.puremagic.com/issues/show_bug.cgi?id=1357 "Cannot use FFFF and FFFE in Unicode escape sequences." has drifted into a discussion about which type should be used for D bindings of C string-processing functions. I hereby propose that we continue the discussion here.

I'll summarise what's been discussed so far here, and add a few points.

First it was proposed that the C string type should become ubyte*, rather than char*, in D. (Actually, whether it should be byte or ubyte depends on whether the underlying C implementation treats unqualified char as signed or unsigned. But it probably doesn't matter to the implementations of most string functions.)

C APIs use the char type and its derivatives for two main purposes:
- code units in an arbitrary 8-bit character encoding
- byte-sized integer values of arbitrary semantics

Both these uses distinguish it from D's char type, which is intended specifically for holding UTF-8 code units. For both these uses, byte or ubyte is more appropriate, and so it would make sense to have these types as the standard in D for communicating such data with C APIs. This would entail:
- changing the string functions in std.c.* to use ubyte where they use char at the moment
- changing std.string.toStringz to return ubyte* instead of char*, and having a corresponding char[] toString(ubyte*) function
- changing the language to allow string literals to serve as ubyte* as well as the types that they already serve as.

A number of C types have been renamed in D: http://www.digitalmars.com/d/htod.html "Type Mappings". Taking the view that C's char type is renamed ubyte would be just another of these.

I'm now going to respond to Matti's latest comments on Bugzilla:

> It'd have to go beyond just string literals:
>
>     string foo = "asdf";
>     int i = strlen(foo.ptr);
>
> Bad example, I know (who needs strlen?), but the above should work without having to cast foo.ptr from char* (or invariant(char)* if that's what it is in 2.0) to ubyte*. Passing through toStringz at every call may not be an option.

Why might it not be an option? And what about having to pass it through toStringz anyway, for the very reason toStringz exists in the first place?

>> But is it really an inconsistency? Really, all that's happened is that C's signed char has been renamed as byte, and C's unsigned char as ubyte. It's no more inconsistent than unsigned int being renamed uint, and long long being renamed long.
>
> No, not really, but Walter seems to think it important that code that looks like C should work like it does in C. I agree with that sentiment to a point, and thus minimizing such inconsistencies is a good idea. In this case, however, I'd rather have the inconsistency.

The "looks like C, acts like C" principle doesn't seem to be consistently applied - the switch default error and the renaming of long long to long are just two places where it breaks down. But this is a case where the difference would be in whether it compiles or not, and so it's more a matter of D not being source-compatible with C, which is a good design decision indeed.

Further comments?

Stewart.

-- 
My e-mail address is valid but not my primary mailbox. Please keep replies on the 'group where everybody may benefit.
Oct 04 2007
Stewart Gordon wrote:
> This bug report http://d.puremagic.com/issues/show_bug.cgi?id=1357 "Cannot use FFFF and FFFE in Unicode escape sequences." has drifted into a discussion about which type should be used for D bindings of C string-processing functions. I hereby propose that we continue the discussion here.

Good idea. But note that I'm not talking only about C string-processing functions: in general, any functions which process strings without regard to their encoding should use ubytes. Just about all of std.string are such, for instance. The Tango situation is better, since tango.text.Util is already templated for char/wchar/dchar: ubyte would need to be added to the mix.

<snip>

>> It'd have to go beyond just string literals: ... Bad example, I know (who needs strlen?), but the above should work without having to cast foo.ptr from char* (or invariant(char)* if that's what it is in 2.0) to ubyte*. Passing through toStringz at every call may not be an option.
>
> Why might it not be an option? And what about having to pass it through toStringz anyway, for the very reason toStringz exists in the first place?

One problem with toStringz is efficiency. Its current implementation performs a string concatenation every time. If you know the string is zero terminated and ASCII (or you just want it to be handled as encoding-agnostic), you should just be able to pass it through.

But on second thought, having the cast (or a call to toStringz) be necessary might be better. If you want UTF-8 to be handled as encoding-agnostic, a necessary cast may be a good idea, as it implies you know what you're doing.

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Oct 04 2007
"Matti Niemenmaa" <see_signature for.real.address> wrote in message news:fe2t70$2eka$1 digitalmars.com... <snip>Good idea. But note that I'm not talking only about C string-processing functions: in general, any functions which process strings without regard to their encoding should use ubytes. Just about all of std.string are such, for instance.Looks like I'll have to investigate....The Tango situation is better, since tango.text.Util is already templated for char/wchar/dchar: ubyte would need to be added to the mix.<snip>One problem with toStringz is efficiency. Its current implementation of performs a string concatenation every time. If you know the string is zero terminated and ASCII (or you just want it to be handled as encoding-agnostic), you should just be able to pass it through.I had no idea that the implementation had changed.But on second thought, having the cast (or a call to toStringz) be necessary might be better. If you want UTF-8 to be handled as encoding-agnostic, a necessary cast may be a good idea, as it implies you know what you're doing.Why should I care that a function is encoding-agnostic if I know what encoding my text is in? That sounds to me like suggesting that I should have to cast class instances explicitly to Object to prove I know that the function can use objects of any class. Stewart. -- My e-mail address is valid but not my primary mailbox. Please keep replies on the 'group where everybody may benefit.
Oct 04 2007
Stewart Gordon wrote:"Matti Niemenmaa" <see_signature for.real.address> wrote in message news:fe2t70$2eka$1 digitalmars.com...You're right. I was going to write a long answer involving accidentally calling an ubyte-taking function instead of a char-taking function when I realized that having char implicitly convert to ubyte doesn't mean you can't overload functions taking both. My bad! -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fiBut on second thought, having the cast (or a call to toStringz) be necessary might be better. If you want UTF-8 to be handled as encoding-agnostic, a necessary cast may be a good idea, as it implies you know what you're doing.Why should I care that a function is encoding-agnostic if I know what encoding my text is in? That sounds to me like suggesting that I should have to cast class instances explicitly to Object to prove I know that the function can use objects of any class.
Oct 04 2007
Stewart Gordon wrote:
> First it was proposed that the C string type should become ubyte*, rather than char*, in D. (Actually, whether it should be byte or ubyte depends on whether the underlying C implementation treats unqualified char as signed or unsigned. But it probably doesn't matter to the implementations of most string functions.)

+votes

> This would entail:
> ...
> - changing the language to allow string literals to serve as ubyte* as well as the types that they already serve as.

In this case doesn't the compiler need to know which encoding to use? For example, say you're passing a string literal to MessageBoxA: it needs to encode the string literal using the local code page, right?

Shouldn't it also be possible to pass a string literal to MessageBoxW, in which case the same treatment needs to be applied to wchar_t* AKA short*.

Regan
Oct 04 2007
Regan Heath wrote:
> Shouldn't it also be possible to pass a string literal to MessageBoxW, in which case the same treatment needs to be applied to wchar_t* AKA short*.

IIRC MessageBoxW takes its string in UTF-16, so a wchar* should be fine.
Oct 05 2007
Frits van Bommel wrote:
> Regan Heath wrote:
>> Shouldn't it also be possible to pass a string literal to MessageBoxW, in which case the same treatment needs to be applied to wchar_t* AKA short*.
>
> IIRC MessageBoxW takes its string in UTF-16, so a wchar* should be fine.

Ahh, of course, silly me :)
Oct 05 2007