
digitalmars.D - C strings - byte, ubyte or char? (Discussion from Bugzilla)

reply "Stewart Gordon" <smjg_1998 yahoo.com> writes:
This bug report
http://d.puremagic.com/issues/show_bug.cgi?id=1357
"Cannot use FFFF and FFFE in Unicode escape sequences."

has drifted into a discussion about which type should be used for D bindings 
of C string-processing functions.  I hereby propose that we continue the 
discussion here.

I'll summarise here what's been discussed so far, and add a few points.

First it was proposed that the C string type should become ubyte*, rather 
than char*, in D.  (Actually, whether it should be byte or ubyte depends on 
whether the underlying C implementation treats unqualified char as signed or 
unsigned.  But it probably doesn't matter to the implementations of most 
string functions.)

C APIs use the char type and its derivatives for two main purposes:
- code units in an arbitrary 8-bit character encoding
- byte-sized integer values of arbitrary semantics

Both these uses distinguish it from D's char type, which is intended 
specifically for holding UTF-8 code units.  For both these uses, byte or 
ubyte is more appropriate, and so it would make sense to have these types as 
the standard in D for communicating such data with C APIs.
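
To make the distinction concrete, here is a minimal hypothetical sketch (processUtf8 and processRaw are made-up names):

    // D's char is defined to carry UTF-8 code units; ubyte carries
    // arbitrary byte values, whatever the encoding.
    void processUtf8(char[] text) { /* may assume well-formed UTF-8 */ }
    void processRaw(ubyte[] data) { /* any 8-bit data, any encoding */ }

    void main()
    {
        processUtf8("héllo");  // a D string literal is UTF-8 by definition

        // "héllo" encoded in Latin-1: a lone 0xE9 is not valid UTF-8,
        // so ubyte[] is the honest type for it
        ubyte[] latin1 = [cast(ubyte) 0x68, 0xE9, 0x6C, 0x6C, 0x6F];
        processRaw(latin1);
    }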

This would entail (see the sketch after this list):
- changing the string functions in std.c.* to use ubyte where they use char 
at the moment
- changing std.string.toStringz to return ubyte* instead of char*, and 
having a corresponding char[] toString(ubyte*) function
- changing the language to allow string literals to serve as ubyte* as well 
as the types that they already serve as.
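
A rough sketch of what the first two changes might look like (hypothetical code; toStringC and toStringD are made-up names to avoid clashing with the existing Phobos functions):

    import std.c.string;   // strlen, still declared against char* here

    // toStringz returning ubyte* instead of char*
    ubyte* toStringC(char[] s)
    {
        return cast(ubyte*) (s ~ '\0').ptr;
    }

    // the corresponding inverse, char[] toString(ubyte*)
    char[] toStringD(ubyte* s)
    {
        char* p = cast(char*) s;
        return p[0 .. strlen(p)];
    }

    // The third change is purely in the language: a declaration such as
    //     ubyte* p = "hello";
    // is an error today, but the literal would be allowed to convert.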

A number of C types have been renamed in D:
http://www.digitalmars.com/d/htod.html
"Type Mappings"

Taking the view that C's char type has been renamed to ubyte would make this 
just another of these mappings.

I'm now going to respond to Matti's latest comments on Bugzilla:

 It'd have to go beyond just string literals:

 string foo = "asdf";
 int i = strlen(foo.ptr);

 Bad example, I know (who needs strlen?), but the above should work
 without having to cast foo.ptr from char* (or invariant(char)* if
 that's what it is in 2.0) to ubyte*.  Passing through toStringz at
 every call may not be an option.
Why might it not be an option? And what about having to pass it through toStringz anyway, for the very reason toStringz exists in the first place?
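
For reference, the two calling styles being weighed look like this today (a sketch against the current char*-based bindings, D1-style):

    import std.c.string;   // strlen
    import std.string;     // toStringz

    void main()
    {
        char[] foo = "asdf";   // literals happen to be zero-terminated

        size_t a = strlen(toStringz(foo));   // always safe, may copy
        size_t b = strlen(foo.ptr);          // no copy, relies on terminator

        assert(a == 4 && b == 4);
    }
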
 But is it really an inconsistency?  Really, all that's happened is
 that C's signed char has been renamed as byte, and C's unsigned
 char as ubyte.  It's no more inconsistent than unsigned int being
 renamed uint, and long long being renamed long.
No, not really, but Walter seems to think it important that code that looks like C should work like it does in C. I agree with that sentiment to a point, and thus minimizing such inconsistencies is a good idea. In this case, however, I'd rather have the inconsistency.
The "looks like C, acts like C" principle doesn't seem to be consistently applied - the switch default error and the renaming of long long to long are just two places where it breaks down. But this is a case where the difference would be in whether it compiles or not, and so it's more a matter of D not being source-compatible with C, which is a good design decision indeed. Further comments? Stewart. -- My e-mail address is valid but not my primary mailbox. Please keep replies on the 'group where everybody may benefit.
Oct 04 2007
Matti Niemenmaa <see_signature for.real.address> writes:
Stewart Gordon wrote:
 This bug report
 http://d.puremagic.com/issues/show_bug.cgi?id=1357
 "Cannot use FFFF and FFFE in Unicode escape sequences."
 
 has drifted into a discussion about which type should be used for D
 bindings of C string-processing functions.  I hereby propose that we
 continue the discussion here.
Good idea. But note that I'm not talking only about C string-processing functions: in general, any functions which process strings without regard to their encoding should use ubytes. Just about all of std.string are such, for instance.

The Tango situation is better, since tango.text.Util is already templated for char/wchar/dchar: ubyte would need to be added to the mix.

<snip>
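
As a sketch of what adding ubyte to such a template mix could look like (hypothetical code, not the actual tango.text.Util):

    // an encoding-agnostic count of a given code unit / byte value,
    // instantiable for ubyte just as well as for char/wchar/dchar
    size_t count(T)(T[] haystack, T needle)
    {
        size_t n = 0;
        foreach (c; haystack)
            if (c == needle)
                ++n;
        return n;
    }

    void main()
    {
        char[] s = "banana";
        assert(count(s, 'a') == 3);                             // as text
        assert(count(cast(ubyte[]) s, cast(ubyte) 'a') == 3);   // as bytes
    }
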
 It'd have to go beyond just string literals:

 string foo = "asdf";
 int i = strlen(foo.ptr);

 Bad example, I know (who needs strlen?), but the above should work
 without having to cast foo.ptr from char* (or invariant(char)* if
 that's what it is in 2.0) to ubyte*.  Passing through toStringz at
 every call may not be an option.
Why might it not be an option? And what about having to pass it through toStringz anyway, for the very reason toStringz exists in the first place?
One problem with toStringz is efficiency. Its current implementation performs a string concatenation every time. If you know the string is zero-terminated and ASCII (or you just want it to be handled as encoding-agnostic), you should just be able to pass it through.

But on second thought, having the cast (or a call to toStringz) be necessary might be better. If you want UTF-8 to be handled as encoding-agnostic, a necessary cast may be a good idea, as it implies you know what you're doing.
Oct 04 2007
parent reply "Stewart Gordon" <smjg_1998 yahoo.com> writes:
"Matti Niemenmaa" <see_signature for.real.address> wrote in message 
news:fe2t70$2eka$1 digitalmars.com...
<snip>
 Good idea. But note that I'm not talking only about C string-processing
 functions: in general, any functions which process strings without regard
 to their encoding should use ubytes.

 Just about all of std.string are such, for instance.
Looks like I'll have to investigate....
 The Tango situation is better, since tango.text.Util is already templated
 for char/wchar/dchar: ubyte would need to be added to the mix.
<snip>
 One problem with toStringz is efficiency. Its current implementation
 performs a string concatenation every time. If you know the string is
 zero-terminated and ASCII (or you just want it to be handled as
 encoding-agnostic), you should just be able to pass it through.
I had no idea that the implementation had changed.
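
(For illustration, a variant that skips the concatenation when a terminator is already in place might look like this sketch; toStringzFast is a made-up name, and the real implementations may differ:)

    // return s.ptr directly when the slice already ends in '\0';
    // otherwise allocate a zero-terminated copy
    char* toStringzFast(char[] s)
    {
        if (s.length && s[$ - 1] == '\0')
            return s.ptr;
        return (s ~ '\0').ptr;
    }
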
 But on second thought, having the cast (or a call to toStringz) be
 necessary might be better. If you want UTF-8 to be handled as
 encoding-agnostic, a necessary cast may be a good idea, as it implies
 you know what you're doing.
Why should I care that a function is encoding-agnostic if I know what encoding my text is in? That sounds to me like suggesting that I should have to cast class instances explicitly to Object to prove I know that the function can use objects of any class.

Stewart.
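
(The analogy in code, as a hypothetical sketch:)

    class Widget {}
    void take(Object o) {}

    void main()
    {
        Widget w = new Widget;
        take(w);                // fine: the implicit upcast just works
        take(cast(Object) w);   // the "prove you know" style argued against
    }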
Oct 04 2007
Matti Niemenmaa <see_signature for.real.address> writes:
Stewart Gordon wrote:
 "Matti Niemenmaa" <see_signature for.real.address> wrote in message
 news:fe2t70$2eka$1 digitalmars.com...
 But on second thought, having the cast (or a call to toStringz) be
 necessary might be better. If you want UTF-8 to be handled as
 encoding-agnostic, a necessary cast may be a good idea, as it implies
 you know what you're doing.
Why should I care that a function is encoding-agnostic if I know what encoding my text is in? That sounds to me like suggesting that I should have to cast class instances explicitly to Object to prove I know that the function can use objects of any class.
You're right. I was going to write a long answer involving accidentally calling a ubyte-taking function instead of a char-taking function when I realized that having char implicitly convert to ubyte doesn't mean you can't overload functions taking both. My bad!
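
The realization, sketched (hypothetical put functions; exact match beats implicit conversion in D's overload resolution):

    void put(char c)  { /* treat as a UTF-8 code unit */ }
    void put(ubyte b) { /* treat as a raw byte */ }

    void main()
    {
        char  c = 'x';
        ubyte b = 0x78;
        put(c);   // exact match: put(char), even if char -> ubyte converts
        put(b);   // exact match: put(ubyte)
    }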
Oct 04 2007
Regan Heath <regan netmail.co.nz> writes:
Stewart Gordon wrote:
 First it was proposed that the C string type should become ubyte*, 
 rather than char*, in D.  (Actually, whether it should be byte or ubyte 
 depends on whether the underlying C implementation treats unqualified 
 char as signed or unsigned.  But it probably doesn't matter to the 
 implementations of most string functions.)
+votes
 This would entail:
...
 - changing the language to allow string literals to serve as ubyte* as 
 well as the types that they already serve as.
In this case doesn't the compiler need to know which encoding to use? For example, say you're passing a string literal to MessageBoxA: it needs to encode the string literal using the local code page, right?

Shouldn't it also be possible to pass a string literal to MessageBoxW, in which case the same treatment needs to be applied to wchar_t*, AKA short*?

Regan
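
For reference, the two entry points in question (a Windows-only sketch, assuming the std.c.windows.windows bindings):

    import std.c.windows.windows;   // MessageBoxA, MessageBoxW, MB_OK

    void main()
    {
        // ANSI entry point: expects text in the local code page, so a
        // UTF-8 literal is only safe while it stays within ASCII
        MessageBoxA(null, "Hello", "Caption", MB_OK);

        // wide entry point: expects UTF-16, which a wchar literal provides
        MessageBoxW(null, "Hello"w.ptr, "Caption"w.ptr, MB_OK);
    }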
Oct 04 2007
Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Regan Heath wrote:
 Shouldn't it also be possible to pass a string literal to MessageBoxW, 
 in which case the same treatment needs to be applied to wchar_t* AKA 
 short*.
IIRC MessageBoxW takes its string in UTF-16, so a wchar* should be fine.
Oct 05 2007
Regan Heath <regan netmail.co.nz> writes:
Frits van Bommel wrote:
 Regan Heath wrote:
 Shouldn't it also be possible to pass a string literal to MessageBoxW, 
 in which case the same treatment needs to be applied to wchar_t* AKA 
 short*.
IIRC MessageBoxW takes its string in UTF-16, so a wchar* should be fine.
Ahh, of course, silly me :)
Oct 05 2007