
digitalmars.D - String literal consistency and implicit array conversions

reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
Okay, here's another array topic to discuss while we're on this whole "null 
array" issue.

Recently, I was working on some stuff that interfaced with a C function that 
took a char*.  It was working fine.  Then, I changed something, and started 
getting access violations.  Guess what it was?

myFunc("a string"); // works fine
myFunc(`a wysiwyg string?`); // access violation

I tracked down the problem - I wasn't calling toStringz() on the char[] 
before passing it to the C function.

This weird problem occurred because double-quoted string literals in D are 
null-terminated.  I guess this is for quick-and-dirty interop with C 
libraries.  This is inconsistent with D's design, however.   D strings 
should not be null-terminated, as in D, strings are represented by a length 
and data.  Any interaction with strings in C functions should use 
toStringz() on the char[] before passing it into the C function.
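What toStringz() has to do follows directly from the two representations. Here is a C sketch of the equivalent operation — the name to_stringz and this implementation are illustrative assumptions, not the actual phobos code: copy the length-and-data string into a fresh buffer and append the terminator C expects.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of what D's toStringz() must guarantee: given a
 * length-and-data string (D's char[] representation), produce a
 * buffer that C functions can consume, i.e. one ending in '\0'. */
char *to_stringz(const char *data, size_t len)
{
    char *p = malloc(len + 1);   /* room for the terminator */
    if (p == NULL)
        return NULL;
    memcpy(p, data, len);        /* the D slice carries no '\0' ... */
    p[len] = '\0';               /* ... so one must be appended */
    return p;
}
```

A double-quoted D literal happens to have that terminator already in place; a wysiwyg literal, or any slice computed at runtime, may not — hence the access violation above.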

The second issue I had with this was that I was simply passing the array 
into the C function, like so:

void myFunc(char[] str)
{
    someCFunc(str);
}

This is an issue.  someCFunc() takes a char*.  str is not a char*, it's a 
char[].  They are two different things, and a D char[] does not have any 
built-in type that resembles it in C.  It should not be possible to 
implicitly cast T[] to T*; that's why the .ptr property was invented.  I 
can't think of any time when you would need to implicitly cast T[] to T* 
besides a quick-and-dirty way to pass arrays into C functions.  It's sloppy.

Interfacing with C functions should be possible in D, but it should be 
obvious when it happens.  If implicit conversion from T[] to T* is made 
illegal, it will make bugs like passing an un-toStringz'ed char[] into a C 
function show up. 
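What the C side actually sees can be made concrete. A minimal C sketch — the helper name c_side_length is my own, standing in for any C callee like someCFunc above:

```c
#include <string.h>

/* D's char[] is a (length, pointer) pair; C's char* is only the
 * pointer. Once a call crosses into C, the slice's length is gone:
 * the callee can do nothing but walk memory until it hits '\0'. */
size_t c_side_length(const char *s)
{
    return strlen(s);   /* C's only way to find the end */
}
```

So a D slice of the first two characters of "hello" still reports length 5 from C's point of view, and an embedded '\0' truncates the string as far as C is concerned — the length information simply does not cross the boundary.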
Jul 31 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Sun, 31 Jul 2005 12:37:31 -0400, Jarrett Billingsley  
<kb3ctd2 yahoo.com> wrote:
 Okay, here's another array topic to discuss while we're on this whole  
 "null array" issue.

 Recently, I was working on some stuff that interfaced with a C function  
 that took a char*.  It was working fine.  Then, I changed something, and  
 started getting access violations.  Guess what it was?

 myFunc("a string"); // works fine
 myFunc(`a wysiwyg string?`); // access violation

 I tracked down the problem - I wasn't calling toStringz() on the char[]
 before passing it to the C function.

 This weird problem occurred because double-quoted string literals in D  
 are null-terminated.  I guess this is for quick-and-dirty interop with C
 libraries.  This is inconsistent with D's design, however.   D strings
 should not be null-terminated, as in D, strings are represented by a  
 length and data.  Any interaction with strings in C functions should use
 toStringz() on the char[] before passing it into the C function.

 The second issue I had with this was that I was simply passing the array
 into the C function, like so:

 void myFunc(char[] str)
 {
     someCFunc(str);
 }

 This is an issue.  someCFunc() takes a char*.  str is not a char*, it's a
 char[].  They are two different things, and a D char[] does not have any
 built-in type that resembles it in C.  It should not be possible to
 implicitly cast T[] to T*; that's why the .ptr property was invented.  I
 can't think of any time when you would need to implicitly cast T[] to T*
 besides a quick-and-dirty way to pass arrays into C functions.  It's  
 sloppy.

 Interfacing with C functions should be possible in D, but it should be
 obvious when it happens.  If implicit conversion from T[] to T* is made
 illegal, it will make bugs like passing an un-toStringz'ed char[] into a  
 C function show up.

Interesting.. Technically C's "char*" type is analogous to D's "byte*" type, not D's "char*" type. So, technically, the solution is to replace all the "char*" in phobos' C function declarations with "byte*". This will cause errors everywhere a char[] is passed as byte*. To solve these I would replace toStringz with several toX functions, where 'X' is the character set required. We'd need to write these functions; they could verify and/or convert the data, if required, to that character set. The code would then resemble:

char* s = strchr(toISO9660("abc"), 'a');

Or, if you prefer a less invasive solution...

In C a "char*" will be null terminated in all but the strangest cases. I'd guess 99.9% of cases are null terminated. So, when interfacing with C, 99.9% of the time "char*" instances should be null terminated. I don't see much use of "char*" in straight D code; char[] is simply a much better choice.

So, the solution? Well, the one proposed:

 - make T[] to T* illegal.

Would probably solve the issue, though a programmer could still change "array" to "array.ptr" and remove the error but not the crash.

Consider, however, byte, long, int, or other arrays. These arrays are not typically "null terminated", because the value 0 generally has no special meaning for these types. In fact, when using arrays of these types in C, a special sentinel value outside the range of possible values is chosen, if that is possible; if not, a length is passed with the array.

So, as a general rule:

 - make T[] to T* illegal.

has a negative impact on the usability of other array types WRT calling C functions. Granted, these array types are not used anywhere near as often as char.
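The sentinel-vs-length distinction for non-string arrays can be sketched in C — both helper names below are hypothetical:

```c
#include <stddef.h>

/* Sentinel style: a value outside the valid range (here -1)
 * marks the end, the same role '\0' plays for strings. This only
 * works if the sentinel can never occur as real data. */
size_t count_until_sentinel(const int *a)
{
    size_t n = 0;
    while (a[n] != -1)
        n++;
    return n;
}

/* Length style: the count travels alongside the pointer, which is
 * exactly what a D array does implicitly. */
long sum_with_length(const int *a, size_t len)
{
    long s = 0;
    for (size_t i = 0; i < len; i++)
        s += a[i];
    return s;
}
```

For the length style, banning the implicit T[] to T* conversion costs nothing but an explicit .ptr (plus passing .length); it is only the sentinel style that tempts C code to rely on terminators.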
So, the solution: if we assume 99.9% of char* cases should be null terminated, can we ensure/make 100% of cases null terminated at no (or small) negative cost? In other words:

 - make char/wchar/dchar[] to char/wchar/dchar* conversions implicitly call toStringz

This will solve the issue, implicitly, silently; in fact, most users won't even realise it's being done. That, however, is the negative aspect: this could cause a reallocation, and it's all done silently.

Regan
Jul 31 2005
parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message 
news:opsusta6aa23k2f5 nrage.netwin.co.nz...
 Technically C's "char*" type is analogous to D's "byte*" type, not D's 
 "char*" type.

I was getting more at the fact that C doesn't represent arrays in the same way that D does, but that interpretation works too.
 So, technically the solution is to replace all the "char*" in phobos' C 
 function declarations with "byte*". This will cause errors everywhere a 
 char[] is passed as byte*.

 To solve these I would replace toStringz with several toX functions where 
 'X' is the character set required. We'd need to write these functions, 
 they could verify and/or convert the data if required to the character 
 set.

 The code would then resemble:

 char* s = strchr(toISO9660("abc"),'a');

No offense, but I really don't have any idea what you're getting at here :)
 Or, if you prefer a less invasive solution...

 In C a "char*" will be null terminated in all but the strangest cases. I'd 
 guess 99.9% of cases are null terminated. So, when interfacing with C 
 99.9% of the time "char*" instances should be null terminated. I don't see 
 much use of "char*" in straight D code, char[] is simply a much better 
 choice.

 So, the solution? Well, the one proposed:
  - make T[] to T* illegal.

 Would probably solve the issue. Though a programmer could still change 
 "array" to "array.ptr" and remove the error but not the crash.

Yes, but making this illegal:

char[] someString;
someCFunc(someString); // error, no implicit cast from char[] to char*

should be a red flag that says "oh yeah, this is a string, I should probably be doing something to it."  Sure, you could write someString.ptr, but hopefully you'll realize that you need to write toStringz(someString) instead.
 Consider however byte, long, int or other arrays. These arrays are not 
 typically "null terminated" because the value 0 generally has no special 
 meaning for these types. In fact when using arrays of these types in C a 
 special value is chosen which is outside the range of possible values, if 
 that is possible, if not a length is passed with the array.

 So, as a general rule:
  - make T[] to T* illegal.

 has a negative impact on the usability of other array types WRT calling C 
 functions. Granted these array types are not used anywhere near as often 
 as char.

The frequency with which you pass a numerical array to a C function is very low indeed, and the only change that would need to be made is writing ".ptr" explicitly. If nothing else, it makes it more obvious that the array is being turned into a C-style "array."
 So, the solution: If we assume 99.9% of char* cases should be null 
 terminated, can we ensure/make 100% of cases null terminated at no (or 
 small) negative effect, in other words:
  - make char/wchar/dchar[] to char/wchar/dchar* conversions implicitly 
 call toStringz

 This will solve the issue, implicitly, silently, in fact most users won't 
 even realise it's being done. That, however is the negative aspect, this 
 could cause a reallocation and it's all done silenty.

Most of the time, when passing a string to a C function, you're going to be calling toStringz anyway. And about the only time you take advantage of the implicit casting from [w/d]char[] to [w/d]char* is when passing a string to a C function. So there wouldn't really be any loss in speed, though perhaps a non-trivial operation such as toStringz shouldn't be done implicitly, for the sake of clarity. How about a .toStringz property for char[]? ;)
Jul 31 2005
parent "Regan Heath" <regan netwin.co.nz> writes:
On Sun, 31 Jul 2005 19:09:59 -0400, Jarrett Billingsley  
<kb3ctd2 yahoo.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsusta6aa23k2f5 nrage.netwin.co.nz...
 Technically C's "char*" type is analogous to D's "byte*" type, not D's
 "char*" type.

I was getting more at the fact that C doesn't represent arrays in the same way that D does, but that interpretation works too.

I think this is the root of the problem. D's arrays are definitely not C's arrays. At the same time, D's "char*" is not C's "char*" either (though these are more similar), because D's char* is UTF-8 encoded by definition, while C's char* has no definite encoding. D's "byte*" is C's "char*": a pointer to signed 8-bit values.
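The byte* analogy can be made concrete in C — as_signed_byte is a hypothetical helper, and the reinterpretation assumes two's-complement representation (universal in practice, and how D defines byte):

```c
#include <string.h>

/* C's char has no defined encoding, and whether plain char is
 * signed is implementation-defined; D pins both down (char is a
 * UTF-8 code unit, byte is signed 8-bit). The high bytes UTF-8
 * uses for non-ASCII characters land in the negative range when
 * viewed as signed 8-bit values. */
signed char as_signed_byte(unsigned char b)
{
    signed char out;
    memcpy(&out, &b, 1);   /* reinterpret the bit pattern */
    return out;
}
```

ASCII bytes (0x00-0x7F) look the same either way; it is only the upper half of the byte range, where encodings actually disagree, that the char/byte distinction bites.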
 So, technically the solution is to replace all the "char*" in phobos' C
 function declarations with "byte*". This will cause errors everywhere a
 char[] is passed as byte*.

 To solve these I would replace toStringz with several toX functions  
 where
 'X' is the character set required. We'd need to write these functions,
 they could verify and/or convert the data if required to the character
 set.

 The code would then resemble:

 char* s = strchr(toISO9660("abc"),'a');

No offense, but I really don't have any idea what you're getting at here :)

toStringz takes a char/wchar/dchar array and simply ensures it has a trailing null character. It does not deal with whether the data is encoded in the correct character set.

My knowledge of character encodings is a little shaky, so someone correct me if I'm wrong.. C functions expect a certain encoding, defined by locale information. I believe it can differ on each PC based on the user's language etc. On my PC I'd guess it's Windows-1252 (or something). D's char[] is UTF-8. UTF-8 is byte-compatible with ASCII, and Unicode's repertoire is a superset of both ASCII and Windows-1252, so a char[] can contain characters that cannot be represented in Windows-1252 at all, Chinese characters for example. If a D char[] contains one of these characters and you pass it to a C function, you will get strange results (because it expects Windows-1252 and you're sending it UTF-8).

So, instead of simply ensuring the trailing null is present, why not ensure the encoding is correct also? Thinking about it, I believe we can obtain the locale information and transcode to the local encoding automatically, and/or throw an exception if that's impossible. So, in fact (changing my mind/idea here), we don't need a separate function for each encoding but can do the transcoding to the local encoding automatically inside toStringz. If we want to leave toStringz "as is", we could instead add a toStringLocal function to do the transcoding, and/or a toCString function to do both.

The transcoding requires some sort of character encoding library to be added to phobos. I believe there has been some work done in this area.. converting a C lib? (I forget the name)
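The byte-level mismatch is easy to demonstrate. A small C illustration using "café" as the sample text (my own example, not from the thread):

```c
#include <string.h>

/* "caf\xC3\xA9" is the UTF-8 encoding of "café": the 'é' is the
 * two-byte sequence 0xC3 0xA9. In Windows-1252, 'é' is the single
 * byte 0xE9. Same text, different bytes — a C function given the
 * UTF-8 bytes but expecting Windows-1252 would see "cafÃ©". */
static const char utf8_cafe[]   = "caf\xC3\xA9";
static const char cp1252_cafe[] = "caf\xE9";
```

Both buffers are perfectly null terminated, which is the point: toStringz's trailing null says nothing about whether the bytes are in the encoding the C function expects.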
 Or, if you prefer a less invasive solution...

 In C a "char*" will be null terminated in all but the strangest cases.  
 I'd
 guess 99.9% of cases are null terminated. So, when interfacing with C
 99.9% of the time "char*" instances should be null terminated. I don't  
 see
 much use of "char*" in straight D code, char[] is simply a much better
 choice.

 So, the solution? Well, the one proposed:
  - make T[] to T* illegal.

 Would probably solve the issue. Though a programmer could still change
 "array" to "array.ptr" and remove the error but not the crash.

Yes, but making this illegal:

char[] someString;
someCFunc(someString); // error, no implicit cast from char[] to char*

should be a red flag that says "oh yeah, this is a string, I should probably be doing something to it."  Sure, you could write someString.ptr, but hopefully you'll realize that you need to write toStringz(someString) instead.

Sure, "hopefully", that was my point. I'd prefer something less error prone.
 Consider however byte, long, int or other arrays. These arrays are not
 typically "null terminated" because the value 0 generally has no special
 meaning for these types. In fact when using arrays of these types in C a
 special value is chosen which is outside the range of possible values,  
 if
 that is possible, if not a length is passed with the array.

 So, as a general rule:
  - make T[] to T* illegal.

 has a negative impact on the usability of other array types WRT calling  
 C
 functions. Granted these array types are not used anywhere near as often
 as char.

The frequency that you pass a numerical array to a C function is very low indeed

Yes, like I said.
 , and the only change that would need to be made would be writing
 ".ptr" explicitly. If nothing else, it makes it more obvious that the  
 array is being turned into a C-style "array."

Yes, but that means it's .ptr for non-string types and toStringz for string types. A little inconsistent, though acceptable. I prefer the change to "byte*", and failing that, I'd prefer the implicit conversion from char[] to a null terminated char*.
 So, the solution: If we assume 99.9% of char* cases should be null
 terminated, can we ensure/make 100% of cases null terminated at no (or
 small) negative effect, in other words:
  - make char/wchar/dchar[] to char/wchar/dchar* conversions implicitly
 call toStringz

 This will solve the issue, implicitly, silently, in fact most users  
 won't
 even realise it's being done. That, however is the negative aspect, this
 could cause a reallocation and it's all done silenty.

Most of the time, when passing a string to a C function, you're going to be calling toStringz anyway.

Yes, so make it implicit. I say.
 And about the only time you take advantage of the implicit casting from  
 [w/d]char[] to [w/d]char* is when passing a string to a C function.  So  
 there wouldn't really be any loss in speed, though perhaps a non-trivial  
 operation such as toStringz shouldn't be done implicitly, for the sake  
 of clarity.

Maybe. That is the potential negative aspect of the idea. I wonder what others think.
 How about a .toStringz property for char[]?  ;)

.ptr could give a null terminated char* (i.e. call toStringz on the data) if we want it to be explicit.

Regan
Jul 31 2005