D - BSTR-style array length

Charles Oliver Nutter <Charles_member pathlink.com> writes:
Forgive me if this discussion has been had before; I could not find a way to
search the forums.

Perhaps you are familiar with how COM's BSTR represents string/char arrays.
Whereas a D array is represented as an 8-byte value, with a 32-bit pointer to
the array contents at offset 0 and the 32-bit length at offset 4, a BSTR stores
the length alongside the char (wchar) data itself, in the 4 bytes just before
the address the pointer refers to:

BSTR variable:
Offset:   Contents:
 0        pointer to array data

Array Data:
Offset:   Contents:
 0        first char of array
-4        32 bit length of array

Naturally, this requires additional APIs in Windows for managing BSTRs, but it
provides direct compatibility with existing wchar strings, since the pointer
still points at the first character in the array. It also has the benefit of
remaining a 32 bit value and retaining length information at the pointed-to
memory location.
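
To make the layout concrete, here is a minimal sketch of a length-prefixed string in D. It is an illustration under stated assumptions, not how COM implements BSTR: real BSTRs hold wide characters and are allocated through the system's own BSTR functions, while this version uses plain char and malloc. The helper names makePrefixed, prefixedLength and freePrefixed are invented for the example.

```d
import core.stdc.stdlib : free, malloc;
import core.stdc.string : memcpy;

// Hypothetical helper: allocate [uint length][chars...]['\0'] and return a
// pointer to the first char, so the length sits 4 bytes before the pointer.
char* makePrefixed(const(char)[] s)
{
    auto block = cast(ubyte*) malloc(uint.sizeof + s.length + 1);
    assert(block !is null, "allocation failed");
    *cast(uint*) block = cast(uint) s.length;     // length prefix
    auto data = cast(char*)(block + uint.sizeof);
    memcpy(data, s.ptr, s.length);
    data[s.length] = '\0';                        // stay usable from C
    return data;
}

// Read the length stored just before the character data.
uint prefixedLength(const(char)* data)
{
    return *(cast(const(uint)*) data - 1);
}

// The block starts 4 bytes before the pointer that callers hold.
void freePrefixed(char* data)
{
    free(cast(ubyte*) data - uint.sizeof);
}

void main()
{
    char* p = makePrefixed("something");
    assert(p[0] == 's');                 // points at the first character
    assert(prefixedLength(p) == 9);      // length recovered from offset -4
    freePrefixed(p);
}
```

The variable itself is a single pointer, which is the property the proposal is after; the cost is that the pointer is only meaningful if it came from makePrefixed.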

A similar approach in D could allow its arrays to remain a 32-bit pointer while
retaining the smart aspects. No external API for managing them would be
necessary, since D already handles arrays specially. An additional benefit
would be the ability to pass array pointers into C or other languages and get
them back without losing the length data, since the length is stored at or
"near" the pointed-to location. Currently, passing only the first 32 bits of
the array into a non-D-aware language "kills" the length data when the pointer
comes back out.

Has this been considered?

- Charlie
Apr 19 2004
Russ Lewis <spamhole-2001-07-16 deming-os.org> writes:
I think that things are even simpler than you expect.

D arrays can be implicitly cast to pointers.  So this code works:
	char[] foo = "something";
	char *bar = foo;
The same happens with function calls into C library functions:
	extern(C) int baz(char *arg1);
	int rc = baz(foo);
The baz() function will see a pointer to the 's' character of "something".
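
For readers trying this with a current D compiler, a hedged sketch of the same idea follows. Two things differ from the code above: string literals are now immutable, and the implicit array-to-pointer conversion is no longer accepted, so the pointer is requested with .ptr. Here strlen from the C library stands in for baz(), which is not a real function.

```d
import core.stdc.string : strlen;

void main()
{
    // String literals are immutable in current D and are followed in memory
    // by a '\0' that is not counted in .length.
    string foo = "something";

    // Ask for the pointer explicitly instead of relying on a conversion.
    const(char)* bar = foo.ptr;
    assert(*bar == 's');

    // C sees the same characters, up to the hidden terminator.
    assert(strlen(bar) == foo.length);
}
```

The C side still receives nothing but the address of the 's'.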

This works pretty well, but you have to remember a few things:

1) D strings are not null terminated.

Since we have a built-in length field, D strings do not need to be 
null-terminated.  So, in most cases, you need to add a null terminator 
to the string before sending it to C.  The standard library function:
	import std.string;
	char *toStringz(char[]);
will do this for you.  So if you don't know whether your string is null 
terminated or not, you would call it like this:
	baz(toStringz(foo));

FYI: The compiler is smart enough to add a null after all constant 
strings.  (This null does NOT count toward the total length of the 
string.)  So, if you are passing an argument which is a constant string 
(not something you've generated at runtime), then toStringz() is 
unnecessary.  You can call it, but it won't do anything because it will 
see that there is a null out just past the "end" of the string.

2) D doesn't reset the length if C modifies the string.

So if C changes the length, your D code will have to account for that by 
hand.  The easiest way to do this is to reinitialize the array using the 
slice syntax and the strlen() function:
	import std.string; // gives you extern(C) int strlen(char*)
	int rc = baz(foo);
	foo = foo[0..strlen(foo)];
But you would have to do something like that with BSTR as well.
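
Both points can be sketched with current D as follows. This is an illustration, not the original poster's code: strlen and strcpy from the C library stand in for baz(), and the slice-of-a-literal setup is just a convenient way to get a D string that is not followed by a null.

```d
import core.stdc.string : strcpy, strlen;
import std.string : toStringz;

void main()
{
    // 1) A slice is generally not followed by '\0'; toStringz supplies one.
    string sentence = "hello world";
    string word = sentence[0 .. 5];           // "hello", then ' ' in memory
    assert(strlen(word.ptr) == 11);           // C runs on into " world"
    assert(strlen(toStringz(word)) == 5);     // properly terminated

    // 2) C can shorten the text, but only D code can update the length.
    char[32] storage = '\0';                  // buffer handed to C
    char[] foo = storage[];                   // D view: length is 32
    strcpy(foo.ptr, "short");                 // C writes 5 chars and a '\0'
    assert(foo.length == 32);                 // D's length is unchanged

    foo = foo[0 .. strlen(foo.ptr)];          // trim to what C wrote
    assert(foo == "short");
}
```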

Apr 19 2004
Charles Oliver Nutter <headius headius.com> writes:
The case I had in mind, where the length gets "lost", would arise with D
strings but not with a BSTR:

char[256] mystring;
mystring = some_c_function_that_returns_char_pointer(mystring);

At this point, since we've passed 32 bits in and gotten 32 bits out, 
we've lost the length part of the variable. Perhaps this assignment is 
not even legal?

If the length was stored at the memory location pointed to by the 
variable, rather than in the variable itself, the length would never be 
lost simply by passing the value around.

There may not be a lot of functions that return a pointer to unmodified 
memory contents, but perhaps this illustrates where the BSTR method of 
length management avoids a possible problem.
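
A rough sketch of that round trip in current D, using the C library's strchr as a stand-in for some_c_function_that_returns_char_pointer (strchr returns a pointer into the buffer rather than the same pointer, which is an assumption made only to have something real to call). The pointer that comes back carries no length, so the caller has to rebuild the array from bounds it already knows.

```d
import core.stdc.string : strchr;

void main()
{
    char[256] mystring = '\0';
    mystring[0 .. 7] = "key=val";

    // strchr returns a bare pointer into mystring; no length comes back.
    char* p = strchr(mystring.ptr, '=');
    assert(p !is null);

    // A char* cannot be assigned to a char[256]. To get an array again, the
    // slice is rebuilt by hand from the end of the original buffer.
    char[] rest = p[0 .. mystring.ptr + mystring.length - p];
    assert(rest.length == 256 - 3);
    assert(rest[0 .. 4] == "=val");
}
```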

I have another question regarding point 1 of your reply (D strings are not
null terminated). I'm sure this has been discussed, but why not null-terminate
strings? Considering the vast number of APIs that expect strings to be null
terminated, wouldn't it make sense? I can appreciate wanting all arrays to be
uniform, but anyone using D would never have to see the difference between char
arrays and all other arrays, and existing APIs that expect terminating null
chars would work without quirks.

- Charlie

Apr 19 2004
"Walter" <walter digitalmars.com> writes:
"Charles Oliver Nutter" <headius headius.com> wrote in message
news:c61hok$1teo$1 digitaldaemon.com...
 If the length was stored at the memory location pointed to by the
 variable, rather than in the variable itself, the length would never be
 lost simply by passing the value around.

Yes, it would, because there's no way to tell if whatever happens to be before the string data is a valid length or a random bit pattern.
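
A small sketch of why that is, assuming nothing beyond the C library's memcpy: a pointer into the middle of an ordinary array looks exactly like any other char*, and whatever sits in the four bytes before it gets read as the "length".

```d
import core.stdc.string : memcpy;

void main()
{
    char[] text = "0123456789".dup;

    // A perfectly ordinary pointer into the middle of an array. Nothing
    // about it says whether a length prefix precedes it.
    char* p = text.ptr + 4;

    // Treating the 4 bytes before p as a length just reinterprets the
    // characters "0123" as an integer.
    uint bogus;
    memcpy(&bogus, p - 4, uint.sizeof);

    assert(bogus != text.length);   // a bit pattern, not the real length
}
```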

 I have another question regarding #1 below: I'm sure this has been
 discussed, but why not null-terminate strings? Considering the vast
 number of APIs that expect strings to be null terminated, wouldn't it
 make sense? I can appreciate wanting all arrays to be uniform, but
 anyone using D would never have to see the difference between char
 arrays and all other arrays, and existing APIs that expect terminating
 null chars would work without quirks.

Null terminated strings cannot be sliced, and cause problems representing binary data.
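
Both halves of that sentence can be seen in a short sketch (the request line is only sample data). A slice is a pointer and a length over the original memory, so the character that follows it is whatever comes next in the parent string, and a length lets a zero byte be ordinary data.

```d
void main()
{
    string line = "GET /index.html HTTP/1.0";

    // Slices are (pointer, length) pairs over the same memory.
    string method = line[0 .. 3];
    string path   = line[4 .. 15];

    assert(method == "GET" && path == "/index.html");
    assert(method.ptr is line.ptr);           // no copy was made
    assert(path.ptr is line.ptr + 4);

    // What follows each slice is a space, not '\0', so neither could be a
    // null-terminated string without copying it or editing `line`.
    assert(line[method.length] == ' ');
    assert(*(path.ptr + path.length) == ' ');

    // With an explicit length, zero bytes are just data.
    ubyte[] binary = [1, 0, 2, 0];
    assert(binary.length == 4);
}
```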
Apr 19 2004
"Robert Atkinson" <z1zg1 NO.SPAM.unb.ca> writes:
Why not accommodate both?
Have all strings allocate an extra char internally, and always have it be
\0?

For D's purposes, the BSTR length property would be used, and the extra char
ignored, but it would always be there for calling C functions expecting a
zero terminated string.

- Rob
Apr 19 2004
"Walter" <walter digitalmars.com> writes:
"Robert Atkinson" <z1zg1 NO.SPAM.unb.ca> wrote in message
news:c61n39$27o5$1 digitaldaemon.com...
 Why not accommodate both?

There's no reason why one cannot program in D right now using char*'s and null terminated strings.

 Have all strings allocate an extra char internally, and always have it be
 \0?

This is already done for static string literals.

 For D's purposes, the BSTR length property would be used, and the extra char
 ignored, but it would always be there for calling C functions expecting a
 zero terminated string.

One cannot slice a string and put a null at the end without corrupting memory. But if you stick with char*'s and the C string library functions, it will work fine with 0 terminated strings.
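
A minimal sketch of the corruption being described, assuming only the C library's strlen: terminating a slice in place means writing one element past its end, and that element belongs to the string it was sliced from.

```d
import core.stdc.string : strlen;

void main()
{
    char[] whole = "hello world".dup;
    char[] first = whole[0 .. 5];          // "hello", sharing whole's memory

    // Null-terminating the slice in place writes to whole[5].
    *(first.ptr + first.length) = '\0';

    assert(whole.length == 11);            // D still sees 11 chars...
    assert(whole[5] == '\0');              // ...but the space is gone
    assert(strlen(whole.ptr) == 5);        // and C now sees only "hello"
}
```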
Apr 19 2004
"Walter" <walter digitalmars.com> writes:
"Charles Oliver Nutter" <Charles_member pathlink.com> wrote in message
news:c61fq7$1pim$1 digitaldaemon.com...
 A similar approach in D could allow its arrays to remain a 32 bit pointer while
 retaining the smart aspects. An external API for managing such would not be
 necessary, since D already handles arrays specially. The additional benefit here
 would be the ability to pass array pointers into C/others and get them back
 without losing the length data (since the length is also stored at or "near" the
 pointer location), where currently passing only the first 32 bits of the array
 pointer into a non-D-aware language will "kill" the length data when it comes
 back out.

 Has this been considered?

Yes. The good reasons for it are as you stated. The reason against it is that such a technique does not allow for array 'slicing'. Another problem is that a C string cannot be converted into a BSTR without copying the entire string.
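
The copying cost can be sketched as follows, under the same assumptions as any hand-rolled length-prefixed layout (plain char, malloc, and a made-up helper named toPrefixed, none of which are COM's actual BSTR API). A D array can wrap memory that C already owns, whereas a length prefix needs room in front of the data, so every character has to move.

```d
import core.stdc.stdlib : free, malloc;
import core.stdc.string : memcpy, strlen;

// Hypothetical conversion to a length-prefixed layout: allocate a new block
// and copy the whole string into it, after the 4-byte length.
char* toPrefixed(const(char)* cstr)
{
    immutable n = strlen(cstr);
    auto block = cast(ubyte*) malloc(uint.sizeof + n + 1);
    assert(block !is null, "allocation failed");
    *cast(uint*) block = cast(uint) n;
    auto data = cast(char*)(block + uint.sizeof);
    memcpy(data, cstr, n + 1);               // the text and its '\0'
    return data;
}

void main()
{
    const(char)* cstr = "from C";

    // D array: wrap the existing memory; nothing is copied.
    const(char)[] view = cstr[0 .. strlen(cstr)];
    assert(view.ptr is cstr && view.length == 6);

    // Length-prefixed string: a fresh allocation and a full copy.
    char* prefixed = toPrefixed(cstr);
    assert(prefixed !is cstr);
    assert(*(cast(uint*) prefixed - 1) == 6);
    free(cast(ubyte*) prefixed - uint.sizeof);
}
```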
Apr 19 2004