D - Automatic Safe and Efficient Sz-ing

"Matthew Wilson" <dmd synesis.com.au> writes:
An idea I had in my sleep, so please forgive if I've overlooked some huge
obvious beastie.

When interfacing a character array to a C API expecting a null-terminated
string (btw, I'm with Mark in thinking we should have a separate string
class, but have not amassed my ammunition so am not looking to engage in
that debate yet), we have the options of

- not terminating - crash!
- terminating in the array via ~= (char)0;
- using toStringz() which seems from the implementation to contain most of
my sleepytime ideas for an efficient placement of a terminating null. Gah!
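
For reference, a minimal sketch of the second and third options as they might look against current druntime and Phobos (an assumption; the 2003 library differed, and the helper names lengthViaAppend and lengthViaToStringz are made up for illustration). Both hand strlen a pointer that is guaranteed to be terminated, even when the argument is a slice of a longer array:

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Option 2 from the list: append the terminator to a copy of the array,
// then hand the C function a pointer to the first element.
// (Hypothetical helper name, for illustration only.)
size_t lengthViaAppend(const(char)[] s)
{
    char[] buf = s.dup;   // copy, so the caller's array is left untouched
    buf ~= '\0';          // explicit terminator
    return strlen(buf.ptr);
}

// Option 3: let toStringz produce a zero-terminated string.
// (Hypothetical helper name, for illustration only.)
size_t lengthViaToStringz(const(char)[] s)
{
    return strlen(toStringz(s));
}
```

With either helper, passing "hello, world"[0 .. 5] gives 5, whereas handing the slice's raw .ptr to strlen would read past the end of the slice.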

Nonetheless, I was wondering whether there was some way of making this call
implicit, perhaps in the declaration of the C function.

For example, strlen is declared thus

extern (C)
{
    int strlen(char *);
}

Would it be a nice thing to declare it

extern (C)
{
    int strlen(char null *);
}

and the D compiler would insert a call to toStringz() automatically?

Sure, there is an efficiency argument against it, but I suspect most such C
calls that expect a zero-terminated string (ZTS) have to involve some similar
treatment anyway.

And really, the null decorator would not mean that "the compiler must call
toStringz", rather it could mean that "the compiler must ensure that the
string is zero-terminated". Hence the compiler would be free to optimise out
such a call where it is dealing with a literal, or static, or something that
it's already established is null terminated. For example, the code

void blah(char[] s)
{
    int    len1    =    strlen(s);
    int    len2    =    strlen(s);
}

could be translated to

void blah(char[] s)
{
    // terminate once, then reuse the pointer for both calls
    char*    s_zt    =    toStringz(s);
    int      len1    =    strlen(s_zt);
    int      len2    =    strlen(s_zt);
}

This would eradicate many of the problems that are likely to bite people
interfacing to C code, without in any way adding a cost to "pure" D.
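
Until something like this exists in the compiler, the effect can be approximated by hand. A rough sketch, assuming current Phobos and druntime; strlenZ is a hypothetical wrapper name, not an existing library function, and blah is changed to return its two lengths so the result can be inspected:

```d
import core.stdc.string : cStrlen = strlen;
import std.string : toStringz;

// Hypothetical wrapper: accepts a D slice and guarantees termination before
// calling the C function, roughly what a `char null *` parameter would ask
// the compiler to arrange.
size_t strlenZ(const(char)[] s)
{
    return cStrlen(toStringz(s));
}

// Hand-hoisted version of blah(): the conversion happens once and the
// terminated pointer is reused for both calls.
size_t[2] blah(const(char)[] s)
{
    const(char)* s_zt = toStringz(s);
    size_t len1 = cStrlen(s_zt);
    size_t len2 = cStrlen(s_zt);
    return [len1, len2];
}
```

The wrapper pays for a conversion on every call; the hoisted form is the optimisation the proposal would leave to the compiler.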

Any takers?

Matthew
Mar 31 2003
Bill Cox <bill viasic.com> writes:
Hi, Matthew.

 When interfacing a character array (btw, I'm with Mark in thinking we should
 have a separate string class, but have not amassed my ammunition so am not
 looking to engage in that debate yet)

This is a rare occasion when I agree with Mark. The fact that a minimalist
like me, and a maximalist like Mark, and a pragmatist like yourself seem to
agree is something Walter should consider.

I would want to hold built-in string support to just UTF-8. D could offer
some support for the other formats through conversion routines in a standard
library. Having a single string format would surely be simpler than
supporting them all.

Bill
Mar 31 2003
"Matthew Wilson" <dmd synesis.com.au> writes:
:)

Pragmatist is a lot more of a compliment than what I usually get: pedant.

Yes, the string stuff is highly toxic in C, C++ and (it seems) D. I am also,
however, wary of building in support for inefficient (in terms of speed, not
size) variable character length encoding schemes.

Is there a reason why UCS-32 (or is that UTF-32 - I need to go and digest
all that awful gunk again and get my terminology back up to speed), a la
wchar_t, Java, .NET, is not sufficient?

I know that 65536 doesn't cover all the bases of _all_ languages, but it is
nevertheless used as a "complete" solution by so many languages, so is it a
case of "near enough is good enough"? Dunno, seems Mark's much more of an
expert, so hopefully he can enlighten me on that one.


Anyway, Bill, everyone, do you like the "char null *" idea?
- Doesn't introduce another keyword.
- Surely not hard to parse.
- Improves robustness.
- Doesn't add operations that would not have to be done anyway.
- Leaves it all to compiler's best discretion, so plenty of chances for
being _faster_ than leaving it up to user, which seems to be a theme of D,
where achievable.

Sure, fire away, but I think we should have it running for parliament. ;)

Percy the pragmatist
Mar 31 2003
"Matthew Wilson" <dmd synesis.com.au> writes:
Correction: meant UCS-/UTF-16, not 32
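
To put a number on the 16-bit question, here is a small sketch, assuming the current Phobos function std.utf.codeLength (the helper names utf8Units and utf16Units are hypothetical). It reports how many code units one character needs, which shows that 16-bit units are still variable length for characters above U+FFFF:

```d
import std.utf : codeLength;

// Number of 8-bit code units needed to encode one character as UTF-8.
// (Hypothetical helper name.)
size_t utf8Units(dchar c)
{
    return codeLength!char(c);
}

// Number of 16-bit code units needed to encode one character as UTF-16.
// (Hypothetical helper name.)
size_t utf16Units(dchar c)
{
    return codeLength!wchar(c);
}
```

Under these assumptions, 'A' needs one unit in either encoding, while U+1D11E (a musical symbol outside the first 65536 code points) needs two 16-bit units or four 8-bit ones.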
Mar 31 2003
Bill Cox <Bill_member pathlink.com> writes:
In article <b6aph7$1dbp$1 digitaldaemon.com>, Matthew Wilson says...
Anyway, Bill, everyone, do you like the "char null *" idea?
- Doesn't introduce another keyword.
- Surely not hard to parse.
- Improves robustness.
- Doesn't add operations that would not have to be done anyway.
- Leaves it all to compiler's best discretion, so plenty of chances for
being _faster_ than leaving it up to user, which seems to be a theme of D,
where achievable.

From a user point of view, I like the char null*. The single most common
"Help! I've crashed my simple D program" post on this newsgroup seems to have
to do with the terminating null, and how it interacts with character array
slicing. It'd be nice to help clear that one up. I don't know how hard the
support would be. It'd have to be pretty hard to amount to more of Walter's
time than dealing with the confused D users.

Bill
Mar 31 2003
"Matthew Wilson" <dmd synesis.com.au> writes:
 It'd be nice to help clear that one up.  I don't know how hard the support
 would be.  It'd have to be pretty hard to amount to more of Walter's time
 than dealing with the confused D users.

Good point. Maybe you've invented a new, and quite definitive, metric for
measuring the worth of D changes. :)

Walter?
Mar 31 2003
Burton Radons <loth users.sourceforge.net> writes:
Bill Cox wrote:
 In article <b6aph7$1dbp$1 digitaldaemon.com>, Matthew Wilson says...
 
Anyway, Bill, everyone, do you like the "char null *" idea?
- Doesn't introduce another keyword.
- Surely not hard to parse.
- Improves robustness.
- Doesn't add operations that would not have to be done anyway.
- Leaves it all to compiler's best discretion, so plenty of chances for
being _faster_ than leaving it up to user, which seems to be a theme of D,
where achievable.

From a user point of view, I like the char null*. The single most common "Help!, I've crashed my simple D program" post on this newsgroup seems to have to do with the terminating null, and how it interacts with character array slicing.

The problems of newbies are eminently ignorable. It's the problems of people
who are indoctrinated that are worth looking into; they're the ones who are
going to be running into it in the years following.

About the issue itself, uh... it's a good match for D (as set out at the top
of the Phobos page), but it's not a good match for what I want D to be. I
don't like referring to C functions directly, because of incompatible
signatures, lack of exceptions, weird overloading, and extreme operating
system variations in Unices - for example, sometimes errno is a symbol,
sometimes it's a macro calling a function. Purifying this variability is the
first task of cross-platform work, which I do quite a lot of, and char* is
one small factor of the problem. So altogether there's no win in it for me.

toStringz shows up 38 times in the interface library dig, 0 times in the
client program dedit. That's the way it should be.
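
As a rough illustration of that layering (not code from dig; the function name environmentValue is invented here, and the calls assume current druntime and Phobos), an interface-library function can keep toStringz and the null check on the inside, so that client code only ever sees D strings and exceptions:

```d
import core.stdc.stdlib : getenv;
import std.string : fromStringz, toStringz;

// Hypothetical interface-library function: the only place that touches char*.
// Clients pass a D slice and get a D string back, or an exception.
string environmentValue(const(char)[] name)
{
    // Terminate the name for the C call.
    const(char)* raw = getenv(toStringz(name));

    // Turn C's "null means not found" convention into an exception.
    if (raw is null)
        throw new Exception("environment variable not set: " ~ name.idup);

    // Copy the C string into garbage-collected D memory.
    return fromStringz(raw).idup;
}
```

A client would simply call environmentValue("HOME") and never mention toStringz, which is the split between dig and dedit described above.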
Apr 01 2003
"Matthew Wilson" <dmd synesis.com.au> writes:
I don't get your point.

Without a DNI (i.e. D Native Interface) with which to approach D from the
underside, we are forced to have C compatibility within D itself, touching C
from the upperside, if you like. Frankly (and I guess this is because I'm a
pragmatist, eh Bill?) I don't care which it is, but I do think it's
important to maximise robustness wherever possible without causing any
significant degradation of performance.

As I've argued, this feature would most certainly increase robustness and
would also likely increase performance (quality of compiler optimisations
allowing).

You say that toStringz() shows up 38 times in your code, and then say
there's no win in it for you. This seems contradictory. Have I misunderstood
your post?

As for the "purity" of D, I'll have to leave that to those of a more
philosophical bent. I'd offer this thought, though: I have a friend who
works on the Solaris kernel team, and he tells me they're not thinking of
going C++ or Java or anything else other than C, for the "foreseeable
future" (which is a long time, I think). Walter's created a language to
supersede C (among others), but has wisely put C compatibility into it. It
being the case that C compatibility is built in to D, I cannot see the sense
in denying ourselves more robustness and efficiency for free, just because
it's less pure.

Apr 01 2003
Mark Evans <Mark_member pathlink.com> writes:
Matthew, please post in the other thread if you want me to respond.  That's
why I started it.

Mark
Mar 31 2003
"Matthew Wilson" <dmd synesis.com.au> writes:
Can't remember which bit of which post applies to which thread. Verbal
diarrhoea, I'm afraid.
Mar 31 2003
"Walter" <walter digitalmars.com> writes:
"Bill Cox" <bill viasic.com> wrote in message
news:3E88BE91.6010403 viasic.com...
 I would want to hold built-in string support to just UTF-8.  D could
 offer some support for the other formats through conversion routines in
 a standard library.  Having a single string format would surely be
 simpler than supporting them all.

That's the direction D is going.
Mar 31 2003
"Walter" <walter digitalmars.com> writes:
"Bill Cox" <bill viasic.com> wrote in message
news:3E88BE91.6010403 viasic.com...
 I would want to hold built-in string support to just UTF-8.  D could
 offer some support for the other formats through conversion routines in
 a standard library.  Having a single string format would surely be
 simpler than supporting them all.

The next release will provide a module to do all the conversions.
May 24 2003
Mark Evans <Mark_member pathlink.com> writes:
The next release will provide a module to do all the conversions.

What it won't provide are manipulation routines for the results. Conversions
aren't enough; one wants a consistent design that treats strings the same no
matter their encoding.

Mark
May 25 2003