D - Automatic Safe and Efficient Sz-ing

"Matthew Wilson" <dmd synesis.com.au> writes:

An idea I had in my sleep, so please forgive if I've overlooked some huge
obvious beastie.

When interfacing a character array (btw, I'm with Mark in thinking we should
have a separate string class, but have not amassed my ammunition so am not
looking to engage in that debate yet) to a C API expecting a null-terminated
string, we have the options of

- not terminating - crash!
- terminating in the array via ~= (char)0;
- using toStringz(), which seems from the implementation to contain most of
  my sleepytime ideas for an efficient placement of a terminating null. Gah!
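
For readers who want to see the second and third options side by side, here is a minimal sketch. It assumes a present-day D compiler with Phobos (std.string.toStringz and core.stdc.string.strlen), not the 2003 toolchain discussed in this thread, and the helper name withTerminator is made up for illustration.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Option 2 from the list above: append the terminator to the array itself.
// Hypothetical helper name; the returned array is one element longer.
char[] withTerminator(char[] s)
{
    s ~= '\0';
    return s;
}

void main()
{
    char[] name = "abc".dup;

    // Option 2: the terminator becomes part of the D array.
    char[] z = withTerminator(name.dup);
    assert(z.length == 4);
    assert(strlen(z.ptr) == 3);

    // Option 3: toStringz hands C a terminated pointer and
    // leaves the D array as it was.
    assert(strlen(toStringz(name)) == 3);
    assert(name.length == 3);
}
```

The visible difference is that option 2 changes the length seen by D code, while toStringz keeps the terminator out of the array's logical contents.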

Nonetheless, I was wondering whether there was some way of making this call
implicit, perhaps in the declaration of the C function.

For example, strlen is declared thus

extern (C)
{
    int strlen(char *);
}

Would it be a nice thing to declare it

extern (C)
{
    int strlen(char null *);
}

and the D compiler would insert a call to toStringz() automatically?

Sure, there is an efficiency argument against it, but I suspect most such C
calls that expect zero-terminated strings (ZTS) have to involve some similar
treatment anyway.

And really, the null decorator would not mean that "the compiler must call
toStringz", rather it could mean that "the compiler must ensure that the
string is zero-terminated". Hence the compiler would be free to optimise out
such a call where it is dealing with a literal, or static, or something that
it's already established is null terminated. For example, the code

void blah(char[] s)
{
    int    len1    =    strlen(s);
    int    len2    =    strlen(s);
}

Could be translated to

void blah(char[] s)
{
    char*     s_zt    =    toStringz(s);    // toStringz returns a char*, not a char[]
    int        len1    =    strlen(s_zt);
    int        len2    =    strlen(s_zt);
}

This would eradicate many of the problems that are likely to bite people
interfacing to C code, without in any way adding a cost to "pure" D.
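
Until something like the proposed decorator exists, roughly the same effect can be had by hand with a thin wrapper. The sketch below assumes present-day D and Phobos; strlenZ is a hypothetical name, not a Phobos function or anything proposed in this thread.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical wrapper: does by hand what "char null *" would ask the
// compiler to do, i.e. guarantee termination before the C call.
size_t strlenZ(const(char)[] s)
{
    return strlen(toStringz(s));
}

void main()
{
    char[] buf = "abcdef".dup;

    // A slice from the middle of a buffer has no terminator of its own,
    // yet the wrapper still gives the expected answer.
    assert(strlenZ(buf[1 .. 4]) == 3);
    assert(strlenZ("") == 0);
}
```

Unlike the compiler-assisted version, a wrapper like this terminates on every call; it cannot hoist the conversion out of repeated calls the way the translated blah() above does.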

Any takers?

Matthew

Mar 31 2003

Bill Cox <bill viasic.com> writes:

Hi, Matthew.

 When interfacing a character array (btw, I'm with Mark in thinking we should
 have a separate string class, but have not amassed my ammunition so am not
 looking to engage in that debate yet)

This is a rare occasion when I agree with Mark.  The fact that a 
minimalist like me, and a maximalist like Mark, and a pragmatist like 
yourself seem to agree is something Walter should consider.

I would want to hold built-in string support to just UTF-8.  D could 
offer some support for the other formats through conversion routines in 
a standard library.  Having a single string format would surely be 
simpler than supporting them all.

Bill

Mar 31 2003

"Matthew Wilson" <dmd synesis.com.au> writes:

:)

Pragmatist is a lot more of a compliment than what I usually get: pedant.

Yes, the string stuff is highly toxic in C, C++ and (it seems) D. I am also,
however, wary of building in support for inefficient (in terms of speed, not
size) variable character length encoding schemes.

Is there a reason why UCS-32 (or is that UTF-32 - I need to go and digest
all that awful gunk again and get my terminology back up to speed), a la
wchar_t, Java, .NET, is not sufficient?

I know that 65536 doesn't cover all the bases of _all_ languages, but it is
nevertheless used as a "complete" solution by so many languages, so is it a
case of "near enough is good enough"? Dunno, seems Mark's much more of an
expert, so hopefully he can enlighten me on that one.


Anyway, Bill, everyone, do you like the "char null *" idea?
- Doesn't introduce another keyword.
- Surely not hard to parse.
- Improves robustness.
- Doesn't add operations that would not have to be done anyway.
- Leaves it all to compiler's best discretion, so plenty of chances for
being _faster_ than leaving it up to user, which seems to be a theme of D,
where achievable.

Sure, fire away, but I think we should have it running for parliament. ;)

Percy the pragmatist


Mar 31 2003

"Matthew Wilson" <dmd synesis.com.au> writes:

Correction: meant UCS-/UTF-16, not 32
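
As a side note on the 65536 figure: a 16-bit code unit cannot hold every Unicode character on its own, so UTF-16 is itself a variable-length encoding. A minimal sketch in present-day D (whose char, wchar and dchar literals are UTF-8, UTF-16 and UTF-32 respectively), using U+1D11E as an arbitrary character outside the first 65536:

```d
void main()
{
    // The same single character, U+1D11E, in D's three encodings.
    string  s8  = "\U0001D11E";    // UTF-8
    wstring s16 = "\U0001D11E"w;   // UTF-16
    dstring s32 = "\U0001D11E"d;   // UTF-32

    assert(s8.length == 4);    // four 8-bit code units
    assert(s16.length == 2);   // a surrogate pair: two 16-bit code units
    assert(s32.length == 1);   // one 32-bit code unit
}
```

Only the 32-bit form gives one array element per character here, which is the speed-versus-size trade-off raised above.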


Mar 31 2003

Bill Cox <Bill_member pathlink.com> writes:

In article <b6aph7$1dbp$1 digitaldaemon.com>, Matthew Wilson says...
Anyway, Bill, everyone, do you like the "char null *" idea?
- Doesn't introduce another keyword.
- Surely not hard to parse.
- Improves robustness.
- Doesn't add operations that would not have to be done anyway.
- Leaves it all to compiler's best discretion, so plenty of chances for
being _faster_ than leaving it up to user, which seems to be a theme of D,
where achievable.

From a user point of view, I like the char null*.  The single most common
"Help! I've crashed my simple D program" post on this newsgroup seems to have
to do with the terminating null, and how it interacts with character array
slicing.

It'd be nice to help clear that one up.  I don't know how hard the support would
be.  It'd have to be pretty hard to amount to more of Walter's time than dealing
with the confused D users.
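
The slicing interaction can be reproduced in a few lines. This is a sketch against present-day D and Phobos, with a buffer that carries an explicit terminator at its end so that the behaviour is well defined:

```d
import core.stdc.string : strlen;
import std.string : toStringz;

void main()
{
    // A buffer with an explicit terminator after "hello world".
    char[] buf = "hello world\0".dup;

    // The slice knows its own length, but C only sees the pointer.
    char[] word = buf[0 .. 5];
    assert(word.length == 5);

    // C scans on to the buffer's terminator, well past the slice.
    assert(strlen(word.ptr) == 11);

    // toStringz supplies a terminator at the slice boundary.
    assert(strlen(toStringz(word)) == 5);
}
```

Without a terminator somewhere in the buffer, the strlen(word.ptr) call would run off the end of the allocation, which is the crash these posts describe.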

Bill

Mar 31 2003

"Matthew Wilson" <dmd synesis.com.au> writes:

 It'd be nice to help clear that one up.  I don't know how hard the support would
 be.  It'd have to be pretty hard to amount to more of Walter's time than dealing
 with the confused D users.

Good point. Maybe you've invented a new, and quite definitive, metric for
measuring the worth of D changes. :)

Walter ?

Mar 31 2003

Burton Radons <loth users.sourceforge.net> writes:

Bill Cox wrote:
 In article <b6aph7$1dbp$1 digitaldaemon.com>, Matthew Wilson says...
 
Anyway, Bill, everyone, do you like the "char null *" idea?
- Doesn't introduce another keyword.
- Surely not hard to parse.
- Improves robustness.
- Doesn't add operations that would not have to be done anyway.
- Leaves it all to compiler's best discretion, so plenty of chances for
being _faster_ than leaving it up to user, which seems to be a theme of D,
where achievable.

 From a user point of view, I like the char null*.  The single most common
 "Help!, I've crashed my simple D program" post on this newsgroup seems to have
 to do with the terminating null, and how it interacts with character array
 slicing.

The problems of newbies are eminently ignorable.  It's the problems of
people who are indoctrinated that are worth looking into; they're the
ones who are going to be running into it in the years following.

About the issue itself, uh... it's a good match for D (as set out at the
top of the Phobos page), but it's not a good match for what I want D to be.
I don't like referring to C functions directly, because of
incompatible signatures, lack of exceptions, weird overloading, and
extreme operating system variations in Unices - for example, sometimes
errno is a symbol, sometimes it's a macro calling a function.  Purifying
this variability is the first task of cross-platform work, which I do
quite a lot of, and char* is one small factor of the problem.

So altogether there's no win in it for me.  toStringz shows up 38 times
in the interface library dig, 0 times in the client program dedit.
That's the way it should be.
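
The layering described here might look something like the sketch below: one interface function owns the toStringz call and turns the C error convention into an exception, so client code never touches char* or errno. It assumes present-day D and Phobos (std.exception.errnoEnforce and ErrnoException); openForRead and the file path are hypothetical names for illustration and are not taken from dig or dedit.

```d
import core.stdc.stdio : FILE, fopen;
import std.exception : ErrnoException, assertThrown, errnoEnforce;
import std.string : toStringz;

// Hypothetical interface-layer function: the only place that deals with
// null termination and errno. Callers pass a D string and get exceptions.
FILE* openForRead(const(char)[] path)
{
    return errnoEnforce(fopen(toStringz(path), "r"),
                        "cannot open " ~ path.idup);
}

void main()
{
    // Client code: no toStringz, no errno, just a D call that may throw.
    assertThrown!ErrnoException(
        openForRead("no-such-dir-for-this-sketch/missing.txt"));
}
```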

Apr 01 2003

"Matthew Wilson" <dmd synesis.com.au> writes:

I don't get your point.

Without a DNI (i.e. D Native Interface) with which to approach D from the
underside, we are forced to have C compatibility within D itself, touching C
from the upperside, if you like. Frankly (and I guess this is because I'm a
pragmatist, eh Bill?) I don't care which it is, but I do think it's
important to maximise robustness wherever possible without causing any
significant degradation of performance.

As I've argued, this feature would most certainly increase robustness and
would also likely increase performance (quality of compiler optimisations
allowing).

You say that toStringz() shows up 38 times in your code, and then say
there's no win in it for you. This seems contradictory. Have I misunderstood
your post?

As for the "purity" of D, I'll have to leave that to those of a more
philosophical bent. I'd offer this thought, though: I have a friend who
works on the Solaris kernel team, and he tells me they're not thinking of
going C++ or Java or anything else other than C, for the "foreseeable
future" (which is a long time, I think). Walter's created a language to
supersede C (among others), but has wisely put C compatibility into it. It
being the case that C compatibility is built in to D, I cannot see the sense
in denying ourselves more robustness and efficiency for free, just because
it's less pure.



Apr 01 2003

Mark Evans <Mark_member pathlink.com> writes:

Matthew please post in the other thread if you want me to respond.  That's why I
started it.

Mark

Mar 31 2003

"Matthew Wilson" <dmd synesis.com.au> writes:

Can't remember which bit of which post applies to which thread. Verbal
diarrhoea, I'm afraid.


Mar 31 2003

"Walter" <walter digitalmars.com> writes:

"Bill Cox" <bill viasic.com> wrote in message
news:3E88BE91.6010403 viasic.com...
 I would want to hold built-in string support to just UTF-8.  D could
 offer some support for the other formats through conversion routines in
 a standard library.  Having a single string format would surely be
 simpler than supporting them all.

That's the direction D is going.

Mar 31 2003

"Walter" <walter digitalmars.com> writes:

"Bill Cox" <bill viasic.com> wrote in message
news:3E88BE91.6010403 viasic.com...
 I would want to hold built-in string support to just UTF-8.  D could
 offer some support for the other formats through conversion routines in
 a standard library.  Having a single string format would surely be
 simpler than supporting them all.

The next release will provide a module to do all the conversions.

May 24 2003

Mark Evans <Mark_member pathlink.com> writes:

 The next release will provide a module to do all the conversions.

What it won't provide are manipulation routines for the results.  Conversions
aren't enough, one wants a consistent design that treats strings the same no
matter their encoding.

Mark

May 25 2003
