D - == for char[]

D - == for char[] - broken

Matthew Wilson (47/47) Oct 11 2003 This is something that's going to come up again and again.

Charles Sanders (8/55) Oct 11 2003 Eeep! I defintly think this deserves top priority. Although Im a littl...

Matthew Wilson (21/91) Oct 11 2003 I have no idea. It will be something inside the exception infrastructure

Walter (47/96) Oct 11 2003 What is happening here is trying to simultaneously use two different

Matthew Wilson (47/152) Oct 11 2003 Walter

Matthew Wilson (27/197) Oct 11 2003 Gah!

Matthew Wilson (36/254) Oct 11 2003 Yep, that was it. Thank goodness we can slice a C array!

Walter (6/9) Oct 11 2003 the

Hauke Duden (23/30) Oct 12 2003 C-array

Walter (11/30) Oct 12 2003 3rd

Hauke Duden (13/15) Oct 12 2003 That's not what I meant. The main problem is that D strings are by defau...

Walter (26/41) Oct 12 2003 I see what you mean now, but that's not strictly true. The null terminat...

Hauke Duden (28/49) Oct 12 2003 termination

Walter (28/198) Oct 11 2003 You *are* allocating extra storage - the

"Matthew Wilson" <matthew stlsoft.org> writes:

This is something that's going to come up again and again.

I'm just writing some unittests for the registry module, along the lines of

    // (ii) Catch that can throw and be caught by Exception
    {
        char[]  message =   "Test 2";
        int     code    =   3;
        char[]  string  =   "Test 2 (3)";

        try
        {
            throw new Win32Exception("Test 2", code);
        }
        catch(Exception x)
        {
            if(string != x.toString())
            {
                printf( "UnitTest failure for Win32Exception:\n"
                        "  x.toString() [%d;\"%.*s\"] does not equal [%d;\"%.*s\"]\n"
                    ,   x.toString().length, x.toString()
                    ,   string.length, string);
            }
            assert(string == x.toString());
        }
    }

The test fires with


UnitTest failure for Win32Exception:
  x.toString() [30;"Test 2 (3)"] does not equal [10;"Test 2 (3)"]
Error: Assertion Failure registry(180)

In other words, the strings are the same, but the arrays are not. I
understand what's going on here, and why the current support for arrays of
char is implemented as it is, but this is just going to keep
recurring: you can't expect people to use char[] for strings and not apply
== and != to them. It's just asking more than human nature can give.

Possible solutions:

1. Always ensure that the length of a char[] represents the C-style length. This
would not allow a null terminator (so it'd be pretty hard to make it work,
no?), not to mention proscribing the often useful technique of
(pre-)allocating beyond the current extent.
2. Implement == and != differently for char[] than for the other arrays. After
all, if you're using char[] to hold bytes (not characters), you're ([mod:
expletives deleted]) and not availing yourselves of the full power of D's
extended (over C and C++) types.
3. Provide a separate string class that works in a "sensible" way.

Since 1 is unworkable, and I'm out of inspiration, let's discuss 2 and 3.
Unless I'm in a minority of one again in believing this is another
imperfection in the language that's going to be a constant source of
problems.

Oct 11 2003

"Charles Sanders" <sanders-consulting comcast.net> writes:

Eeep!  I definitely think this deserves top priority.  Although I'm a little
confused: why is the length of x.toString() 30?  And also, wouldn't this
problem apply to arrays of any other type?



Oct 11 2003

"Matthew Wilson" <matthew stlsoft.org> writes:

 Eeep!  I definitely think this deserves top priority.

Agreed

  Although I'm a little
 confused: why is the length of x.toString() 30?

I have no idea. It will be something inside the exception infrastructure
that does it.

But the fact is, it is easy to have a "string" lie about its length:

char[] s = "A nice string";
char[] badChar = new char[1];
badChar[0] = 0;

for(int i = 0; i < 10; ++i)
{
  s ~= badChar;
}

printf("%d;%.*s\n", s.length, s);

What do you think that should print? You get

    "23;A nice string" - kind of nasty

  And also wouldn't this
 problem apply to other arrays of any type ?

No, because other types don't use 0 as a terminal marker, as character
strings do.
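
For illustration, a minimal sketch of the same effect, written against present-day D rather than the 2003 compiler used in this thread (that is an assumption, as is the helper name padWithNuls). It shows that == treats trailing NUL characters as ordinary elements, so an array that "looks" identical when printed by C's printf still compares unequal.

```d
/// Returns a copy of `s` with `n` NUL characters appended (hypothetical helper).
char[] padWithNuls(const(char)[] s, size_t n)
{
    char[] result = s.dup;
    foreach (i; 0 .. n)
        result ~= '\0';
    return result;
}

void main()
{
    char[] s = "A nice string".dup;
    char[] padded = padWithNuls(s, 10);

    assert(padded.length == s.length + 10); // the NULs count towards .length
    assert(padded != s);                    // lengths differ, so the arrays differ
    assert(padded[0 .. s.length] == s);     // the visible prefix is still equal
}
```

The last assertion is the slice-based comparison that comes up later in the thread.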


Oct 11 2003

"Walter" <walter digitalmars.com> writes:

What is happening here is trying to simultaneously use two different
representations of strings, one with an explicit length, and one with a 0
termination. If you are going to use both in the same array, you'll need to
set the explicit length properly.

You're also seeing an artifact of printf's "%.*s" format where the * is
taken to be the maximum length, not the minimum length. printf is still a C
function, and quits when it sees a 0 byte. What your string really is is:
    "Test 2 (3)\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
and a D printf would print it that way.
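
The "%.*s" behaviour described above can be checked with a small sketch. It assumes present-day D and the C runtime's snprintf (neither is taken from the thread), and printedLength is a hypothetical helper name. Passing a null buffer with size 0 makes snprintf report only how many characters it would have written.

```d
import core.stdc.stdio : snprintf;

/// Number of characters C's "%.*s" would emit for `s` (hypothetical helper).
int printedLength(const(char)[] s)
{
    // The precision given for '*' is an upper bound; output stops at the first NUL.
    return snprintf(null, 0, "%.*s", cast(int) s.length, s.ptr);
}

void main()
{
    // "Test 2 (3)" followed by 20 NUL bytes, mirroring the 30-element array above.
    char[30] buf = '\0';
    buf[0 .. 10] = "Test 2 (3)";

    assert(buf[].length == 30);           // what == and .length see
    assert(printedLength(buf[]) == 10);   // what C's printf shows
}
```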

Alternatively, what you can do is:
1) Use char[] for a D string.
2) Use char* for a C string.
Which should be clear to anyone examining the code.

More comments embedded.

"Matthew Wilson" <matthew stlsoft.org> wrote in message
news:bm9tdq$26up$1 digitaldaemon.com...
 In other words, the strings are the same, but the arrays are not.

No, the strings are not the same. printf just quit when it saw a '\0'. To
compare null terminated strings, slice them to set the length properly, or
use strcmp().

 I understand what's going on here, and why the current support for arrays of
 char is implemented as it is, but this is just going to keep
 recurring: you can't expect people to use char[] for strings and not apply
 == and != to them. It's just asking more than human nature can give.

 Possible solutions:

 1. Always ensure that the length of a char[] represents the C-style length. This
 would not allow a null terminator (so it'd be pretty hard to make it work,
 no?),

Not a problem, since slices work!
    string=string[0..strlen(string)];
takes a slice of the existing array,
the terminating 0 does not go away. The .length property of an array is
*not* the allocated length, it is only guaranteed to be <= the allocated
length.

 not to mention proscribing the often useful technique of
 (pre-)allocating beyond the current extent.

Not at all. The .length is not the allocated length of an array. It's the
length of a slice of the allocated storage. Don't worry about the allocated
length, the gc will manage that for you. Only by explicitly changing the
.length property is it possible that the allocated length changes; merely
doing a slice is guaranteed to not affect the allocated length. (And it
could not, otherwise the semantics and usefulness of slices would completely
disintegrate.)
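
A short sketch of the slicing point, assuming present-day D, where strlen needs an explicit .ptr (the 2003 compiler converted arrays to pointers implicitly). The helper name trimAtNul is hypothetical, and it assumes the buffer really does contain a NUL.

```d
import core.stdc.string : strlen;

/// Returns the part of `buf` before its first NUL; assumes `buf` contains one.
char[] trimAtNul(char[] buf)
{
    return buf[0 .. strlen(buf.ptr)];
}

void main()
{
    char[24] storage = '\0';
    storage[0 .. 4] = " (3)";

    char[] s = trimAtNul(storage[]);

    assert(s.length == 4);          // the slice's length, not the storage size
    assert(s.ptr is storage.ptr);   // same memory: slicing copied nothing
    assert(storage[4] == '\0');     // the terminator is still there, outside the slice
}
```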

 2. Implement == and != differently for char[] than for the other arrays. After
 all, if you're using char[] to hold bytes (not characters), you're ([mod:
 expletives deleted]) and not availing yourselves of the full power of D's
 extended (over C and C++) types.

I think that will cause more confusion. Consistency is worth a great deal.

 3. Provide a separate string class that works in a "sensible" way.

I can see if someone wants a C string class, but I'd call it Cstrings or
ASCIZ strings.

 Since 1 is unworkable, and I'm out of inspiration, let's discuss 2 and 3.
 Unless I'm in a minority of one again in believing this is another
 imperfection in the language that's going to be a constant source of
 problems.

There will always be some details to deal with when trying to use two
different string representations with one format. The solution is to use D
strings throughout the program, only converting to a C string when calling a
C API function using toStringz(), and when receiving a string from a C API
function, immediately convert it to a D string using:
    string = string[0..strlen(string)];
and I think the problems you're having will disappear. Alternatively, use
char[] for D strings, and char* for C strings.
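
As a rough sketch of that boundary discipline in present-day D (an assumption; the thread predates it): toStringz is the Phobos call named above, and fromStringz is a later Phobos convenience that does the same strlen-based slicing. No real C API is called here; the pointer simply stands in for one.

```d
import core.stdc.string : strlen;
import std.string : fromStringz, toStringz;

void main()
{
    string d = "Test 2 (3)";

    // D -> C: toStringz yields a NUL-terminated pointer suitable for a C call.
    const(char)* c = toStringz(d);
    assert(strlen(c) == d.length);

    // C -> D: slice by strlen, as described above, so .length is correct again.
    const(char)[] back = c[0 .. strlen(c)];
    assert(back == d);

    // fromStringz performs the same slicing.
    assert(fromStringz(c) == d);
}
```

Once the string is back in D form, == and != behave as expected because the length no longer includes anything past the terminator.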

I should make a FAQ entry for this. <g>

Oct 11 2003

"Matthew Wilson" <matthew stlsoft.org> writes:

Walter

You've mistaken what I was saying.

Of course I understand that a D string can contain embedded NULLs, and I
*assure* you that I grok the difference between a C and a D string.

You're actually explaining the reverse situation. If you look at the example
code, you can quite clearly see that the problem is the reverse of what (I
think that) you think I'm thinking.

The constructor for a Win32Exception does the following:

    this(char[] message, int error)
    {
        char    sz[24];

        wsprintfA(sz, " (%d)", error);

        m_message = message;
        m_error = error;

        super(message ~ sz);
    }

In the invariant, the code passes "Test 2" and 3 to the ctor, and then
checks that the message from the caught exception is "Test 2 (3)".

If it does this by comparing them as D strings, the comparison fails. This
is demonstrated by the fact that the assertion fails but printf prints them
as equal. As you say, it should print

    "Test 2 (3)\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"

The problem is, I've written the Win32Exception ctor, and I've written the
test case. In no part of the code have I requested any extra storage, and
yet the two strings are not equal! Hence, I cannot trust D strings to be
strings, or at least the current implementation of the exception-handling
mechanism is broken.

This is the specific case. In the general case, to which I admit much of
your response is persuasive, I still think there is a problem, but that is
more of a user expectation than a bona fide flaw.

Can you address my specific problem? My workaround has been to use the
function string_equal(), but that chews

boolean string_equal(char[] s1, char[] s2)
{
    return 0 == strcmp(toStringz(s1), toStringz(s2));
}

Matthew
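
A possible variation on string_equal(), sketched in present-day D under the assumption that the cost being objected to is the pair of toStringz conversions, which may copy each argument. The names untilNul and cStringEqual are hypothetical. It compares only the portion of each array before its first NUL and never copies.

```d
/// Slice of `s` up to (not including) its first NUL, or all of `s` if it has none.
const(char)[] untilNul(const(char)[] s)
{
    foreach (i, c; s)
        if (c == '\0')
            return s[0 .. i];
    return s;
}

/// Compares two arrays the way strcmp() == 0 would, without making copies.
bool cStringEqual(const(char)[] a, const(char)[] b)
{
    return untilNul(a) == untilNul(b);
}

void main()
{
    char[] padded = "Test 2 (3)".dup ~ "\0\0\0";

    assert(padded != "Test 2 (3)");              // plain array comparison fails
    assert(cStringEqual(padded, "Test 2 (3)"));  // C-style comparison succeeds
}
```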



Oct 11 2003

"Matthew Wilson" <matthew stlsoft.org> writes:

Gah!

Hoisted again.

Does

       super(message ~ sz);

add the whole length of sz? I guess the answer is a sorry yes.

Well, the specific case is answered, but the general one gathers weight. :(

(And I look like a tool in public once again ...)

Oct 11 2003

"Matthew Wilson" <matthew stlsoft.org> writes:

Yep, that was it. Thank goodness we can slice a C array!

    this(char[] message, int error)
    {
        char    sz[24]; // Enough for the three " ()" characters and a
                        // 64-bit integer value
        int     cch = wsprintfA(sz, " (%d)", error);

        m_message = message;
        m_error   = error;

        super(message ~ sz[0 .. cch]);
    }

Now it works correctly.

I still maintain that this is a nasty waiting to catch the unwary. Maybe the
answer is to require that the ~/~= operators take a D array, or a C-array
slice, rather than a C array or char*. Is that workable?
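
The difference between appending the whole fixed-size buffer and appending only the formatted slice can be sketched as follows. This assumes present-day D and substitutes the C runtime's snprintf for wsprintfA so that it is not Windows-specific; the function name describe is hypothetical and is not the Win32Exception constructor itself.

```d
import core.stdc.stdio : snprintf;

/// Builds "message (code)", appending only the characters snprintf wrote.
string describe(string message, int code)
{
    char[24] sz = '\0';
    int cch = snprintf(sz.ptr, sz.length, " (%d)", code);
    assert(cch > 0 && cast(size_t) cch < sz.length);

    return message ~ sz[0 .. cch].idup;
}

void main()
{
    assert(describe("Test 2", 3) == "Test 2 (3)");

    // Appending the entire buffer drags the unused NULs along: 6 + 24 elements.
    char[24] sz = '\0';
    sz[0 .. 4] = " (3)";
    assert(("Test 2" ~ sz[]).length == 30);
}
```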

"Matthew Wilson" <matthew stlsoft.org> wrote in message
news:bma5vq$2ier$1 digitaldaemon.com...
 Gah!

 Hoisted again.

 Does

        super(message ~ sz);

 add the whole length of sz? I guess the answer is a sorry yes.

 Well, the specifc case is answered, but the general one gathers weight. :(

 (And I look like a tool in public once again ...)

 "Matthew Wilson" <matthew stlsoft.org> wrote in message
 news:bma5kj$2hv1$1 digitaldaemon.com...
 Walter

 You've mistaken what I was saying.

 Of course I understand that a D string can contain embedded NULLs, and I
 *assure* you that I grok the difference between a C and a D string.

 You're actually explaining the reverse situation. If you look at the

 example
 code, you can quite clearly see that the problem is the reverse of what


(I
 think that) you think I'm thinking.

 The constructor for a Win32Exception does the following:

     this(char[] message, int error)
     {
         char    sz[24];

         wsprintfA(sz, " (%d)", error);

         m_message = message;
         m_error = error;

         super(message ~ sz);
     }

 In the invariant, the code passes "Test 2" and 3 to the ctor, and then
 checks that the message from the caught exception is "Test 2 (3)".

 If it does this by comparing them as D strings, the comparison fails.


This
 is demonstrated by the fact that the assertion fails but printf prints

 them
 as equal. As you say, it should print

     "Test 2 (3)\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"

 The problem is, I've written the Win32Exception ctor, and I've written


the
 test case. In no part of the code have I requested any extra storage,


and
 yet the two strings are not equal! Hence, I cannot trust D strings to be
 strings, or at least the current implementation of the


exception-handling
 mechanism is broken.

 This is the specific case. In the general case, to which I admit much of
 your response is persuasive, I still think there is a problem, but that


is
 more of a user expectation than a bona fide flaw.

 Can you address my specific problem? My workaround has been to use the
 function string_equal(), but that chews

 boolean string_equal(char[] s1, char[] s2)
 {
     return 0 == strcmp(toStringz(s1), toStringz(s2));
 }

 Matthew



 "Walter" <walter digitalmars.com> wrote in message
 news:bma49n$2g22$1 digitaldaemon.com...
 What is happening here is trying to simultaneously use two different
 representations of strings, one with an explicit length, and one with



a
 0
 termination. If you are going to use both in the same array, you'll



need
 to
 set the explicit length properly.

 You're also seeing an artifact of printf's "%.*s" format where the *



is
 taken to be the maximum length, not the minimum length. printf is



still
 a
 C
 function, and quits when it sees a 0 byte. What your string really is


 is:
     "Test 2 (3)\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
 and a D printf would print it that way.

 Alternatively, what you can do is:
 1) Use char[] for a D string.
 2) Use char* for a C string.
 Which should be clear to anyone examining the code.

 More comments embedded.

 "Matthew Wilson" <matthew stlsoft.org> wrote in message
 news:bm9tdq$26up$1 digitaldaemon.com...
 This is something that's going to come up again and again.

 I'm just writing some unittests for the registry module, along the



 lines
 of
     // (ii) Catch that can throw and be caught by Exception
     {
         char[]  message =   "Test 2";
         int     code    =   3;
         char[]  string  =   "Test 2 (3)";

         try
         {
             throw new Win32Exception("Test 2", code);
         }
         catch(Exception x)
         {
             if(string != x.toString())
             {
                 printf( "UnitTest failure for Win32Exception:\n"
                         "  x.toString() [%d;\"%.*s\"] does not equal
 [%d;\"%.*s\"]\n"
                     ,   x.toString().length, x.toString()
                     ,   string.length, string);
             }
             assert(string == x.toString());
         }
     }

 The test fires with


 UnitTest failure for Win32Exception:
   x.toString() [30;"Test 2 (3)"] does not equal [10;"Test 2 (3)"]
 Error: Assertion Failure registry(180)

 In other words, the strings are the same, but the arrays are not.

 No, the strings are not the same. printf just quit when it saw a '\0'. To
 compare null terminated strings, slice them to set the length properly, or
 use strcmp().

 I understand what's going on here, and why the current support for arrays of
 char is implemented as it is, but this is just going to keep recurring: you
 can't expect people to use char[] for strings and not apply == and != to
 them. It's just asking more than human nature can give.




 Possible solutions:

 1. Always ensure that the length of a char[] represents the C-style length.
 This would not allow a null terminator (so it'd be pretty hard to make it
 work, no?),

 Not a problem, since slices work!
     string=string[0..strlen(string)];
 takes a slice of the existing array,
 the terminating 0 does not go away. The .length property of an array is
 *not* the allocated length, it is only guaranteed to be <= the allocated
 length.

 not to mention proscribing the often useful technique of
 (pre-)allocating beyond the current extent.

 Not at all. The .length is not the allocated length of an array. It's the
 length of a slice of the allocated length. Don't worry about the allocated
 length, the gc will manage that for you. Only by explicitly changing the
 .length property is it possible that the allocated length changes; merely
 doing a slice is guaranteed to not affect the allocated length. (And it
 could not, otherwise the semantics and usefulness of slices would completely
 disintegrate.)

 2. Implement == and != differently for char[] than for the other arrays.
 After all, if you're using char[] to hold bytes (not characters), you're
 ([mod: expletives deleted]) and not availing yourselves of the full power of
 D's extended (over C and C++) types.

 I think that will cause more confusion. Consistency is worth a great deal.

 3. Provide a separate string class that works in a "sensible" way.

 I can see if someone wants a C string class, but I'd call it Cstrings or
 ASCIZ strings.

 Since 1 is unworkable, and I'm out of inspiration, let's discuss 2 and 3.
 Unless I'm in a minority of one again in believing this is another
 imperfection in the language that's going to be a constant source of
 problems.

 There will always be some details to deal with when trying to use two
 different string representations with one format. The solution is to use D
 strings throughout the program, only converting to a C string when calling a
 C API function using toStringz(), and when receiving a string from a C API
 function, immediately convert it to a D string using:
     string = string[0..strlen(string)];
 and I think the problems you're having will disappear. Alternatively, use
 char[] for D strings, and char* for C strings.

 I should make a FAQ entry for this. <g>

Oct 11 2003

"Walter" <walter digitalmars.com> writes:

"Matthew Wilson" <matthew stlsoft.org> wrote in message
news:bma69s$2ipi$1 digitaldaemon.com...
 I still maintain that this is a nasty waiting to catch the unwary. Maybe the
 answer is to require that the ~/~= operators take a D array, or a C-array
 slice, rather than a C array or char*. Is that workable?

I don't know how since there is no distinct C array type in D. I think a
better solution is to replace all the C functions that deal with strings
with corresponding D functions.

Oct 11 2003

"Hauke Duden" <H.NS.Duden gmx.net> writes:

"Walter" <walter digitalmars.com> wrote in message
news:bmapvs$bdj$1 digitaldaemon.com...
 I still maintain that this is a nasty waiting to catch the unwary. Maybe the
 answer is to require that the ~/~= operators take a D array, or a C-array
 slice, rather than a C array or char*. Is that workable?

 I don't know how since there is no distinct C array type in D. I think a
 better solution is to replace all the C functions that deal with strings
 with corresponding D functions.

That might be possible for functions in the runtime lib, but what about 3rd
party C libraries?

As I understand it, one of the major design goals of D is to be able to
easily interact with existing C code. And I agree with Matthew that the need
for this kind of manual conversion is very error-prone. I think this is
definitely a problem that needs to be addressed.

The more I think about it, the more I realize that we'll need a real string
class that takes care of such issues. Maybe that class could always add a
terminating zero that is not included in the length. That way we maintain
compatibility with all the existing string libraries and can still pull off
nifty stuff like having embedded zeros if you use your strings only with
pure D code.

Another thing: what about OS functions? The whole Win32 API expects
zero-terminated strings. And we cannot wrap everything into D functions, can
we? It gets even worse with COM interfaces. Since these use an object
oriented approach with a virtual function table, you cannot replace
individual functions with wrappers. You would have to wrap the whole object
with all its interfaces - which isn't possible, since you might not know all
the interfaces it supports!

Hauke

Oct 12 2003

"Walter" <walter digitalmars.com> writes:

"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:bmbgpn$19tm$1 digitaldaemon.com...
 That might be possible for functions in the runtime lib, but what about 3rd
 party C libraries?

 As I understand it, one of the major design goals of D is to be able to
 easily interact with existing C code. And I agree with Matthew that the need
 for this kind of manual conversion is very error-prone. I think this is
 definitely a problem that needs to be addressed.

 The more I think about it, the more I realize that we'll need a real string
 class that takes care of such issues. Maybe that class could always add a
 terminating zero that is not included in the length. That way we maintain
 compatibility with all the existing string libraries and can still pull off
 nifty stuff like having embedded zeros if you use your strings only with
 pure D code.

 Another thing: what about OS functions? The whole Win32 API expects
 zero-terminated strings. And we cannot wrap everything into D functions, can
 we? It gets even worse with COM interfaces. Since these use an object
 oriented approach with a virtual function table, you cannot replace
 individual functions with wrappers. You would have to wrap the whole object
 with all its interfaces - which isn't possible, since you might not know all
 the interfaces it supports!

You can use char* to interface with C code. They'll work fine as null
terminated strings.

Oct 12 2003

"Hauke Duden" <H.NS.Duden gmx.net> writes:

"Walter" <walter digitalmars.com> wrote in message
news:bmc4uf$23qj$1 digitaldaemon.com...
 You can use char* to interface with C code. They'll work fine as null
 terminated strings.

That's not what I meant. The main problem is that D strings are by default
not null-terminated. If they were, then all the array operations
(concatenation, etc.) wouldn't produce proper strings. So you have to add a
terminating null to your strings just before passing them to a C function,
and remove it whenever you get a string back from a C function. Which is an
error-prone and strenuous thing to do.

You suggested that a solution to this issue would be to create D versions of
all C string functions, so that they can handle non-null-terminated strings.
This is what my reply was about - I tried to point out that this would be a
huge and (in the case of COM) sometimes impossible task.

Hauke

Oct 12 2003

"Walter" <walter digitalmars.com> writes:

"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:bmca96$2b66$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:bmc4uf$23qj$1 digitaldaemon.com...
 You can use char* to interface with C code. They'll work fine as null
 terminated strings.

 That's not what I meant. The main problem is that D strings are by default
 not null-terminated.

I see what you mean now, but that's not strictly true. The null termination
in C strings is by convention, it has nothing to do with the C core language
other than "literal strings" are null terminated. Literal strings in D are
null terminated, as well (the null is just not reflected in the .length
property). Hence, if you use char*, and take care to follow the C
conventions with it, just as you would in C, it will work like it does in C.
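
As a rough illustration of the literal-string point, here is a minimal sketch, assuming a present-day D compiler rather than the 2003 one under discussion. The helper name cLength is made up for the example; toStringz() and strlen() are the usual library functions. It shows that a literal's terminator is not counted in .length, and that a string built at run time has to be converted before a C function sees it.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: the length a C function would see for a D string.
size_t cLength(string s)
{
    // toStringz() returns a pointer to a null-terminated copy (or the
    // original data when a terminator is already in place).
    return strlen(s.toStringz);
}

void main()
{
    string lit = "Test 2";
    assert(lit.length == 6);       // the terminator is not part of .length

    const(char)* p = "Test 2";     // a literal converts implicitly to a C pointer
    assert(strlen(p) == 6);        // and C code finds the 0 byte after it

    string built = lit ~ " (3)";   // run-time result: no terminator promised
    assert(cLength(built) == 10);  // so convert explicitly before calling C
}
```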

 If they were, then all the array operations
 (concatenation, etc.) wouldn't produce proper strings. So you have to add a
 terminating null to your strings just before passing them to a C function,
 and remove it whenever you get a string back from a C function. Which is an
 error-prone and strenuous thing to do.

Actually, you have to frequently manually insert the 0 in C strings too when
programming in C. It's error-prone and tedious (though not strenuous <g>).

 You suggested that a solution to this issue would be to create D versions of
 all C string functions, so that they can handle non-null-terminated strings.
 This is what my reply was about - I tried to point out that this would be a
 huge and (in the case of COM) sometimes impossible task.

You're right, it would be impractical for COM. But COM also has multiple
representations for strings, like BSTR, LPSTR, OLESTR, etc. There's no way
to paper over all these things, one needs to examine each COM API function
when using it to be sure the right kind of string is passed. So I suggest
when using COM interfaces with C style null-terminated strings, use char*'s
(or wchar*'s or BSTR's or whatever) and use null-terminated strings in D. It
won't be any more work than it would be in C.

I don't think there's a practical way to have strings be both length
specified and null terminated without throwing away much of the benefit of
length specified strings, and without creating all kinds of odd cases where
it doesn't work right anyway.

Oct 12 2003

"Hauke Duden" <H.NS.Duden gmx.net> writes:

"Walter" <walter digitalmars.com> wrote in message
news:bmcd3q$2er9$1 digitaldaemon.com...
 I see what you mean now, but that's not strictly true. The null termination
 in C strings is by convention, it has nothing to do with the C core language
 other than "literal strings" are null terminated.

Yeah, you're right. However, it is a pretty strong convention, since almost
all C functions, both in the RTL and in 3rd party libraries, expect strings to
be null-terminated.

 Literal strings in D are
 null terminated, as well (the null is just not reflected in the .length
 property).

That's good to know! Is it guaranteed, or is it only a quirk in the current
implementation?

My main point, however, is that to be easily usable with C functions all
strings would have to be null-terminated, not just literal ones. The more I
think about this, the more I'm certain that there should be a standard
string class that overloads the operators to achieve this.

 You suggested that a solution to this issue would be to create D versions of
 all C string functions, so that they can handle non-null-terminated strings.
 This is what my reply was about - I tried to point out that this would be a
 huge and (in the case of COM) sometimes impossible task.

 You're right, it would be impractical for COM. But COM also has multiple
 representations for strings, like BSTR, LPSTR, OLESTR, etc. There's no way
 to paper over all these things, one needs to examine each COM API function
 when using it to be sure the right kind of string is passed.

You're right. Though most COM methods use wide char strings, so it might be
tempting to just pass a wchar array and forget the null-terminator handling.

 I don't think there's a practical way to have strings be both length
 specified and null terminated without throwing away much of the benefit of
 length specified strings, and without creating all kinds of odd cases where
 it doesn't work right anyway.

I think it is possible. I have done something like that in C++ before. The
main problem is that splicing a part from an existing string should be done
without creating a copy of the data. This can be dealt with by postponing
the appending of a null terminator until a C compatible string is actually
needed. So you would be able to keep the benefits of length-specified
strings as long as you don't pass them into C code.
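
The deferred-terminator idea could look roughly like the sketch below. This is only an illustration in present-day D, not the class proposed in this thread: the type LazyCString and its members slice() and toCString() are hypothetical names, and the only library call relied on is std.string.toStringz(). Slicing shares the existing data, and a terminated copy is made only when a C pointer is first requested.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical wrapper: length-specified until a C string is actually needed.
struct LazyCString
{
    private string data;                // D-style data, no terminator required
    private immutable(char)* cached;    // terminated form, created on demand

    this(string s) { data = s; }

    size_t length() { return data.length; }

    // Slicing shares the underlying data; nothing is copied here.
    LazyCString slice(size_t lo, size_t hi)
    {
        return LazyCString(data[lo .. hi]);
    }

    // The terminator is only added (copying if necessary) at this point.
    immutable(char)* toCString()
    {
        if (cached is null)
            cached = toStringz(data);
        return cached;
    }
}

void main()
{
    auto whole = LazyCString("Test 2 (3)");
    auto part = whole.slice(0, 6);          // "Test 2", still unterminated
    assert(part.length == 6);
    assert(strlen(part.toCString()) == 6);  // C sees a properly terminated string
}
```

The cost is one cached pointer per string value; strings that never reach C code never pay for a copy.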

I think this is a pretty important issue that needs to be solved as soon as
possible. I'll try and implement a D string class that makes this kind of
thing easier when I can find some time.

Hauke

Oct 12 2003

"Walter" <walter digitalmars.com> writes:

You *are* allocating extra storage - the
    char sz[24];
creates a string 24 bytes long. The fix is to rewrite the super constructor
call from:
    super(message ~ sz);
to:
    super(message ~ sz[0..strlen(sz)]);
and it should work fine.
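
For anyone who wants to try the difference outside the exception class, here is a minimal sketch in present-day D (so string and snprintf rather than the 2003 char[] and wsprintfA). The helper names unslicedMessage and slicedMessage are invented for the example. The buffer is zero-filled explicitly to mirror the padding seen in the thread.

```d
import core.stdc.stdio : snprintf;
import core.stdc.string : strlen;

// Hypothetical helper: concatenates the whole fixed-size buffer.
string unslicedMessage(string message, int error)
{
    char[24] sz = '\0';                          // zero-filled fixed buffer
    snprintf(sz.ptr, sz.length, " (%d)", error);
    return message ~ sz[].idup;                  // all 24 chars, padding included
}

// Hypothetical helper: slices the buffer at the C terminator first.
string slicedMessage(string message, int error)
{
    char[24] sz = '\0';
    snprintf(sz.ptr, sz.length, " (%d)", error);
    return message ~ sz[0 .. strlen(sz.ptr)].idup;   // only the formatted part
}

void main()
{
    // 6 chars of "Test 2" plus the full 24-char buffer, as in the failing test.
    assert(unslicedMessage("Test 2", 3).length == 30);
    assert(unslicedMessage("Test 2", 3) != "Test 2 (3)");

    // With the slice, == compares equal as expected.
    assert(slicedMessage("Test 2", 3) == "Test 2 (3)");
}
```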

"Matthew Wilson" <matthew stlsoft.org> wrote in message
news:bma5kj$2hv1$1 digitaldaemon.com...
 Walter

 You've mistaken what I was saying.

 Of course I understand that a D string can contain embedded NULLs, and I
 *assure* you that I grok the difference between a C and a D string.

 You're actually explaining the reverse situation. If you look at the example
 code, you can quite clearly see that the problem is the reverse of what (I
 think that) you think I'm thinking.

 The constructor for a Win32Exception does the following:

     this(char[] message, int error)
     {
         char    sz[24];

         wsprintfA(sz, " (%d)", error);

         m_message = message;
         m_error = error;

         super(message ~ sz);
     }

 In the invariant, the code passes "Test 2" and 3 to the ctor, and then
 checks that the message from the caught exception is "Test 2 (3)".

 If it does this by comparing them as D strings, the comparison fails. This
 is demonstrated by the fact that the assertion fails but printf prints them
 as equal. As you say, it should print

     "Test 2 (3)\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"

 The problem is, I've written the Win32Exception ctor, and I've written the
 test case. In no part of the code have I requested any extra storage, and
 yet the two strings are not equal! Hence, I cannot trust D strings to be
 strings, or at least the current implementation of the exception-handling
 mechanism is broken.

 This is the specific case. In the general case, to which I admit much of
 your response is persuasive, I still think there is a problem, but that is
 more of a user expectation than a bona fide flaw.

 Can you address my specific problem? My workaround has been to use the
 function string_equal(), but that chews

 boolean string_equal(char[] s1, char[] s2)
 {
     return 0 == strcmp(toStringz(s1), toStringz(s2));
 }

 Matthew



 "Walter" <walter digitalmars.com> wrote in message
 news:bma49n$2g22$1 digitaldaemon.com...
 What is happening here is trying to simultaneously use two different
 representations of strings, one with an explicit length, and one with a 0
 termination. If you are going to use both in the same array, you'll need to
 set the explicit length properly.

 You're also seeing an artifact of printf's "%.*s" format where the * is
 taken to be the maximum length, not the minimum length. printf is still a C
 function, and quits when it sees a 0 byte. What your string really is is:
     "Test 2 (3)\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
 and a D printf would print it that way.

 Alternatively, what you can do is:
 1) Use char[] for a D string.
 2) Use char* for a C string.
 Which should be clear to anyone examining the code.

 More comments embedded.

 "Matthew Wilson" <matthew stlsoft.org> wrote in message
 news:bm9tdq$26up$1 digitaldaemon.com...
 This is something that's going to come up again and again.

 I'm just writing some unittests for the registry module, along the lines of
     // (ii) Catch that can throw and be caught by Exception
     {
         char[]  message =   "Test 2";
         int     code    =   3;
         char[]  string  =   "Test 2 (3)";

         try
         {
             throw new Win32Exception("Test 2", code);
         }
         catch(Exception x)
         {
             if(string != x.toString())
             {
                 printf( "UnitTest failure for Win32Exception:\n"
                         "  x.toString() [%d;\"%.*s\"] does not equal [%d;\"%.*s\"]\n"
                     ,   x.toString().length, x.toString()
                     ,   string.length, string);
             }
             assert(string == x.toString());
         }
     }

 The test fires with


 UnitTest failure for Win32Exception:
   x.toString() [30;"Test 2 (3)"] does not equal [10;"Test 2 (3)"]
 Error: Assertion Failure registry(180)

 In other words, the strings are the same, but the arrays are not.

 No, the strings are not the same. printf just quit when it saw a '\0'. To
 compare null terminated strings, slice them to set the length properly, or
 use strcmp().

 I understand what's going on here, and why the current support for arrays of
 char is implemented as it is, but this is just going to keep recurring: you
 can't expect people to use char[] for strings and not apply == and != to
 them. It's just asking more than human nature can give.

 Possible solutions:

 1. Always ensure that the length of a char[] represents the C-style length.
 This would not allow a null terminator (so it'd be pretty hard to make it
 work, no?),

 Not a problem, since slices work!
     string=string[0..strlen(string)];
 takes a slice of the existing array,
 the terminating 0 does not go away. The .length property of an array is
 *not* the allocated length, it is only guaranteed to be <= the allocated
 length.

 not to mention proscribing the often useful technique of
 (pre-)allocating beyond the current extent.

 Not at all. The .length is not the allocated length of an array. It's the
 length of a slice of the allocated length. Don't worry about the allocated
 length, the gc will manage that for you. Only by explicitly changing the
 .length property is it possible that the allocated length changes; merely
 doing a slice is guaranteed to not affect the allocated length. (And it
 could not, otherwise the semantics and usefulness of slices would completely
 disintegrate.)

 2. Implement == and != differently for char[] than for the other arrays.
 After all, if you're using char[] to hold bytes (not characters), you're
 ([mod: expletives deleted]) and not availing yourselves of the full power of
 D's extended (over C and C++) types.

 I think that will cause more confusion. Consistency is worth a great deal.

 3. Provide a separate string class that works in a "sensible" way.

 I can see if someone wants a C string class, but I'd call it Cstrings or
 ASCIZ strings.

 Since 1 is unworkable, and I'm out of inspiration, let's discuss 2 and 3.
 Unless I'm in a minority of one again in believing this is another
 imperfection in the language that's going to be a constant source of
 problems.

 There will always be some details to deal with when trying to use two
 different string representations with one format. The solution is to use D
 strings throughout the program, only converting to a C string when calling a
 C API function using toStringz(), and when receiving a string from a C API
 function, immediately convert it to a D string using:
     string = string[0..strlen(string)];
 and I think the problems you're having will disappear. Alternatively, use
 char[] for D strings, and char* for C strings.

 I should make a FAQ entry for this. <g>

Oct 11 2003
