digitalmars.D.bugs - toUTFxx returns null references

Derek Parnell <derek psych.ward> writes:

I do not know if this is a bug or not.

The toUTF32(), toUTF16(), and toUTF8() routines return a null reference if
the input parameter is an empty string. I would have thought that they
should return an empty string instead. The only exception is when the
parameter is the same type as the return value's type, in that case they
return an empty string.

Example code...
<code>
import std.utf;
import std.stdio;

void main()
{
   char[] s = "";
   dchar[] d;
   
   if (s is null) 
    writefln("s is null");
   else
    writefln("s length is %d", s.length);

   d = toUTF32(s);
   if (d is null) 
    writefln("d is null");
   else
    writefln("d length is %d", d.length);
} 
</code>

-- 
Derek
Melbourne, Australia
10/02/2005 7:28:31 PM

Feb 10 2005

Anders F Björklund <afb algonet.se> writes:

Derek Parnell wrote:

 I do not know if this is a bug or not.

Confusing, but I don't really think it's a bug...

(maybe the std routines need to be more similar to
each other, either all return null or all return "",
but both types of return values are OK to use, below:)

 The toUTF32(), toUTF16(), and toUTF8() routines return a null reference if
 the input parameter is an empty string. I would have thought that they
 should return an empty string instead. The only exception is when the
 parameter is the same type as the return value's type, in that case they
 return an empty string.

I believe that in D, the empty string is "equal" to null.

http://www.digitalmars.com/d/cppstrings.html:
  In D, an empty string is just null:
 
 	char[] str;
 	if (!str)
 		// string is empty

That works the same with either null or "", and this too:

 import std.stdio;
 
 void main()
 {
    char[] s = "";
    char[] d = null;
    
    writefln("s is %snull", s is null ? "" : "not ");
    writefln("s length is %d", s.length);
 
    writefln("d is %snull", d is null ? "" : "not ");
    writefln("d length is %d", d.length);
 } 

s is not null
s length is 0
d is null
d length is 0

Which means that whether it is "" or null, it'll compare
and work the same to the rest of code ? Unless C is involved,
since s.ptr will point to a '\0', but d.ptr points to null.

But that will work itself out in the toStringz process...
(since D strings have to be zero-terminated for C anyway)
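
As a rough illustration of the toStringz point, here is a minimal sketch. It assumes a current D2 compiler and Phobos, so `string` stands in for the `char[]` used elsewhere in this thread, and it says nothing about how the 2005 library behaved. The variable names are made up for the example.

```d
import std.string : toStringz;

void main()
{
    string empty = "";    // literal: non-null ptr, length 0
    string unset = null;  // null ptr, length 0

    // In current Phobos, toStringz hands C a valid, zero-terminated
    // pointer for both values, so the null/"" difference disappears.
    auto pe = toStringz(empty);
    auto pu = toStringz(unset);

    assert(pe !is null && *pe == '\0');
    assert(pu !is null && *pu == '\0');
}
```

So at the C boundary the two values look alike; the difference only matters to D code that inspects `.ptr` or uses `is null`.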

--anders

Feb 10 2005

Derek <derek psych.ward> writes:

On Thu, 10 Feb 2005 09:59:39 +0100, Anders F Björklund wrote:

 Derek Parnell wrote:
 
 I do not know if this is a bug or not.

 
 Confusing, but I don't really think it's a bug...
 
 (maybe the std routines need to be more similar to
 each other, either all return null or all return "",
 but both types of return values are OK to use, below:)
 
 The toUTF32(), toUTF16(), and toUTF8() routines return a null reference if
 the input parameter is an empty string. I would have thought that they
 should return an empty string instead. The only exception is when the
 parameter is the same type as the return value's type, in that case they
 return an empty string.

 
 I believe that in D, the empty string is "equal" to null.
 
 http://www.digitalmars.com/d/cppstrings.html:
  In D, an empty string is just null:
 
 	char[] str;
 	if (!str)
 		// string is empty

 
 That works the same with either null or "", and this too:
 
 import std.stdio;
 
 void main()
 {
    char[] s = "";
    char[] d = null;
    
    writefln("s is %snull", s is null ? "" : "not ");
    writefln("s length is %d", s.length);
 
    writefln("d is %snull", d is null ? "" : "not ");
    writefln("d length is %d", d.length);
 } 

 
 s is not null
 s length is 0
 d is null
 d length is 0
 
 Which means that whether it is "" or null, it'll compare
 and work the same to the rest of code ? Unless C is involved,
 since s.ptr will point to a '\0', but d.ptr points to null.
 
 But that will work itself out in the toStringz process...
 (since D strings have to be zero-terminated for C anyway)

I discovered this behaviour when I used an 'in' contract in a function ...

  bool foo(dchar[] X, dchar[] Y)
  in {
    assert( ! (X is null) );
    assert( ! (Y is null) );
 }
 body { . . .  }


So what you seem to be saying is that I shouldn't bother checking that a
dynamic array reference is null or not. Instead I can just check the
length. However, I was trying to trap the case in which the function was
called with an uninitialized array. Calling it with an empty array is ok
though.

A fuller example in which it tripped me up ...

<code>
import std.utf;
import std.stdio;

bool foo(dchar[] X, dchar[] Y)
  in {
    assert( ! (X is null) );
    assert( ! (Y is null) );
 }
 body { 
     return true;  }

bool foo(char[] X, char[] Y)
{
   return foo( toUTF32(X), toUTF32(Y) );
}

bool foo(wchar[] X, wchar[] Y)
{
   return foo( toUTF32(X), toUTF32(Y) );
}

unittest {
   dchar[] a;
   dchar[] b;

   a = "";
   b = "123";
   debug(1) writefln("UT1");
   assert( foo(toUTF32(a), toUTF32(b) ) );
   
   debug(1) writefln("UT2");
   assert( foo(toUTF16(a), toUTF16(b) ) );
   
   debug(1) writefln("UT3");
   assert( foo(toUTF8(a),  toUTF8(b) ) );
}

</code>

Compiled with    dmd test -debug -unittest



-- 
Derek
Melbourne, Australia

Feb 10 2005

Anders F Björklund <afb algonet.se> writes:

Derek wrote:

 So what you seem to be saying is that I shouldn't bother checking that a
 dynamic array reference is null or not. Instead I can just check the
 length. However, I was trying to trap the case in which the function was
 called with an uninitialized array. Calling it with an empty array is ok
 though.

No,
I don't think you should bother to differ between null and .length == 0.

 bool foo(dchar[] X, dchar[] Y)
   in {
     assert( ! (X is null) );
     assert( ! (Y is null) );
  }
  body { 
      return true;  }

The "recommended" way to write that is:

	assert(X);
	assert(Y);

Since D doesn't have booleans, that is ?
(and since the long form is an eye-sore)


I'm not sure what you are trying to test, but:

int main()
{
   char[] nullstr = null;
   assert(nullstr == "");
   assert("" == nullstr);
   return 0;
}

This test does not fail, and does not segfault...
(like it would have done if nullstr was an Object:)

int main()
{
   Object nullobj = null;
   assert(nullobj == null); // <-- KABOOM
   assert(null == nullobj); // <-- KABOOM
   return 0;
}

This second program *must* be rewritten with "is".
(since using '==' with class objects calls opEquals)

Pointers are OK too:

int main()
{
   void* nullptr = null;
   assert(nullptr == null);
   assert(null == nullptr);
   return 0;
}

To be on the safe side, one can use "is" always...
(i.e. with pointers/objects, but *not* with strings
since that only compares the references, like in Java)

--anders

Feb 10 2005

Derek <derek psych.ward> writes:

On Thu, 10 Feb 2005 14:10:47 +0100, Anders F Björklund wrote:

 
 I'm not sure what you are trying to test, but:

I'm testing for this ...
 
 void main()
 {
    char[] nullstr;

    assert( ! (nullstr is null) );
 }

Namely, the attempted use of a string that has never had any assignment
yet.

But as toUTFxx() returns something that looks like an unassigned
string, I can't test for unassigned strings.

I still think that the toUTFxx() functions should return an empty string if
an empty string was passed to them.

-- 
Derek
Melbourne, Australia

Feb 10 2005

Anders F Björklund <afb algonet.se> writes:

Derek wrote:

I'm not sure what you are trying to test, but:

 
 I'm testing for this ...
  
  void main()
  {
     char[] nullstr;
 
     assert( ! (nullstr is null) );
  }
 
 Namely, the attempted use of a string that has never had any assignment
 yet.

There is nothing wrong with using an unassigned string,
since all arrays (including char[]) default to length 0...

You can pass "nullstr" to writefln and friends, just fine.

 But as toUTFxx() returns something that looks like an unassigned
 string, I can't test for unassigned strings.

If you really, really, want to test for "unassigned" strings - use .ptr:

void main()
{
   char[] s = "";
   char[] d = null;

   assert(s.ptr != null);
   assert(d.ptr == null);
}

This is because the ptr of a string literal will point to a '\0' char.

 I still think that the toUTFxx() functions should return an empty string if
 an empty string was passed to them.

There is *no* difference in D, between null and the empty string.

They both have the length property set to 0, and they're equal.
(not identical, though, so using "is" between them will fail)

--anders

Feb 10 2005

Derek <derek psych.ward> writes:

On Thu, 10 Feb 2005 15:21:21 +0100, Anders F Björklund wrote:

 Derek wrote:
 
I'm not sure what you are trying to test, but:

 
 I'm testing for this ...
  
  void main()
  {
     char[] nullstr;
 
     assert( ! (nullstr is null) );
  }
 
 Namely, the attempted use of a string that has never had any assignment
 yet.

 
 There is nothing wrong with using an unassigned string,
 since all arrays (including char[]) default to length 0...
 
 You can pass "nullstr" to writefln and friends, just fine.
 
 But as toUTFxx() returns something that looks like an unassigned
 string, I can't test for unassigned strings.

 
 If you really, really, want to test for "unassigned" strings - use .ptr:
 
 void main()
 {
    char[] s = "";
    char[] d = null;
 
    assert(s.ptr != null);
    assert(d.ptr == null);
 }
 
 This is because the ptr of a string literal will point to a '\0' char.
 
 I still think that the toUTFxx() functions should return an empty string if
 an empty string was passed to them.

 
 There is *no* difference in D, between null and the empty string.
 
 They both have the length property set to 0, and they're equal.
 (not identical, though, so using "is" between them will fail)

Yes, I understand the technical aspect of this. However, I was attempting
to help the coder trap mistakes; namely the use of unassigned strings. The
assumption is that if a coder declares a string, and uses it before
assigning anything to it, then it might mean that there is a logic error in
the code. This is slightly different from the use of numbers, as most
people expect that numbers are zero upon declaration. But still, it's just a
philosophy question really. Walter has decided for us that unassigned
variables are an acceptable practice, whereas pedantic people such as
myself think that they might indicate errors in coding.

I will, no doubt, have to adjust to the given situation as it ain't gonna
change ;-)

-- 
Derek
Melbourne, Australia

Feb 10 2005

"Regan Heath" <regan netwin.co.nz> writes:

On Thu, 10 Feb 2005 15:21:21 +0100, Anders F Björklund <afb algonet.se>
wrote:
 There is *no* difference in D, between null and the empty string.

There is a difference, internally, but D treats them the same. Which is  
probably what you meant, but I'm just being thorough. :)

A null string has ptr == null, an empty string has ptr == "".

In some instances it is crucial to be able to tell these cases apart:
  1- value does not exist (null)
  2- value is blank       (empty string)

To check for case 1, we can go "if (s is null)"
To check for case 2, we can go "if (s.length == 0)"

eg. Simple example where it is important:

User enters data into a text field (A) on a web page, leaves text field  
(B) blank, the code is saving the values of these two fields somewhere  
i.e. in a database containing 3 settings A, B and C.

The presence of the empty field (B) on the page indicates any previous
value for that setting should be overwritten with the empty value.

The absence of the field (C) indicates that any previous value of the
setting should not be overwritten but kept.
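
The form scenario above can be sketched roughly as follows. This assumes a current D2 compiler (hence `string` rather than `char[]`), and `applyField` is a hypothetical helper invented for the example. It relies on `is null` distinguishing a null string from an empty literal, which is the very behaviour this thread argues should be specified.

```d
/// Hypothetical helper: `submitted` is null when the field was absent
/// from the form, and non-null (possibly empty) when it was present.
void applyField(ref string stored, string submitted)
{
    if (submitted is null)
        return;            // field absent: keep the previous value
    stored = submitted;    // field present: overwrite, even with ""
}

void main()
{
    string a = "old A", b = "old B", c = "old C";

    applyField(a, "new A"); // field A was filled in
    applyField(b, "");      // field B was present but left blank
    applyField(c, null);    // field C was not on the page at all

    assert(a == "new A");
    assert(b.length == 0);
    assert(c == "old C");
}
```

A check on `.length == 0` alone could not tell B from C here, which is why the two cases need different tests.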

Regan

Feb 10 2005

Derek Parnell <derek psych.ward> writes:

On Fri, 11 Feb 2005 10:05:06 +1300, Regan Heath wrote:

 On Thu, 10 Feb 2005 15:21:21 +0100, Anders F Björklund <afb algonet.se>
 wrote:
 There is *no* difference in D, between null and the empty string.

 
 There is a difference, internally, but D treats them the same. Which is  
 probably what you meant, but I'm just being thorough. :)
 
 A null string has ptr == null, an empty string has ptr == "".
 
 In some instances it is crucial to be able to tell these cases apart:
   1- value does not exist (null)
   2- value is blank       (empty string)

Exactly! Well said.

-- 
Derek
Melbourne, Australia
11/02/2005 9:49:04 AM

Feb 10 2005

Anders F Björklund <afb algonet.se> writes:

Derek Parnell wrote:

There is *no* difference in D, between null and the empty string.
There is a difference, internally, but D treats them the same. Which is  
probably what you meant, but I'm just being thorough. :)


More or less, yes. But that's more of an Implementation Quirk™.

The D specification explicitly says:

http://www.digitalmars.com/d/arrays.html
 Array Initialization
 
     * Dynamic arrays are initialized to having 0 elements.

http://www.digitalmars.com/d/cppstrings.html
 Checking For Empty Strings

  In D, an empty string is just null:
 
 	char[] str;
 	if (!str)
 		// string is empty

But in practice, they do differ - in the ptr to the '\0' (for C).
(but both have a length property of 0, though, as mentioned earlier)

And when you copy the char[], this ptr setting follows as well...
This means that there is a way to trace if it has been set to "".


A null string has ptr == null, an empty string has ptr == "".

In some instances it is crucial to be able to tell these cases apart:
  1- value does not exist (null)
  2- value is blank       (empty string)

 
 Exactly! Well said.

But strings in D are not objects or pointers, they are arrays...
And arrays are initialized to have the length zero, in the spec.

Thus, that makes them similar to e.g. an integer that is initialized
with a zero ? You will have to check if they are modified in some
other way. Or just rely on the "string.ptr" value, since that will
work as long as D supports calling C functions with string literals...


But technically, there is no difference in D between "" and null.
Which is probably why the standard library mixes them freely ?

To recap:

""
     .length = 0
     .ptr = &'\0'

null
     .length = 0
     .ptr = null

 void main()
 {
   char[] emptystr = "";
   char[] nullstr = null;
 
   assert(emptystr == nullstr);
   assert(!(emptystr is nullstr));
 
   assert(emptystr.length == nullstr.length);
   assert(!(emptystr.ptr is nullstr.ptr));
 }

And the D standard library should probably be "fixed" to return
null for null and "" for "" anyway, even if it's not in the spec ?

Care to write a full unittest for it ? (at least for all of std.utf)
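
Until the library behaves that way, the "null for null and "" for """ idea can be approximated with a thin wrapper. This is only a sketch under assumptions: it targets a current D2 compiler (so `string` and `dstring` replace `char[]` and `dchar[]`), and `toUTF32Keep` is a hypothetical name, not part of std.utf. It deliberately does not depend on what `toUTF32` itself returns for empty input.

```d
import std.utf : toUTF32;

/// Hypothetical wrapper around std.utf.toUTF32 that keeps the
/// distinction between an unassigned (null) and an empty string.
dstring toUTF32Keep(string s)
{
    if (s is null)
        return null;        // unassigned in, unassigned out
    if (s.length == 0)
        return ""d;         // empty in, non-null empty out
    return toUTF32(s);      // everything else: normal conversion
}

void main()
{
    string unset = null;

    assert(toUTF32Keep(unset) is null);
    assert(toUTF32Keep("") !is null);
    assert(toUTF32Keep("").length == 0);
    assert(toUTF32Keep("123") == "123"d);
}
```

The same pattern would apply to toUTF8 and toUTF16; the asserts in `main` could serve as a starting point for the unittest asked about above.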

--anders

Feb 11 2005

"Regan Heath" <regan netwin.co.nz> writes:

On Fri, 11 Feb 2005 17:54:45 +0100, Anders F Björklund <afb algonet.se>  
wrote:
 Derek Parnell wrote:

 There is *no* difference in D, between null and the empty string.
 There is a difference, internally, but D treats them the same. Which  
 is  probably what you meant, but I'm just being thorough. :)


 More or less, yes. But that's more of an Implementation Quirk™.

Which worries me because I believe there is a real need to tell them apart.

So, I ask that this behaviour be specified, or another method to achieve  
the same thing be specified.

 The D specification explicitly says:

 http://www.digitalmars.com/d/arrays.html
 Array Initialization
      * Dynamic arrays are initialized to having 0 elements.

 http://www.digitalmars.com/d/cppstrings.html
 Checking For Empty Strings

  In D, an empty string is just null:
  	char[] str;
 	if (!str)
 		// string is empty

 But in practice, they do differ - in the ptr to the '\0' (for C).
 (but both have a length property of 0, though, as mentioned earlier)

Sure, exactly what I said.

 And when you copy the char[], this ptr setting follows as well...
 This means that there is a way to trace if it has been set to "".

Yep, I want this behaviour to be specified. (or some other method to  
achieve what I want)

 A null string has ptr == null, an empty string has ptr == "".

 In some instances it is crucial to be able to tell these cases apart:
  1- value does not exist (null)
  2- value is blank       (empty string)

  Exactly! Well said.

 But strings in D are not objects or pointers, they are arrays...

And arrays appear to be value types containing a 'reference'. As in,  
arrays themselves cannot be null, but the reference in them can be.

 And arrays are initialized to have the length zero, in the spec.
 Thus, that makes them similar to e.g. an integer that is initialized
 with a zero ?

I agree arrays are value types, as integers are.

For a null string, the length is initialised to 0.

For a "" string the length is initialised to the length of "", which  
happens to be 0.

For a "abc" string the length is initialised to the length of "abc", which  
happens to be 3.

 You will have to check if they are modified in some
 other way. Or just rely on the "string.ptr" value, since that will
 work as long as D supports calling C functions with string literals...

In C strings are pointers, and pointers can be null or point to a piece of  
memory which may contain a \0, so, in C there is a way to tell the 2 cases  
apart.

In D arrays are value types containing a pointer/reference and a length.

I firmly believe that losing this ability for char[] would become a
weakness in D, it would force me and others to resort to other methods to  
achieve it.

I like the current behaviour, I just want to see it doesn't change.

 But technically, there is no difference in D between "" and null.
 Which is probably why the standard library mixes them freely ?

 To recap:

 ""
      .length = 0
      .ptr = &'\0'

 null
      .length = 0
      .ptr = null

Yep, like I said.

 void main()
 {
   char[] emptystr = "";
   char[] nullstr = null;
    assert(emptystr == nullstr);
   assert(!(emptystr is nullstr));
    assert(emptystr.length == nullstr.length);
   assert(!(emptystr.ptr is nullstr.ptr));
 }

 And the D standard library should probably be "fixed" to return
 null for null and "" for "" anyway, even if it's not in the spec ?

Definitely. I've been saying null and "" can mean different things
depending on the context, you seem to be agreeing, why are we arguing? :)

 Care to write a full unittest for it ? (at least for all of std.utf)

First we have to decide (on a per function basis) whether returning null
or "" makes sense, or if indeed both make sense (for different reasons of
course) i.e.

null == failed, cannot convert, malformed?
""   == success, result really is ""

Regan

Feb 13 2005
