www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 1654] New: Array concatenation should result in mutable or invariant depending on usage

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654

           Summary: Array concatenation should result in mutable or
                    invariant depending on usage
           Product: D
           Version: 2.007
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: DMD
        AssignedTo: bugzilla digitalmars.com
        ReportedBy: schveiguy yahoo.com


One of the great aspects of D is the array concatenation operator.  However,
the current compiler sets the type of the result according to the operands. 
This prevents using invariant or const arrays to build a mutable array without
doing extra work.  For example, the toStringz function:

char *toStringz(const(char)[] s)
{
  copy = new char[s.length + 1];
  copy[0..s.length] = s;
  copy[s.length] = 0;
  return copy.ptr;
}

This could easily be written as:
return (s ~ "\0").ptr;

Except that the result of the concatenation is a const(char)[].

But why does it need to be const?  I'm guessing that the reason is because when
dealing with functions that take invariant strings, it would be ugly to always
have to cast to invariant when doing concatenation.  And of course, functions
that use invariant strings can be optimized differently than mutable or even
const strings.

The issue I see is that the true result *IS* mutable, because it is generated
from the heap.  It's artificially cast to invariant to keep the type the same.

So here is a proposal that would allow efficiency, and utilization of the fact
that concatenation always results in newly allocated data:

First, there should be two array concatenation operator internal functions. 
One that takes two invariant arrays and one that takes two const arrays.  I'll
call them icat and ccat respectively.  Both will return mutable arrays.  The
icat function can be pure when pure functions are supported.

When the compiler encounters an array concatenation in code, if at least one
argument is not invariant, then ccat will be used.  If both arguments are
invariant, then icat will be used.

Regardless of the method used, if the resulting rvalue is expected to be
invariant, then an implicit invariant cast is allowed.  This means that the
following code does not need invariant casts:

char[] blah = "hello".dup;
string blah2 = blah ~ "world";

If the result is assigned to an lvalue that is either const or mutable, then no
cast is needed (mutable array is implicitly castable to const).

This would allow us to use the full potential of array concatenation without
having to worry about casting away invariant or writing several lines of code
to get around this problem.

I'll further point out that the workaround for the current problem does not
allow efficient concatenation.  For example:

const(char)[] a1, a2;

// method 1, just dup it.
// the problem is that we make a needless temporary copy of the data
char[] cat1 = (a1 ~ a2).dup;

// method 2, build a temporary array
// the problem is that the initialization of the array is needless.
char[] cat2 = new char[a1.length + a2.length]; // needless init of memory
cat2[0..a1.length] = a1;
cat2[a1.length..$] = a2;

In addition, idup is not needed for creating an invariant array out of the
concatenation of two mutable or const arrays.


-- 
Nov 09 2007
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654


andrei metalanguage.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED




------- Comment #1 from andrei metalanguage.com  2008-01-31 14:14 -------
We are well aware of the limitation, and are considering a number of approaches
to it, some along the lines suggested by this enhancement request. It is a hard
problem for multiple reasons. One of them is that the solution suggested in the
enhancement request is unsound for a number of types, including arrays of
arrays. This is because concatenating two arrays of arrays is not "deep" - it
will create a new outer array but it will not duplicate each of the contained
array. Consider:

    int[][] array1 = new int[][10];
    invariant(int[]) x = ([ 1, 2, 3 ]).idup;
    invariant(int[])[] array2 = [ x ];
    int[][] array3 = array1 ~ array2;
    array3[10][0] = 5; // unsound, changes invariant array

This illustrates how code with the proposed semantics can violate immutability. 

To recap today's semantics, consider:

import std.stdio;

struct S
{
    int * p;
}

void main()
{
    test!(int);
    test!(int[]);
    test!(S);
}

void test(T)()
{
    auto m = new T[10];
    auto c = new const(T)[15];
    auto i = new invariant(T)[20];
    auto mm = m ~ m;
    auto mc = m ~ c;
    auto mi = m ~ i;
    auto cc = c ~ c;
    auto ci = c ~ i;
    auto ii = i ~ i;
    writeln("mm: ", typeof(mm).stringof);
    writeln("mc: ", typeof(mc).stringof);
    writeln("mi: ", typeof(mi).stringof);
    writeln("cc: ", typeof(cc).stringof);
    writeln("ci: ", typeof(ci).stringof);
    writeln("ii: ", typeof(ii).stringof);
}

This program prints:

mm: int[]
mc: int[]
mi: int[]
cc: const(int)[]
ci: int[]
ii: invariant(int)[]
mm: int[][]
mc: const(int)[][]
mi: const(int)[][]
cc: const(int[])[]
ci: const(int)[][]
ii: invariant(int[])[]
mm: S[]
mc: const(S)[]
mi: const(S)[]
cc: const(S)[]
ci: const(S)[]
ii: invariant(S)[]

which is quite inconsistent (handles int, int[], and S all differently) and
worthy a bug report.

One direction explored to address this problem is allowing definition of
polysemous types - types that can exist only as rvalues and allow implicit
conversion to a number of other types. Within that framework it should be
possible to design a parameterized type called Unique!(T) that allows
conversions to invariant(T) and T alike. Then a function such as listdir could
return Unique!(char[]) which can be converted, to its client's desire, to
either a mutable or immutable array.

Unique!(T) would remain unsound because there is no statically checkable way to
initialize it properly, so it needs to be used with care. Basically the issue
of typing "fresh" data reveals a hole in our type system that would be hard to
fix without complicating it to an extent we considered unacceptable. To
estimate the work involved for both compiler designers and users, I suggest
consulting the paper archjava.fluid.cs.cmu.edu/papers/oopsla02.pdf (look for
"unique).


Andrei


-- 
Jan 31 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654





------- Comment #2 from schveiguy yahoo.com  2008-01-31 14:30 -------
Thank you for looking at this issue.  You are correct in your statement that my
proposed solution does not solve your specific example.  There could be other
examples, such as if you had a struct S that contained a pointer, then
concatenating an invariant(S)[] into a mutable S[] would result in the pointer
violating the immutability.

However, this could be alleviated if we had the following rule:

if X is a data type, and v1 is of the type invariant(X) and v2 is of the type
X, it is allowed to copy v1 to v2 provided that the X data type is not a
pointer or a reference or contains any pointers or references.

I think this rule is sound and really should be an enhancement on its own, but
it provides a way that my solution could work.  With this rule, my example
passes the rule since copying an invariant(char) to a char is allowed, but your
example fails because copying an invariant(int[]) to a int[] fails.

What do you think?


-- 
Jan 31 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654





------- Comment #3 from andrei metalanguage.com  2008-01-31 15:45 -------
The rule you mention was suggested and rejected. It is sound, but the problem
with it is that the semantics of a type becomes implicitly dependent on an
implementation detail (e.g. a private pointer member). We don't want to
separate "types with at least a pointer in'em" from "types with no pointers
in'em". 


-- 
Jan 31 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654





------- Comment #4 from schveiguy yahoo.com  2008-01-31 16:32 -------
Forgive me for pushing this point, but it seems that D already has types that
are inherently considered differently because they have pointers.  i.e.
classes:

X *x = new X;

If X is a struct or int, this works.  If it is a class, it fails.

Also, how is treating value types different from pointer-containing types a
problem?  I don't see how it is detrimental to the language.

Thanks for hearing my opinions on this, I'm sure you've discussed this type of
thing ad-nauseam.


-- 
Jan 31 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654





------- Comment #5 from andrei metalanguage.com  2008-01-31 17:26 -------
It's great to push the point. It is true that classes are reference types and
as such are distinct from structs and builtins, which are value types. But
that's a decision that is taken up front by the type designer: do I want
dynamic polymorphism, inheritance, infinite lifetime (garbage collection),
reference semantics? Then I write:

class Foo { ... }

Conversely: do I want eager copy, value semantics, deterministic termination
(that's to come in D 2.0 in a couple of weeks)? Then I write:

struct Bar { ... }

This does introduce an inconsistency in the type system, but it tends to work
well in practice and is documented upfront in the definition of the type. In
contrast, agreeing to further part types depending on whether they contain
pointers or not is murkier.

That is not even the main problem. At the risk of spreading FUD, Real Soon Now,
not all types containing pointers will be non-values. In a couple of weeks
we'll be able to write (syntax details may change):

struct Value {
  int * p = null;
  this() { p = new int; }
  ~this() { delete p; }
  this(scope) { p = new int(*p); }
}

This would effectively implement a structure containing a pointer but with 100%
value semantics (the third function is automatically invoked whenever an object
is copied into a new scope, right after the bitwise copy).

Note that the assignment operator will do the right thing, which is: (1)
duplicate the source into a temporary by bitwise copying it and then invoking
this(scope), (2) destroy the assignment target, (3) bitwise copy the temporary
into the assignment target.


Andrei


-- 
Jan 31 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654





------- Comment #6 from schveiguy yahoo.com  2008-01-31 17:49 -------
I still believe the rule works.

The general rule I am extending is, for any basic type(int, char, byte, etc.),
you can copy an invariant version to a mutable version without issue.  i.e.:

invariant(int) x = 1;
int y = x; // works (I think :)

If we imagine that any time you copy a structure or a buffer, instead of doing
a memcpy (which is an implementation detail), you do a copy of each
member/element.

so:

char[5] x;
x[] = "hello";

is equivalent to:

foreach(i, c; "hello")
   x[i] = c;

Which should work because of the first rule (I can copy an invariant(char) to a
char).

Now, for a struct, you can postulate the same thing:

struct S
{
  int x;
}

invariant(S) s1 = cast(invariant)S(1);
S s2;

s2 = s1;

is equivalent to:

s2.x = s1.x;

Which should work because of the first rule.

However, if we think about pointers:

invariant(char)* cp = "hello".ptr;
char * cp2;
cp2 = cp;

this fails because now you have a mutable pointer to invariant data.  Here is
the rule I am trying to exploit.  Because a type has a pointer in it:

struct S
{
  char *cp;
}

invariant(S) s1;
S s2;
s2 = s1;

This is equivalent to:
s2.cp = s1.cp;

which fails the rule above, so the whole thing fails.

With your new syntax, let's look at how it works:

struct Value {
  int * p = null;
  this() { p = new int; }
  ~this() { delete p; }
  this(scope) { p = new int(*p); }
}

invariant(Value) v1;
Value v2;

v2 = v1;
Now, this is possible.  Because instead of using p, we are using *p, which is
of type invariant(int).  invariant(int) can be casted to int implicitly, and so
the rule still works.

I'm not sure if this was intended as a use for your new syntax, but I certainly
still believe that it doesn't break the rules (BTW, I like the idea).

Addressing your point about parting types if they contain pointers are not, I
don't really think of it that way.  What I think of it as is that an invariant
pointer cannot be implicitly copied to a mutable pointer.  If by copying a
type, you have to copy an invariant pointer to a mutable pointer, then the
compile fails.  If you do not have to copy an invariant pointer to a mutable
pointer, then the copy works.  It's just easier to explain as "types that
contain invariant pointers cannot be copied to a non-invariant version of the
type".  The compiler can check this without adding extra runtime information,
so implementation-wise it does not affect the output.  Semantic-wise, it
doesn't break existing code, it just allows operations that should be valid but
aren't under the current scheme.  In fact, I think it's more consistent with
the way builtin types work.

BTW, the same applies to const.

-Steve


-- 
Jan 31 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654


andrei metalanguage.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|enhancement                 |major




------- Comment #7 from andrei metalanguage.com  2008-01-31 19:02 -------
That's a very interesting idea. Among other things, you show how our copy
construction scheme is incomplete because it "loses" the qualifiers of the
source, information that turns out to be essential. 

I'll make sure I'll bring this up to Walter, and to give you proper credit if
this idea makes it into the language. Thanks!


-- 
Jan 31 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654





------- Comment #8 from schveiguy yahoo.com  2008-01-31 21:09 -------
*blushes* :)  Thanks for considering the idea!


-- 
Jan 31 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654


htvennik zonnet.nl changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |htvennik zonnet.nl




------- Comment #9 from htvennik zonnet.nl  2008-03-26 15:05 -------
I read all comments on this enhancement request yesterday, and now I was
thinking about it again and I suddenly realized that the solution is very
simple in fact. D should be extended with a new implicit type qualifier
'unique'. By 'implicit' I mean that such a qualifier should NOT appear anywhere
in the syntax, just like 'mutable' does not appear anywhere in the syntax, but
still exists conceptually. The unique qualifier should be automatically applied
by the compiler to the result type of any expression returning a reference to
newly constructed (or copied) data. Anything of a type qualified as unique
could be implicitly cast to any of mutable, invariant and const.

So if the array concatenation operator is modifier to always return a unique
reference, the following would all be perfectly valid:

char[] mutableString = "foo" ~ "bar";
invariant(char[]) invariantString = "foo" ~ "bar";
invariant(char)[] reassignableInvariantString = "foo" ~ "bar";

Of course a NewExpression would also return a unique reference:

struct S {
    int i;
}

invariant(S) s = new S(4);   // valid


-- 
Mar 26 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654





------- Comment #10 from htvennik zonnet.nl  2008-03-26 15:11 -------
The following is of course wrong in my previous comment:
 
 invariant(S) s = new S(4);   // valid
 

It should be: invariant(S)* s = new S(4); // valid --
Mar 26 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 26/03/2008, d-bugmail puremagic.com <d-bugmail puremagic.com> wrote:
  Of course a NewExpression would also return a unique reference:

That doesn't follow. Counterexample follows: char[1000] globalBuffer; class S { char[] buffer; this(int n) { buffer = globalBuffer[0..n]; } } invariant(S) s = new S(4); // not valid This is harder than you think! :-)
Mar 26 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654





------- Comment #12 from htvennik zonnet.nl  2008-03-28 14:07 -------
Yes, you are right here... The rule doesn't apply to NewExpressions if they
produce a class instance that may contain a pointer to mutable data. So best
would be to not apply it to class instances at all (to keep the rule simple).

But it still holds for concatenating arrays of value types and for
NewExpressions that produce a struct instance.


-- 
Mar 28 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654





------- Comment #13 from brunodomedeiros+bugz gmail.com  2008-05-01 12:34
-------
(In reply to comment #2)
 However, this could be alleviated if we had the following rule:
 if X is a data type, and v1 is of the type invariant(X) and v2 is of the type
 X, it is allowed to copy v1 to v2 provided that the X data type is not a
 pointer or a reference or contains any pointers or references.
 I think this rule is sound and really should be an enhancement on its own, but
 it provides a way that my solution could work.  With this rule, my example
 passes the rule since copying an invariant(char) to a char is allowed, but your
 example fails because copying an invariant(int[]) to a int[] fails.
 What do you think?

"copying an invariant(int[]) to a int[]" fails indeed, but copying invariant(int[]) to an invariant(int)[] would work as well. So basicly what we are looking into here is tailconst. The concatenation of a const(T)[] with another const(T)[] conceptually returns a type that is: tailconst(T)[] and similarly for invariant/tailinvariant. This is because the "head-value" of T is copied, so it can be mutable. So these semantics are safe: int[] ~ int[] --> returns int[] const(int)[] ~ const(int)[] --> returns int[] int[][] ~ int[][] --> returns int[][] const(int[])[] ~ const(int[])[] --> returns const(int)[][] const(int)[][] ~ const(int)[][] --> returns const(int)[][] How about this though? : const(Foo)[] ~ const(Foo)[] --> returns ??? Unfortunately there is no way in D to express tailconst for classes (and structs), so the last example would have to use normal const(T) instead of tailconst(T). I'm not sure what the implications of this special, and inconsistent case are tough, it could break generic code or something. Which does not mean these semantics would not be worthwhile. --
May 01 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654





------- Comment #14 from schveiguy yahoo.com  2008-05-01 13:48 -------
(In reply to comment #13)
 "copying an invariant(int[]) to a int[]" fails indeed, but copying
 invariant(int[]) to an invariant(int)[] would work as well.
 So basicly what we are looking into here is tailconst. The concatenation of a
 const(T)[] with another const(T)[] conceptually returns a type that is:
 
   tailconst(T)[]

Not exactly. The enhancement request has kind of mutated, due to the rule that I stipulated (which I think was subsequently added to D 2). What I think it should be is const(T)[] ~ const(T)[] should return a type that is T[] if const(T) implicitly casts to T under the rule. If T does not implicitly cast, it should return const(T)[]. The whole idea behind this is that ~ is always generating new data. Why should the result always be based on the input types? At the very least, The pieces of the new array that are unique should be unique, and implicitly cast to mutable. This problem is not an easy one to solve, I realize, but it is solvable.
 
 and similarly for invariant/tailinvariant. This is because the "head-value" of
 T is copied, so it can be mutable. So these semantics are safe:
 
 int[] ~ int[]                           --> returns int[]
 const(int)[] ~ const(int)[]             --> returns int[]
 int[][] ~ int[][]                       --> returns int[][]
 const(int[])[] ~ const(int[])[]         --> returns const(int)[][]
 const(int)[][] ~ const(int)[][]         --> returns const(int)[][]
 
 How about this though? :
 
 const(Foo)[] ~ const(Foo)[]             --> returns ???

If Foo is a struct with no pointers, then it returns Foo[]. If Foo is a class, a pointer, or a struct with pointers, then it returns const(Foo)[].
 Unfortunately there is no way in D to express tailconst for classes (and
 structs), so the last example would have to use normal const(T) instead of
 tailconst(T). I'm not sure what the implications of this special, and
 inconsistent case are tough, it could break generic code or something. Which
 does not mean these semantics would not be worthwhile.

Yes these cases would have to be handled delicately. There are other things to consider. For example, if Foo is an alias to int *, then if you returned a tailconst version of Foo what is that? It would really be const(int)*, but how is that expressed in terms of Foo? What if Foo is a typedef to int *? There are really two problems that need to be solved to allow this enhancement. The first I believe is already solved, and that is the ability to implicitly cast to/from const/invariant/mutable if the type contains no pointers. The second is to allow the return type of a function to depend on how it is used and/or called. I'm not talking about overloading functions based on return type, because that isn't enough in itself. Why should I have to write multiple functions that act the same way, just so the return type is different? I keep thinking that a 'unique' concept needs to be added before this whole solution is possible. Anyways, this issue really doesn't look like it will be solved in the near future, without good answers for all the sub-issues involved. --
May 01 2008
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654





------- Comment #15 from brunodomedeiros+bugz gmail.com  2008-05-28 16:46
-------
(In reply to comment #14)
 (In reply to comment #13)
 "copying an invariant(int[]) to a int[]" fails indeed, but copying
 invariant(int[]) to an invariant(int)[] would work as well.
 So basicly what we are looking into here is tailconst. The concatenation of a
 const(T)[] with another const(T)[] conceptually returns a type that is:
 
   tailconst(T)[]

I stipulated (which I think was subsequently added to D 2). What I think it should be is const(T)[] ~ const(T)[] should return a type that is T[] if const(T) implicitly casts to T under the rule. If T does not implicitly cast, it should return const(T)[].

I understand your rule, but it is not the most permissive safe-behavior. What I was describing was the most permissive safe-behavior possible. For example, with your rule: const(int[])[] ~ const(int[])[] would return const(int[])[] . But it is safe to return const(int)[][] , which is more permissive. --
May 28 2008
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=1654





------- Comment #16 from schveiguy yahoo.com  2008-05-28 17:36 -------
Yes, you are correct.  The second form is more permissive.  I didn't see that
originally.

Concatenating arrays of arrays and arrays of pointers are special cases because
a tail-const version exists.

I think as long as the implicit casting works, so:

const(int[])[] a;
const(int)[][] b = a ~ a;
const(int[])[] c = a ~ a;

all works, then it shouldn't be a problem.  It probably will not break generic
code because if you have a generic function:

f(T)(const(T)[] input) {...}

How does one construct a 'tailconst' verison of T given just T?  I think that
was what I was saying earlier (about a typedef).  You could do an is-expression
to get at the underlying type, but then that is specialized for arrays, and it
is no more convoluted than the current state of affairs.

If you know T is always going to be an array, you can do:

f(T)(const(T[])[] input) {...}

and you can universally deal with all types of arrays.

So now, I think the rules are:

If concatenating two arrays together, of type T[]:

If T is of the form const(U[]), then one can assign the result to a type of
const(U)[][]

If T is of the form const(U*), then one can assign the result to a type of
const(U)*[]

If T is of the form const(U), and const(U) implicitly casts to U, then one can
assign the result to a type of U[].

For the above rules, if T is invariant(U) the same style rules apply.

In all cases, one can assign to the result type of T[].

If concatenating two arrays together where the element type of the array varies
only in constancy, and const(T) and invariant(T) implicitly cast to/from T,
then the result can be assigned to invariant(T)[], const(T)[] or T[].

Did I miss anything?  Boy this is getting more complex to explain :)


-- 
May 28 2008