www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Cloning in D

reply dsimcha <dsimcha yahoo.com> writes:
I've started playing around with Orange a little to see whether it would meet
D's cloning needs.  IMHO one must-have feature for proper cloning that truly
"just works" is full aliasing preservation.  For example, the following code
modified slightly from the Orange example doesn't work properly:

import orange._; // import the whole library

class A
{
    int[] arr1;
    int[] arr2;

    equals_t opEquals (Object other)
    {
        if (auto a = cast(A) other)
            return a.arr1 == this.arr1 && a.arr2 == this.arr2;

        return false;
    }
}

void main ()
{
    auto a = new A; // create something to serialize
    a.arr1 = [1,2,3,4,5];
    a.arr2 = a.arr1[1..$];

    auto serializer = new Serializer!(XMLArchive!());
    auto data = serializer.serialize(a);

    println(data);

    auto a2 = serializer.deserialize!(A)(data);
    assert(a == a2);

    a2.arr2[0] = 0;
    println(a2.arr1);  // [1,2,3,4,5]
}

Note that Orange gets this right for class references that point to the same
object, but not for arrays that overlap.

A few questions:

1.  Are most serialization libraries in other languages capable of getting
this right?

2.  Do others agree that full aliasing preservation is essential with regard
to array slices?

3.  Jacob, do you think you could fix Orange to support this properly?
Sep 05 2010
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2010-09-06 05:15, dsimcha wrote:
 I've started playing around with Orange a little to see whether it would meet
 D's cloning needs.  IMHO one must-have feature for proper cloning that truly
 "just works" is full aliasing preservation.  For example, the following code
 modified slightly from the Orange example doesn't work properly:

 import orange._; // import the whole library

 class A
 {
      int[] arr1;
      int[] arr2;

      equals_t opEquals (Object other)
      {
          if (auto a = cast(A) other)
              return a.arr1 == this.arr1&&  a.arr2 == this.arr2;

          return false;
      }
 }

 void main ()
 {
      auto a = new A; // create something to serialize
      a.arr1 = [1,2,3,4,5];
      a.arr2 = a.arr1[1..$];

      auto serializer = new Serializer!(XMLArchive!());
      auto data = serializer.serialize(a);

      println(data);

      auto a2 = serializer.deserialize!(A)(data);
      assert(a == a2);

      a2.arr2[0] = 0;
      println(a2.arr1);  // [1,2,3,4,5]
 }

 Note that Orange gets this right for class references that point to the same
 object, but not for arrays that overlap.

 A few questions:

 1.  Are most serialization libraries in other languages capable of getting
 this right?

 2.  Do others agree that full aliasing preservation is essential with regard
 to array slices?

 3.  Jacob, do you think you could fix Orange to support this properly?

I can have a look and see what I can do. Could you please report a ticket so I don't forget. -- /Jacob Carlborg
Sep 06 2010
parent reply Jacob Carlborg <doob me.com> writes:
On 2010-09-06 10:44, Jacob Carlborg wrote:
 On 2010-09-06 05:15, dsimcha wrote:
 I've started playing around with Orange a little to see whether it
 would meet
 D's cloning needs. IMHO one must-have feature for proper cloning that
 truly
 "just works" is full aliasing preservation. For example, the following
 code
 modified slightly from the Orange example doesn't work properly:

 import orange._; // import the whole library

 class A
 {
 int[] arr1;
 int[] arr2;

 equals_t opEquals (Object other)
 {
 if (auto a = cast(A) other)
 return a.arr1 == this.arr1&& a.arr2 == this.arr2;

 return false;
 }
 }

 void main ()
 {
 auto a = new A; // create something to serialize
 a.arr1 = [1,2,3,4,5];
 a.arr2 = a.arr1[1..$];

 auto serializer = new Serializer!(XMLArchive!());
 auto data = serializer.serialize(a);

 println(data);

 auto a2 = serializer.deserialize!(A)(data);
 assert(a == a2);

 a2.arr2[0] = 0;
 println(a2.arr1); // [1,2,3,4,5]
 }

 Note that Orange gets this right for class references that point to
 the same
 object, but not for arrays that overlap.

 A few questions:

 1. Are most serialization libraries in other languages capable of getting
 this right?

 2. Do others agree that full aliasing preservation is essential with
 regard
 to array slices?

 3. Jacob, do you think you could fix Orange to support this properly?

I can have a look and see what I can do. Could you please report a ticket so I don't forget.

I've looked at this problem now but I don't know I can detect the aliasing. Suggestions ? -- /Jacob Carlborg
Sep 06 2010
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Jacob Carlborg (doob me.com)'s article
 I've looked at this problem now but I don't know I can detect the
 aliasing. Suggestions ?

Well, here's what I was thinking of doing when I was thinking of rolling my own: Traverse the object graph once. Create some data structure of all pointer indirection, including both the start address and the length. Then, sort this by starting address and find all overlapping regions. Whenever you find an overlapping region, duplicate/serialize it as a single block and make everything a slice into it.
Sep 06 2010
parent Jacob Carlborg <doob me.com> writes:
On 2010-09-06 16:01, dsimcha wrote:
 == Quote from Jacob Carlborg (doob me.com)'s article
 I've looked at this problem now but I don't know I can detect the
 aliasing. Suggestions ?

Well, here's what I was thinking of doing when I was thinking of rolling my own: Traverse the object graph once. Create some data structure of all pointer indirection, including both the start address and the length. Then, sort this by starting address and find all overlapping regions. Whenever you find an overlapping region, duplicate/serialize it as a single block and make everything a slice into it.

I will give this a try. -- /Jacob Carlborg
Sep 06 2010
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 09/05/2010 10:15 PM, dsimcha wrote:
 I've started playing around with Orange a little to see whether it would meet
 D's cloning needs.  IMHO one must-have feature for proper cloning that truly
 "just works" is full aliasing preservation.

What do you mean by aliasing preservation? I'd think you want no aliasing between the clone and the original. (Perhaps you mean that the object graph of the clone and original look the same?) First off, let's see what cloning methods we have: 1. We know how to clone built-in types except pointers. (Arrays are cloned by duplicating the array proper and cloning each element.) 2. Structs can be cloned with either this(this) or by cloning each of their fields. 3. For classes we can define a conventional method/interface, e.g. clone/Cloneable. Now, given some arbitrary value v, we know how to obtain a clone of it by applying the rules above. The problem is, there are two opaque calls: this(this) for structs and obj.clone() for classes. If improperly implemented, the whole thing comes unglued. The problem is the standard library (and others) must rely on correct deep cloning without being able to check it. Although it's difficult to implement deep cloning, it's relatively easy to _verify_ it. So I'm thinking, a generic deepdup() routine would apply cloning per the rules above and then would verify that opaque calls obj.clone() and this(this) did not produce any aliasing. Bottom line, this clarifies we need dynamic field information for classes. Otherwise we wouldn't be able to "see" through polymorphism. Andrei
Sep 06 2010
next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 On 09/05/2010 10:15 PM, dsimcha wrote:
 I've started playing around with Orange a little to see whether it would meet
 D's cloning needs.  IMHO one must-have feature for proper cloning that truly
 "just works" is full aliasing preservation.

aliasing between the clone and the original. (Perhaps you mean that the object graph of the clone and original look the same?)

Right. This means if you had, for example, two slices that point to overlapping regions of memory anywhere in your original object graph, they'd overlap instead of pointing to distinct regions in your cloned object graph.
 First off, let's see what cloning methods we have:
 1. We know how to clone built-in types except pointers. (Arrays are
 cloned by duplicating the array proper and cloning each element.)

I think we could handle even pointers in some limited cases. If it's allocated on the GC heap (so we can easily find the bounds of the memory region) and the type pointed to doesn't have any additional indirection (so we don't need to follow possibly invalid pointers), then we just duplicate the entire GC memory block. For pointers to non-primitives, we could arbitrarily assume that the pointer points to exactly one instance, and is being used simply as a reference.
 2. Structs can be cloned with either this(this) or by cloning each of
 their fields.

Yeah, structs with postblits are a huge, hairy PITA. If the struct is using reference counting, cloning the entire object graph reachable from the struct will yield an incorrect reference count field. I think that for structs we should always assume that transitively duplicating the entire object graph field by field is a safe way to clone the struct, unless it comes with its own clone() method. this(this) can be used for too many other things, like the aforementioned ref counting.
 3. For classes we can define a conventional method/interface, e.g.
 clone/Cloneable.

Right, and there should be a mixin to implement this for the simple cases, rather than making the programmer do it him/herself. One concern w/ opaque clone() calls is establishing a protocol by which aliasing (array slice overlapping, etc.) information can be shared between the clone() function and the main deepdup() function, as well as among different clone functions.
Sep 06 2010
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2010-09-06 16:12, Andrei Alexandrescu wrote:
 On 09/05/2010 10:15 PM, dsimcha wrote:
 I've started playing around with Orange a little to see whether it
 would meet
 D's cloning needs. IMHO one must-have feature for proper cloning that
 truly
 "just works" is full aliasing preservation.

What do you mean by aliasing preservation? I'd think you want no aliasing between the clone and the original. (Perhaps you mean that the object graph of the clone and original look the same?) First off, let's see what cloning methods we have: 1. We know how to clone built-in types except pointers. (Arrays are cloned by duplicating the array proper and cloning each element.) 2. Structs can be cloned with either this(this) or by cloning each of their fields. 3. For classes we can define a conventional method/interface, e.g. clone/Cloneable. Now, given some arbitrary value v, we know how to obtain a clone of it by applying the rules above. The problem is, there are two opaque calls: this(this) for structs and obj.clone() for classes. If improperly implemented, the whole thing comes unglued. The problem is the standard library (and others) must rely on correct deep cloning without being able to check it. Although it's difficult to implement deep cloning, it's relatively easy to _verify_ it. So I'm thinking, a generic deepdup() routine would apply cloning per the rules above and then would verify that opaque calls obj.clone() and this(this) did not produce any aliasing. Bottom line, this clarifies we need dynamic field information for classes. Otherwise we wouldn't be able to "see" through polymorphism. Andrei

That would be getMembers in ClassInfo, but it doesn't currently work: issue 2844. It also returns an empty array, I think, can't find the ticket for that though. We also need to be able to set and get the value of a field, don't know if this would be possible by using the offset field in the MemberInfo_field class. This is also needed for a completely transparent serialization implementation (i.e. the user doesn't have to implement or register any methods/functions). -- /Jacob Carlborg
Sep 06 2010
prev sibling parent reply BLS <windevguy hotmail.de> writes:
On 06/09/2010 05:15, dsimcha wrote:
 I've started playing around with Orange a little to see whether it would meet
 D's cloning needs.

Clone respective member-wise clone ( I prefer copy and deep copy) should be part of object. no std._tricks period.
Sep 06 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 09/06/2010 02:56 PM, BLS wrote:
 On 06/09/2010 05:15, dsimcha wrote:
 I've started playing around with Orange a little to see whether it
 would meet
 D's cloning needs.

Clone respective member-wise clone ( I prefer copy and deep copy) should be part of object. no std._tricks period.

Though this might be the case in this particular instance, ideally a generally useful operation should be made available generically instead of encapsulated. Andrei
Sep 06 2010
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2010-09-06 17:00:26 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 09/06/2010 02:56 PM, BLS wrote:
 On 06/09/2010 05:15, dsimcha wrote:
 I've started playing around with Orange a little to see whether it
 would meet
 D's cloning needs.

Clone respective member-wise clone ( I prefer copy and deep copy) should be part of object. no std._tricks period.

Though this might be the case in this particular instance, ideally a generally useful operation should be made available generically instead of encapsulated.

I'm under the impression that a too permissive generic implementation of cloning is going to break things in various scenarios. What if your object or structure is part of a huge hierarchy where things contains pointers to their parent (and indirectly to the whole hierarchy), will the whole hierarchy be cloned? What happens if your object or structure maintains a reference to a singleton, will we get two instances of a singleton? What if your object or structure implements the observer pattern and contains a list of objects to notify when something happens, will all the observers be cloned as well? My understanding is that a data structure containing a pointer cannot be cloned safely unless it contains some specific code to perform the cloning. That's because the type system can't tell you which pointers point to things owned by the struct/class and which one need to be discarded when cloning (such as a list of observers, or the parents of a hierarchy). As for pointers to immutable things, the clone can keep pointing to the same immutable data without affecting semantics, so there's no problem there. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Sep 06 2010
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Michel Fortin (michel.fortin michelf.com)'s article
 I'm under the impression that a too permissive generic implementation
 of cloning is going to break things in various scenarios.

In general you raise some very good issues, but IMHO the right way to do cloning is to have permissive generic cloning that works in the 90% of cases and can be easily overridden in the 10% of cases, not to require writing tons of boilerplate in the 90% of cases just to make sure it doesn't do the wrong thing by default in the 10% of cases. A second point is that the thing that brought this whole cloning issue to my mind was making std.concurrency's message passing model less obtuse. Right now it's hard to use for non-trivial things because there's no safe way to pass complex state between threads. If we start allowing all kinds of exceptions to the "clone the **entire** object graph" rule, cloning will rapidly become useless for safely passing complex object graphs between threads.
 What if your
 object or structure is part of a huge hierarchy where things contains
 pointers to their parent (and indirectly to the whole hierarchy), will
 the whole hierarchy be cloned?

Isn't that kind of the point?
 What happens if your object or structure
 maintains a reference to a singleton, will we get two instances of a
 singleton?

Very good point. I guess the reasonable use case for holding a reference to a singleton (instead of just using the globally accessible one) would be if it's polymorphic with some other object type? If you're using message passing concurrency, most of your mutable singletons are probably thread-local, and what you probably really want to do is use the thread-local singleton of the thread you're passing to.
 What if your object or structure implements the observer
 pattern and contains a list of objects to notify when something
 happens, will all the observers be cloned as well?

Yes. There's no other way to make message passing safe.
 My understanding is that a data structure containing a pointer cannot
 be cloned safely unless it contains some specific code to perform the
 cloning. That's because the type system can't tell you which pointers
 point to things owned by the struct/class and which one need to be
 discarded when cloning (such as a list of observers, or the parents of
 a hierarchy).

This discussion is making me think we really need two kinds of cloning: Physical cloning would clone the entire object graph no matter what, such that the cloned object could be safely passed to another thread via std.concurrency and be given a unique type. Logical cloning would be more like what you describe. In general, this discussion has been incredibly useful because I had previously only considered physical cloning.
 As for pointers to immutable things, the clone can keep pointing to the
 same immutable data without affecting semantics, so there's no problem
 there.

Right. There aren't too many good reasons to clone immutable data.
Sep 06 2010
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2010-09-06 20:55:16 -0400, dsimcha <dsimcha yahoo.com> said:

 == Quote from Michel Fortin (michel.fortin michelf.com)'s article
 I'm under the impression that a too permissive generic implementation
 of cloning is going to break things in various scenarios.

In general you raise some very good issues, but IMHO the right way to do cloning is to have permissive generic cloning that works in the 90% of cases and can be easily overridden in the 10% of cases, not to require writing tons of boilerplate in the 90% of cases just to make sure it doesn't do the wrong thing by default in the 10% of cases.

To me automatic cloning of everything (physical cloning in your parlance) looks more like 50/50 work/doesn't-work ratio. I can only guess, but I'm probably used to different use cases than you are.
 A second point is that the thing that brought this whole cloning issue 
 to my mind
 was making std.concurrency's message passing model less obtuse.  Right now it's
 hard to use for non-trivial things because there's no safe way to pass complex
 state between threads.  If we start allowing all kinds of exceptions to 
 the "clone
 the **entire** object graph" rule, cloning will rapidly become useless 
 for safely
 passing complex object graphs between threads.

This I agree with. I'm not arguing against automatic cloning per-see, I'm just trying to show cases where it doesn't work well. Personally, I'm rather skeptical that we can make it safe and efficient at the same time without better support from the language, something akin the mythical "unique" type modifier representing a reference with no aliasing.
 What if your
 object or structure is part of a huge hierarchy where things contains
 pointers to their parent (and indirectly to the whole hierarchy), will
 the whole hierarchy be cloned?

Isn't that kind of the point?

Well, that depends. If you send each leaves of a tree as a message to various threads presumably to perform something concurrently with the data in that leaf, then you may want only the leaf to be copied. You may not want every parent down to the root and then up to every other leaf to be copied alongside with each message just because the leaf you send has a pointer to the parent. In fact, it depends on the situation. If what you want to do with the leaf in the other thread requires the leaf to know its parent and everything else, then sure you need to copy the whole hierarchy. But otherwise it's a horrible waste of memory and CPU to clone the whole object graph for each message, even though it won't affect the program's correctness. And it's basically the same thing with observers. If your observer is a controller in charge of updating a window when something changes, you don't want to clone the observer, then clone the window and everything in it just because you're sending some piece of data to another thread. Perhaps the program architecture is just wrong, or perhaps that observer is a synchronized class capable of handling function calls from multiple threads so it doesn't really need to be copied.
 What happens if your object or structure
 maintains a reference to a singleton, will we get two instances of a
 singleton?

Very good point. I guess the reasonable use case for holding a reference to a singleton (instead of just using the globally accessible one) would be if it's polymorphic with some other object type? If you're using message passing concurrency, most of your mutable singletons are probably thread-local, and what you probably really want to do is use the thread-local singleton of the thread you're passing to.

What intrigues me is how such a mechanism would work... although in my mind it's probably not even worth supporting at all, singletons be damned!
 My understanding is that a data structure containing a pointer cannot
 be cloned safely unless it contains some specific code to perform the
 cloning. That's because the type system can't tell you which pointers
 point to things owned by the struct/class and which one need to be
 discarded when cloning (such as a list of observers, or the parents of
 a hierarchy).

This discussion is making me think we really need two kinds of cloning: Physical cloning would clone the entire object graph no matter what, such that the cloned object could be safely passed to another thread via std.concurrency and be given a unique type. Logical cloning would be more like what you describe. In general, this discussion has been incredibly useful because I had previously only considered physical cloning.

This is an interesting and valid observation. But I think you need to leave a door open to customization of the "physical cloning" case too. The ability to avoid cloning unnecessary data is as necessary as the ability to easily copying an entire object graph. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Sep 06 2010
parent Jacob Carlborg <doob me.com> writes:
On 2010-09-07 04:16, Michel Fortin wrote:
 On 2010-09-06 20:55:16 -0400, dsimcha <dsimcha yahoo.com> said:

 == Quote from Michel Fortin (michel.fortin michelf.com)'s article
 I'm under the impression that a too permissive generic implementation
 of cloning is going to break things in various scenarios.

In general you raise some very good issues, but IMHO the right way to do cloning is to have permissive generic cloning that works in the 90% of cases and can be easily overridden in the 10% of cases, not to require writing tons of boilerplate in the 90% of cases just to make sure it doesn't do the wrong thing by default in the 10% of cases.

To me automatic cloning of everything (physical cloning in your parlance) looks more like 50/50 work/doesn't-work ratio. I can only guess, but I'm probably used to different use cases than you are.
 A second point is that the thing that brought this whole cloning issue
 to my mind
 was making std.concurrency's message passing model less obtuse. Right
 now it's
 hard to use for non-trivial things because there's no safe way to pass
 complex
 state between threads. If we start allowing all kinds of exceptions to
 the "clone
 the **entire** object graph" rule, cloning will rapidly become useless
 for safely
 passing complex object graphs between threads.

This I agree with. I'm not arguing against automatic cloning per-see, I'm just trying to show cases where it doesn't work well. Personally, I'm rather skeptical that we can make it safe and efficient at the same time without better support from the language, something akin the mythical "unique" type modifier representing a reference with no aliasing.
 What if your
 object or structure is part of a huge hierarchy where things contains
 pointers to their parent (and indirectly to the whole hierarchy), will
 the whole hierarchy be cloned?

Isn't that kind of the point?

Well, that depends. If you send each leaves of a tree as a message to various threads presumably to perform something concurrently with the data in that leaf, then you may want only the leaf to be copied. You may not want every parent down to the root and then up to every other leaf to be copied alongside with each message just because the leaf you send has a pointer to the parent. In fact, it depends on the situation. If what you want to do with the leaf in the other thread requires the leaf to know its parent and everything else, then sure you need to copy the whole hierarchy. But otherwise it's a horrible waste of memory and CPU to clone the whole object graph for each message, even though it won't affect the program's correctness. And it's basically the same thing with observers. If your observer is a controller in charge of updating a window when something changes, you don't want to clone the observer, then clone the window and everything in it just because you're sending some piece of data to another thread. Perhaps the program architecture is just wrong, or perhaps that observer is a synchronized class capable of handling function calls from multiple threads so it doesn't really need to be copied.
 What happens if your object or structure
 maintains a reference to a singleton, will we get two instances of a
 singleton?

Very good point. I guess the reasonable use case for holding a reference to a singleton (instead of just using the globally accessible one) would be if it's polymorphic with some other object type? If you're using message passing concurrency, most of your mutable singletons are probably thread-local, and what you probably really want to do is use the thread-local singleton of the thread you're passing to.

What intrigues me is how such a mechanism would work... although in my mind it's probably not even worth supporting at all, singletons be damned!
 My understanding is that a data structure containing a pointer cannot
 be cloned safely unless it contains some specific code to perform the
 cloning. That's because the type system can't tell you which pointers
 point to things owned by the struct/class and which one need to be
 discarded when cloning (such as a list of observers, or the parents of
 a hierarchy).

This discussion is making me think we really need two kinds of cloning: Physical cloning would clone the entire object graph no matter what, such that the cloned object could be safely passed to another thread via std.concurrency and be given a unique type. Logical cloning would be more like what you describe. In general, this discussion has been incredibly useful because I had previously only considered physical cloning.

This is an interesting and valid observation. But I think you need to leave a door open to customization of the "physical cloning" case too. The ability to avoid cloning unnecessary data is as necessary as the ability to easily copying an entire object graph.

I think everything would be a lot easier if we had support for cloning in the standard library. Something like a clone method in Object that as a default implementation clones everything and then you can override it to customize the cloning. There can also be various hooks (static fields or mixins, even better with proper language support) available to say that a certain field shouldn't be cloned, or a whole object. Now with cloning in the standard library other parts of the std can take advantage of these hooks and do the appropriate action. For example, the singleton could say it shouldn't be cloned and the same may go for other classes as well. -- /Jacob Carlborg
Sep 07 2010