digitalmars.D - Is mimicking a reference type with a struct reliable?

"Denis Koroskin" <2korden gmail.com> writes:

First I'd like to say that I don't really like (or rather use) Appender  
because it always allocates (at least an internal Data instance) even when  
I provide my own buffer.
I mean, why would I use Appender if it still allocates? Okay, you have to  
store a reference to an internal representation so that Appender would  
feel like a reference type. I'm not sure it's worth the trade-off, and as  
such I defined and use my own set of primitives that don't allocate when a  
buffer is provided:

void put(T)(ref T[] array, ref size_t offset, const(T) value)
{
     ensureCapacity(array, offset + 1);
     array[offset++] = value;
}

void put(T)(ref T[] array, ref size_t offset, const(T)[] value)
{
     // Same but for an array
}

void ensureCapacity(T)(ref T[] array, size_t minCapacity)
{
    // ...
}

All functions that take an optional buffer have a signature like  
this:

void foo(ubyte[] buffer = null);

Back to my original question: can we mimic reference behavior with a  
struct? I thought "why not" until I hit this bug:

import std.array;
import std.stdio;

void append(Appender!(string) a, string s)
{
	a.put(s);
}

void main()
{
	Appender!(string) a;
	string s = "test";
	
	append(a, s); // <-- 'a' is passed by value while still uninitialized
	
	writeln(a.data);	
}

I'm passing an appender by value since it's supposed to have reference  
type behavior and passing 4 bytes by reference is overkill.

However, the code above doesn't work for a simple reason: structs lack  
default ctors. As such, an appender is initialized to null internally.  
When I call append, a copy of it gets initialized (lazily), but the  
original one remains unchanged. Note that if you append to the appender at  
least once before passing it by value, it will work. But that's sad. Not  
only does it allocate when it shouldn't, I also have to initialize it  
explicitly!
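
The failure mode described above can be reproduced in a small sketch, along with the by-reference workaround. The helper names appendByValue and appendByRef are made up for illustration, and the behavior shown assumes that Appender keeps its state behind a lazily allocated pointer, as the post describes.

```d
import std.array : Appender;

// Hypothetical helper: receives a copy of the (possibly uninitialized) appender.
void appendByValue(Appender!string a, string s)
{
    a.put(s);
}

// Hypothetical helper: operates on the caller's appender directly.
void appendByRef(ref Appender!string a, string s)
{
    a.put(s);
}

void main()
{
    Appender!string a;          // internal pointer is still null

    appendByValue(a, "test");   // only the copy allocates its state
    assert(a.data.length == 0); // the original never saw the data

    appendByRef(a, "test");     // the original is initialized here
    assert(a.data == "test");

    appendByValue(a, "!");      // the copy now shares the allocated state
    assert(a.data == "test!");
}
```

Passing by ref avoids the problem at the cost of giving up the "it behaves like a reference" convenience that the struct was meant to provide.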

I think a far better solution would be to make it non-copyable.

TL;DR Mimicking reference semantics with a struct that lacks a default  
ctor is unreliable since you must initialize your object lazily. Moreover,  
you have to check whether your struct has been initialized on every single  
function call, and that's error-prone and bad for code clarity and  
performance. I'm opposed to that practice.

Oct 16 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 16 Oct 2010 11:52:29 -0400, Denis Koroskin <2korden gmail.com>  
wrote:

 First I'd like to say that I don't really like (or rather use) Appender  
 because it always allocates (at least an internal Data instance) even  
 when I provide my own buffer.
 I mean, why would I use Appender if it still allocates? Okay, you have  
 to store a reference to an internal representation so that Appender  
 would feel like a reference type.

Appender needs to be a reference type.  If it's not then copying the  
appender will stomp data.  Let's say appender is not a reference type, you  
might expect the data members to look like:

struct Appender(T)
{
    uint capacity;
    T[] data;
}

Now, if you copy an appender to another instance, it gets its *own* copy  
of capacity.  You append to a1, no problems.  You then append to a2 and it  
overwrites the data you put in a1.
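
The stomping scenario can be sketched with a toy value-type appender. ValueAppender is a hypothetical type written only for this illustration (it tracks a used-element count rather than a capacity), not the Phobos implementation.

```d
// Hypothetical value-type appender: the slice is shared after a copy,
// but each copy keeps its own count of used elements.
struct ValueAppender
{
    int[] buffer;   // copies of the struct refer to the same memory
    size_t used;    // per-copy bookkeeping, not shared

    void put(int value)
    {
        if (used == buffer.length)
            buffer.length = buffer.length ? buffer.length * 2 : 4;
        buffer[used++] = value;
    }

    int[] data()
    {
        return buffer[0 .. used];
    }
}

void main()
{
    auto a1 = ValueAppender(new int[8], 0);
    a1.put(1);

    auto a2 = a1;   // a2.used == 1, same underlying buffer

    a1.put(2);      // writes index 1
    a2.put(3);      // also writes index 1, stomping a1's element

    assert(a1.data() == [1, 3]);
}
```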

It might be possible to do an unsafe appender that uses a pointer to a  
stack variable for its implementation.  But returning such an appender  
would escape stack data.  This would however obviate the need to allocate  
extra data on the heap.

A final option is to disable the copy constructor of such an unsafe  
appender, but then you couldn't pass it around.

What do you think?  If you think it's worth having, suggest it on the  
phobos mailing list, and we'll discuss.

Note that Appender is supposed to be fast at *appending* not initializing  
itself.  In that respect, it's very fast.

  I'm not sure it's worth the trade-off, and as such I defined and use my  
 own set of primitives that don't allocate when a buffer is provided:

 void put(T)(ref T[] array, ref size_t offset, const(T) value)
 {
      ensureCapacity(array, offset + 1);
      array[offset++] = value;
 }

 void put(T)(ref T[] array, ref size_t offset, const(T)[] value)
 {
      // Same but for an array
 }

 void ensureCapacity(ref char[] array, size_t minCapacity)
 {
     // ...
 }

I'm not sure what ensureCapacity does, but if it does what I think it does  
(use the capacity property of arrays), it's probably slower than Appender,  
which has a dedicated variable for capacity.

 Back to my original question, can we mimick a reference behavior with a  
 struct? I thought why not until I hit this bug:

 import std.array;
 import std.stdio;

 void append(Appender!(string) a, string s)
 {
 	a.put(s);
 }

 void main()
 {
 	Appender!(string) a;
 	string s = "test";
 	
 	append(a, s); // <
 	
 	writeln(a.data);	
 }

 I'm passing an appender by value since it's supposed to have a reference  
 type behavior and passing 4 bytes by reference is an overkill.

 However, the code above doesn't work for a simple reason: structs lack  
 default ctors. As such, an appender is initialized to null internally,  
 when I call append a copy of it gets initialized (lazily), but the  
 original one remains unchanged. Note that if you append to appender at  
 least once before passing by value, it will work. But that's sad. Not  
 only it allocates when it shouldn't, I also have to initialize it  
 explicitly!

 I think far better solution would be to make it non-copyable.

 TL;DR Reference semantic mimicking with a struct without default ctors  
 is unreliable since you must initialize your object lazily. Moreover,  
 you have to check that you struct is not initialized yet every single  
 function call, and that's error prone and bad for code clarity and  
 performance. I'm opposed of that practice.

This is a point I've brought up before.  As of yet there is no solution.   
There have been a couple of ideas passed around, but there hasn't been  
anything decided.  The one idea I remember (but didn't really like) is to  
have the copy constructor be able to modify the original.  This makes it  
possible to allocate the underlying implementation in Appender for  
example, even on the data being passed.  There are lots of problems with  
this solution, and I don't think it got much traction.

I think the default constructor solution is probably never going to  
happen.  It's very nice to always have a default fast way to initialize  


My suggestion would be to have it be an actual reference type -- i.e. a  
class.  I don't see any issues with that.  In that respect, you could even  
have it be stack-allocated, since you have emplace.  But I don't have a  
say in that.  I was the last one to update Appender, since it had a  
bug-ridden design and needed to be fixed, but I tried to change as little  
as possible.

-Steve

Oct 16 2010

"Denis Koroskin" <2korden gmail.com> writes:

On Sat, 16 Oct 2010 20:16:40 +0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Sat, 16 Oct 2010 11:52:29 -0400, Denis Koroskin <2korden gmail.com>  
 wrote:

 First I'd like to say that I don't really like (or rather use) Appender  
 because it always allocates (at least an internal Data instance) even  
 when I provide my own buffer.
 I mean, why would I use Appender if it still allocates? Okay, you have  
 to store a reference to an internal representation so that Appender  
 would feel like a reference type.

 Appender needs to be a reference type.  If it's not then copying the  
 appender will stomp data.  Let's say appender is not a reference type,  
 you might expect the data members to look like:

 struct Appender(T)
 {
     uint capacity;
     T[] data;
 }

 Now, if you copy an appender to another instance, it gets its *own* copy  
 of capacity.  You append to a1, no problems.  You then append to a2 and  
 it overwrites the data you put in a1.

 It might be possible to do an unsafe appender that uses a pointer to a  
 stack variable for its implementation.  But returning such an appender  
 would escape stack data.  This would however obviate the need to  
 allocate extra data on the heap.

 A final option is to disable the copy constructor of such an unsafe  
 appender, but then you couldn't pass it around.

 What do you think?  If you think it's worth having, suggest it on the  
 phobos mailing list, and we'll discuss.

 Note that Appender is supposed to be fast at *appending* not  
 initializing itself.  In that respect, it's very fast.

  I'm not sure it's worth the trade-off, and as such I defined and use  
 my own set of primitives that don't allocate when a buffer is provided:

 void put(T)(ref T[] array, ref size_t offset, const(T) value)
 {
      ensureCapacity(array, offset + 1);
      array[offset++] = value;
 }

 void put(T)(ref T[] array, ref size_t offset, const(T)[] value)
 {
      // Same but for an array
 }

 void ensureCapacity(ref char[] array, size_t minCapacity)
 {
     // ...
 }

 I'm not sure what ensureCapacity does, but if it does what I think it  
 does (use the capacity property of arrays), it's probably slower than  
 Appender, which has a dedicated variable for capacity.

No, it doesn't use capacity, it uses length as a capacity instead:

void ensureCapacity(T)(ref T[] array, size_t minCapacity)
{
	size_t capacity = array.length;
	if (minCapacity < capacity) {
		return;
	}
	
	// need resize
	capacity *= 2;
	
	if (capacity < 16) {
		capacity = 16;
	}

	if (capacity < minCapacity) {
		capacity = minCapacity;
	}

	array.length = capacity;
}

The usage pattern is as follows:

dchar[] toUTF32(string s, dchar[] buffer = null)
{
	size_t size = 0;
	foreach (dchar d; s) {
		buffer.put(size, d);
	}

	return buffer[0..size];
}
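
Assembled into one file, the fragments above might look like the following sketch. It makes one small adjustment that is not in the post: the early-return test uses <= so that a buffer that is already exactly large enough is not grown. The main function is only an illustration of the two call styles.

```d
// Grow the array (doubling, minimum 16) so that it holds at least minCapacity items.
void ensureCapacity(T)(ref T[] array, size_t minCapacity)
{
    if (minCapacity <= array.length)   // assumption: <= instead of the post's <
        return;

    size_t capacity = array.length * 2;
    if (capacity < 16)
        capacity = 16;
    if (capacity < minCapacity)
        capacity = minCapacity;

    array.length = capacity;
}

// Store one value at 'offset' and advance the offset.
void put(T)(ref T[] array, ref size_t offset, const(T) value)
{
    ensureCapacity(array, offset + 1);
    array[offset++] = value;
}

dchar[] toUTF32(string s, dchar[] buffer = null)
{
    size_t size = 0;
    foreach (dchar d; s)
        buffer.put(size, d);
    return buffer[0 .. size];
}

void main()
{
    // Caller-provided buffer: no allocation as long as it is large enough.
    dchar[32] stackBuffer;
    auto r1 = toUTF32("héllo", stackBuffer[]);
    assert(r1 == "héllo"d);
    assert(r1.ptr is stackBuffer.ptr);

    // No buffer: the function allocates on demand.
    auto r2 = toUTF32("abc");
    assert(r2 == "abc"d);
}
```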

 Back to my original question, can we mimick a reference behavior with a  
 struct? I thought why not until I hit this bug:

 import std.array;
 import std.stdio;

 void append(Appender!(string) a, string s)
 {
 	a.put(s);
 }

 void main()
 {
 	Appender!(string) a;
 	string s = "test";
 	
 	append(a, s); // <
 	
 	writeln(a.data);	
 }

 I'm passing an appender by value since it's supposed to have a  
 reference type behavior and passing 4 bytes by reference is an overkill.

 However, the code above doesn't work for a simple reason: structs lack  
 default ctors. As such, an appender is initialized to null internally,  
 when I call append a copy of it gets initialized (lazily), but the  
 original one remains unchanged. Note that if you append to appender at  
 least once before passing by value, it will work. But that's sad. Not  
 only it allocates when it shouldn't, I also have to initialize it  
 explicitly!

 I think far better solution would be to make it non-copyable.

 TL;DR Reference semantic mimicking with a struct without default ctors  
 is unreliable since you must initialize your object lazily. Moreover,  
 you have to check that you struct is not initialized yet every single  
 function call, and that's error prone and bad for code clarity and  
 performance. I'm opposed of that practice.

 This is a point I've brought up before.  As of yet there is no  
 solution.  There have been a couple of ideas passed around, but there  
 hasn't been anything decided.  The one idea I remember (but didn't  
 really like) is to have the copy constructor be able to modify the  
 original.  This makes it possible to allocate the underlying  
 implementation in Appender for example, even on the data being passed.   
 There are lots of problems with this solution, and I don't think it got  
 much traction.

 I think the default constructor solution is probably never going to  
 happen.  It's very nice to always have a default fast way to initialize  


 My suggestion would be to have it be an actual reference type -- i.e. a  
 class.  I don't see any issues with that.  In that respect, you could  
 even have it be stack-allocated, since you have emplace.  But I don't  
 have a say in that.  I was the last one to update Appender, since it had  
 a bug-ridden design and needed to be fixed, but I tried to change as  
 little as possible.

 -Steve

Oct 16 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 16 Oct 2010 12:23:50 -0400, Denis Koroskin <2korden gmail.com>  
wrote:

 No, it doesn't use capacity, it uses length as a capacity instead:

 void ensureCapacity(T)(ref T[] array, size_t minCapacity)
 {
 	size_t capacity = array.length;
 	if (minCapacity < capacity) {
 		return;
 	}
 	
 	// need resize
 	capacity *= 2;
 	
 	if (capacity < 16) {
 		capacity = 16;
 	}

 	if (capacity < minCapacity) {
 		capacity = minCapacity;
 	}

 	array.length = capacity;
 }

 The usage pattern is as follows:

 dchar[] toUTF32(string s, dchar[] buffer = null)
 {
 	size_t size = 0;
 	foreach (dchar d; s) {
 		buffer.put(size, d);
 	}

 	return buffer[0..size];
 }

Oh, ok.  So you are keeping track of the length in a local variable.  That  
certainly works for specific applications, but Appender is supposed to be  
generally useful.

Like I said, an unsafe appender could be added to phobos which does the  
same.

-Steve

Oct 16 2010

"Denis Koroskin" <2korden gmail.com> writes:

Sorry, I misclicked a button and sent the message prematurely.

On Sat, 16 Oct 2010 20:16:40 +0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:
 A final option is to disable the copy constructor of such an unsafe  
 appender, but then you couldn't pass it around.

 What do you think?  If you think it's worth having, suggest it on the  
 phobos mailing list, and we'll discuss.

It's still possible to pass it by reference, or even by pointer. You know,  
that's what you actually do right now: you are passing a Data* (a pointer  
to an internal state, wrapped in an Appender struct).
Passing by pointer might actually be a good idea (because you can default  
it to null). One of the reasons I use "T[] buffer = null" as a buffer is  
that you aren't forced to provide one; null is also a valid buffer. Many  
functions would benefit from accepting an optional Appender (e.g. converting  
from utf8 to utf16 etc), but we shouldn't force callers to provide one.
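
A minimal sketch of the optional-pointer idea, assuming a made-up function repeatText that is not part of Phobos: the caller may pass a pointer to an existing Appender, or omit it and let the function fall back to a local one.

```d
import std.array : Appender;

// Hypothetical function: appends 's' to the sink 'n' times.
// Returns everything accumulated in the sink so far.
string repeatText(string s, size_t n, Appender!string* sink = null)
{
    Appender!string local;   // used only when the caller passes no sink
    if (sink is null)
        sink = &local;

    foreach (i; 0 .. n)
        sink.put(s);

    return sink.data;
}

void main()
{
    // No sink supplied: the function manages its own appender.
    assert(repeatText("ab", 2) == "abab");

    // Caller-owned sink, reused across calls.
    Appender!string acc;
    repeatText("x", 2, &acc);
    repeatText("y", 1, &acc);
    assert(acc.data == "xxy");
}
```

Because the callee works through a pointer to the caller's appender, the lazy initialization happens on the original object rather than on a copy.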

 Note that Appender is supposed to be fast at *appending* not  
 initializing itself.  In that respect, it's very fast.

This makes it useless for appending small amounts of data.

  I'm not sure it's worth the trade-off, and as such I defined and use  
 my own set of primitives that don't allocate when a buffer is provided:

 void put(T)(ref T[] array, ref size_t offset, const(T) value)
 {
      ensureCapacity(array, offset + 1);
      array[offset++] = value;
 }

 void put(T)(ref T[] array, ref size_t offset, const(T)[] value)
 {
      // Same but for an array
 }

 void ensureCapacity(ref char[] array, size_t minCapacity)
 {
     // ...
 }

 I'm not sure what ensureCapacity does, but if it does what I think it  
 does (use the capacity property of arrays), it's probably slower than  
 Appender, which has a dedicated variable for capacity.

 Back to my original question, can we mimick a reference behavior with a  
 struct? I thought why not until I hit this bug:

 import std.array;
 import std.stdio;

 void append(Appender!(string) a, string s)
 {
 	a.put(s);
 }

 void main()
 {
 	Appender!(string) a;
 	string s = "test";
 	
 	append(a, s); // <
 	
 	writeln(a.data);	
 }

 I'm passing an appender by value since it's supposed to have a  
 reference type behavior and passing 4 bytes by reference is an overkill.

 However, the code above doesn't work for a simple reason: structs lack  
 default ctors. As such, an appender is initialized to null internally,  
 when I call append a copy of it gets initialized (lazily), but the  
 original one remains unchanged. Note that if you append to appender at  
 least once before passing by value, it will work. But that's sad. Not  
 only it allocates when it shouldn't, I also have to initialize it  
 explicitly!

 I think far better solution would be to make it non-copyable.

 TL;DR Reference semantic mimicking with a struct without default ctors  
 is unreliable since you must initialize your object lazily. Moreover,  
 you have to check that you struct is not initialized yet every single  
 function call, and that's error prone and bad for code clarity and  
 performance. I'm opposed of that practice.

 This is a point I've brought up before.  As of yet there is no  
 solution.  There have been a couple of ideas passed around, but there  
 hasn't been anything decided.  The one idea I remember (but didn't  
 really like) is to have the copy constructor be able to modify the  
 original.  This makes it possible to allocate the underlying  
 implementation in Appender for example, even on the data being passed.   
 There are lots of problems with this solution, and I don't think it got  
 much traction.

 I think the default constructor solution is probably never going to  
 happen.  It's very nice to always have a default fast way to initialize  


I think there is, but it goes far beyond the default ctor problem (it solves  
many other issues, too).
Currently, a struct is initialized with T.init/T.classinfo.init.
Pros:
- simple initialization: malloc, followed by memcpy
- there is always an immutable instance of an object in memory, and you can  
  use it as the default/not initialized state

Cons:
- you can't initialize class/struct variables with runtime values
- increased file size (every single class/struct now has its own copy of  
  T.init)

In Java, they use another approach. Instead of memcpy'ing T.init on top of  
allocated data, they invoke a so-called cctor (as opposed to ctor). This  
is a method that initializes memory so that a ctor can be called.  
memcpy'ing T.init has the same idea, however it is not moved into a  
separate method. In general, cctor can be implemented the way it is in D  
without sacrificing anything. However, a type-unique method is a lot  
better than that:

1) Most structs initialize all of their members with 0. For these the  
compiler can use memset instead.
2) The killer feature, in my opinion: it allows initializing values with  
non-constant expressions:

class Foo
{
	ubyte[] buffer = new ubyte[BUFFER_SIZE];
}

This also solves an Appender issue:

struct Appender
{
	Data* data = new Data();
}

3) it allows getting rid of T.init, significantly reducing resulting file  
size

I'm not sure Walter will agree to such a radical change, but it can be  
achieved in small steps. D doesn't even have to get rid of T.init, it can  
still be there (but I'd like to get rid of it eventually)

a) Keep T.init/T.classinfo.init, introduce a compiler-generated cctor that  
memcpy's T.init over the object
(Optionally) Make the cctor smarter, and generate proper class/struct  
initialization code that doesn't rely on T.init
b) Allow non-constant expressions as initializers and initialize such  
members in the cctor
(Optionally) Get rid of T.init altogether
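
The proposal above is a language change and cannot be written in D as it stands. For comparison, here is a sketch of how runtime initialization can be forced today with a disabled default constructor and a factory function. This is not what the post proposes; Buffer, create, and the 4096 size are made up for the illustration.

```d
struct Buffer
{
    ubyte[] data;

    // Forbid 'Buffer b;' so that an all-null instance cannot be declared by accident.
    @disable this();

    this(size_t size)
    {
        data = new ubyte[size];
    }

    // Hypothetical factory standing in for a "default constructor".
    static Buffer create()
    {
        return Buffer(4096);
    }
}

void main()
{
    // Default construction is rejected at compile time.
    static assert(!__traits(compiles, { Buffer b; }));

    auto b = Buffer.create();
    assert(b.data.length == 4096);
}
```

Unlike the cctor idea, this shifts the burden to the caller, who must remember to go through the factory.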

 My suggestion would be to have it be an actual reference type -- i.e. a  
 class.  I don't see any issues with that.  In that respect, you could  
 even have it be stack-allocated, since you have emplace.  But I don't  
 have a say in that.  I was the last one to update Appender, since it had  
 a bug-ridden design and needed to be fixed, but I tried to change as  
 little as possible.

 -Steve

Oct 16 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 16 Oct 2010 12:59:46 -0400, Denis Koroskin <2korden gmail.com>  
wrote:

 Sorry, I misclicked a button and send the message preliminary.

 On Sat, 16 Oct 2010 20:16:40 +0400, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:
 A final option is to disable the copy constructor of such an unsafe  
 appender, but then you couldn't pass it around.

 What do you think?  If you think it's worth having, suggest it on the  
 phobos mailing list, and we'll discuss.

 It's still possible to pass it by reference, or even by pointer. You  
 know, that's what you actually do right now - you are passing a Data* (a  
 pointer to an internal state, wrapped with an Appender struct).

Yes, doing it this way forces you to use a pointer, since you can't pass  
by value.  That is the point.  To create a type with the property "if you  
don't pass it around correctly, it might blow up in your face" doesn't  
make much sense.  This is why I'd recommend using class for Appender,  
which also forces reference semantics, but does not use lazy construction.
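
A minimal sketch of the class-based alternative, assuming a toy ClassAppender written for this illustration (not a Phobos type). Since class variables are references, passing one to a function shares state without any lazy initialization inside the type.

```d
// Hypothetical class-based appender with ordinary reference semantics.
final class ClassAppender(T)
{
    private T[] buffer;
    private size_t used;

    void put(T value)
    {
        if (used == buffer.length)
            buffer.length = buffer.length ? buffer.length * 2 : 16;
        buffer[used++] = value;
    }

    T[] data()
    {
        return buffer[0 .. used];
    }
}

void append(ClassAppender!char a, char c)
{
    a.put(c);   // operates on the caller's object
}

void main()
{
    auto a = new ClassAppender!char;   // explicit construction, no lazy check in put
    append(a, 'x');
    assert(a.data() == "x");

    auto b = a;                        // copies the reference, not the state
    b.put('y');
    assert(a.data() == "xy");
}
```

The trade-off is that forgetting the new leaves a null reference, which fails loudly on first use instead of silently dropping data.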

 Note that Appender is supposed to be fast at *appending* not  
 initializing itself.  In that respect, it's very fast.

 This makes it useless for appending small amount of data.

Any generally usable appender is going to have some startup cost, so yes  
the overhead is going to make it non-optimal for small appends.  Use ~=  
for small amounts of data or use your method (writing directly to a  
buffer).  Appender is for appending large amounts of data.

 This is a point I've brought up before.  As of yet there is no  
 solution.  There have been a couple of ideas passed around, but there  
 hasn't been anything decided.  The one idea I remember (but didn't  
 really like) is to have the copy constructor be able to modify the  
 original.  This makes it possible to allocate the underlying  
 implementation in Appender for example, even on the data being passed.   
 There are lots of problems with this solution, and I don't think it got  
 much traction.

 I think the default constructor solution is probably never going to  
 happen.  It's very nice to always have a default fast way to initialize  


 I think there is, but it goes far beyond default ctors problem (it  
 solves many other issues, too).
 Currently, a struct is initialized with T.init/T.classinfo.init
 Pros:
 simple initialization - malloc, followed by memcpy
 there is always an immutable instance of an object in memory, and you  
 can use it as default/not initialized state

 Cons:
 you can't initialize class/struct variables with runtime values
 increased file size (every single class/struct now has a copy of its own)

 In Java, they use another approach. Instead of memcpy'ing T.init on top  
 of allocated data, they invoke a so-called cctor (as opposed to ctor).  
 This is a method that initializes memory so that a ctor can be called.  
 memcpy'ing T.init has the same idea, however it is not moved into a  
 separate method. In general, cctor can be implemented the way it is in D  
 without sacrificing anything. However, a type-unique method is a lot  
 better than that:

 1) most structs initialize all of its members with 0. For these compiler  
 can use memset instead.
 2) killer-feature in my opinion. It allows initializing values to  
 non-constant expressions:

 class Foo
 {
 	ubyte[] buffer = new ubyte[BUFFER_SIZE];
 }

 This also solves an Appender issue:

 struct Appender
 {
 	Data* data = new Data();
 }

 3) it allows getting rid of T.init, significantly reducing resulting  
 file size

 I'm not sure Walter will agree to such a radical change, but it can be  
 achieved in small steps. D doesn't even have to get rid of T.init, it  
 can still be there (but I'd like to get rid of it eventually)

 a) Keep T.init/T.classinfo.init, introduce compiler-generated cctor what  
 memcpy'ies T.init over the object
 (Optionally) Make cctor more smart, and generate proper class/struct  
 initialization code that doesn't rely on T.init
 b) Allow non-constant expressions as initializers and initialize such  
 members in the cctor
 (Optionally) Get rid of T.init altogether

This does sound promising.  I think we would need to try and make the  
'cctor' in D be very simple (low cost) otherwise you'll see issues when  
you for example allocate an array of structs.

So for example, you might only allow memory allocation and assignment.   
That would probably be enough for most cases, and would be (hopefully)  
fast enough to not be noticeable.  Not only that, but since the compiler is  
in charge of creating the cctor, it might be able to do some  
optimizations: if you are allocating an array of Appenders, it can  
bulk-construct all the data members required (i.e. take the GC lock only  
once).

Andrei, Walter?

-Steve

Oct 16 2010
