digitalmars.D - Re: Memory allocation in D (noob question)

reply mandel <oh no.es> writes:
It probably is a noob question,
but aren't array lengths just hidden size_t values
that are passed around?
Why do we need to allocate space for them, too?

void foo()
{
  size_t length;
  char* ptr; //allocated memory of 2^n
  //.. the same as..?
  char[] data;
}
Nov 30 2007
next sibling parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"mandel" <oh no.es> wrote in message news:fiqu9l$18v$1 digitalmars.com...
 It probably is a noob question,
 but aren't array lengths just hidden size_t values
 that are passed around?
 Why do we need to allocate space for them, too?

 void foo()
 {
  size_t length;
  char* ptr; //allocated memory of 2^n
  //.. the same as..?
  char[] data;
 }

What? I mean, yes, a size_t and a pointer will be the same size as an array reference, but the point of an array reference is that, well, it's an array reference. And you can do all kinds of things with them that you can't with pointers. What are you getting at?
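
The size claim is easy to check. A minimal sketch, assuming a current D2 compiler; FakeSlice and sameSize are hypothetical names made up for this illustration, not anything from the runtime:

```d
import std.stdio;

// Hypothetical stand-in for the two values a char[] reference carries.
struct FakeSlice
{
    size_t length;
    char*  ptr;
}

// True when the built-in slice and the hand-written pair occupy the same space.
bool sameSize()
{
    return (char[]).sizeof == FakeSlice.sizeof;
}

void main()
{
    char[] data = "hello".dup;
    // .length and .ptr expose the two fields of the real slice.
    writeln(data.length, " ", data.ptr !is null);
    writeln(sameSize());
}
```

The difference is in what the language does with the pair: indexing and slicing a char[] can be bounds-checked, while arithmetic on a bare char* cannot.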
Nov 30 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 12/1/07, mandel <oh no.es> wrote:
 Why do we need to allocate space for them, too?

Raw pointers are discouraged in modern languages such as D. They are the source of too many bugs. Use them if you need to do down-and-dirty, under-the-hood stuff, but for general use, forget pointers. Use arrays. They're safer.
Nov 30 2007
prev sibling parent reply Robert Fraser <fraserofthenight gmail.com> writes:
mandel wrote:
 It probably is a noob question,
 but aren't array lengths just hidden size_t values
 that are passed around?
 Why do we need to allocate space for them, too?
 
 void foo()
 {
   size_t length;
   char* ptr; //allocated memory of 2^n
   //.. the same as..?
   char[] data;
 }

The extra space allocated isn't for the length (in fact, it's just a byte I think); it's to make checking for array bounds errors possible (since there's a byte of space that, if accessed, indicates an overflow). It might be used for something else, too.
Nov 30 2007
next sibling parent reply mandel <oh no.es> writes:
Robert Fraser Wrote:

 mandel wrote:
 It probably is a noob question,
 but aren't array lengths just hidden size_t values
 that are passed around?
 Why do we need to allocate space for them, too?
 
 void foo()
 {
   size_t length;
   char* ptr; //allocated memory of 2^n
   //.. the same as..?
   char[] data;
 }

The extra space allocated isn't for the length (in fact, it's just a byte I think); it's to make checking for array bounds errors possible (since there's a byte of space that, if accessed, indicates an overflow). It might be used for something else, too.

Thanks, that answers my question. But I can't think right now how it could be used for array bounds checking. Well, I guess there's some ng post about this somewhere. But the page allocation overhead looks ugly for a language like D. Anyway, it's good to have D arrays. Working with pointers in C often led to surprises whenever my attention slipped. :>
Dec 01 2007
next sibling parent Don Clugston <dac nospam.com.au> writes:
mandel wrote:
 Robert Fraser Wrote:
 
 mandel wrote:
 It probably is a noob question,
 but aren't array lengths just hidden size_t values
 that are passed around?
 Why do we need to allocate space for them, too?

 void foo()
 {
   size_t length;
   char* ptr; //allocated memory of 2^n
   //.. the same as..?
   char[] data;
 }

The extra space allocated isn't for the length (in fact, it's just a byte I think); it's to make checking for array bounds errors possible (since there's a byte of space that, if accessed, indicates an overflow). It might be used for something else, too.

Thanks, that answers my question. But I can't think right now how it could be used for array bounds checking. Well, I guess there's some ng post about this somewhere. But the page allocation overhead looks ugly for a language like D. Anyway, it's good to have D arrays. Working with pointers in C often led to surprises whenever my attention slipped. :>

An observation... In my experience, most pointer bugs are actually uninitialised variables. An uninitialised pointer is a truly horrible thing. But since D initialises variables, pointers in D aren't nearly as bad as in C.
Dec 01 2007
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"mandel" wrote in message
 Robert Fraser Wrote:

 mandel wrote:
 It probably is a noob question,
 but aren't array lengths just hidden size_t values
 that are passed around?
 Why do we need to allocate space for them, too?

 void foo()
 {
   size_t length;
   char* ptr; //allocated memory of 2^n
   //.. the same as..?
   char[] data;
 }

The extra space allocated isn't for the length (in fact, it's just a byte I think); it's to make checking for array bounds errors possible (since there's a byte of space that, if accessed, indicates an overflow). It might be used for something else, too.

Thanks, that answers my question. But I can't think right now how it could be used for array bounds checking. Well, I guess there's some ng post about this somewhere. But the page allocation overhead looks ugly for a language like D.

Think of it this way:

    int[] array1 = new int[5];
    int[] array2 = new int[5];

Imagine that array1 and array2 are now sequential in memory *AND* there is no extra byte separating them. Now I create the valid array slices:

    int[] array3 = array1[$..$];
    int[] array4 = array2[0..0];

Note that both of these arrays are bit-for-bit identical (both have 0 length and the same ptr value). Which one points to which piece of memory? How is the GC to decide which memory gets collected? These are the types of problems that the extra byte helps with. I personally think there exists a way to fix this efficiently without adding the extra byte, but I can't think of one :)

Oh, and also, the size_t length is not stored in the allocated memory. It's stored in the array structure, usually on the stack or inside a class instance.

I hope this helps your understanding of the issue.

-Steve
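
The pointer arithmetic behind the two empty slices can be checked directly. A minimal sketch, assuming a current D2 compiler; endSliceIsOnePast is a hypothetical helper written for this illustration:

```d
import std.stdio;

// True when the empty slice a[$ .. $] starts exactly one element past
// the last element of a, i.e. where an adjacent block would begin.
bool endSliceIsOnePast(int[] a)
{
    int[] tail = a[$ .. $];
    return tail.length == 0 && tail.ptr == a.ptr + a.length;
}

void main()
{
    int[] array1 = new int[5];
    int[] array2 = new int[5];

    int[] array3 = array1[$ .. $];
    int[] array4 = array2[0 .. 0];

    writeln(endSliceIsOnePast(array1));
    // Whether the two empty slices really coincide depends on where the
    // allocator placed the blocks and on any padding it added.
    writeln(array3.ptr == array4.ptr);
}
```

The second line printed is allocator-dependent, which is exactly the point of the post: without padding the two pointers could be equal, and nothing in the slice itself says which block it belongs to.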
Dec 03 2007
next sibling parent Sean Kelly <sean f4.ca> writes:
Steven Schveighoffer wrote:
 "mandel" wrote in message
 Robert Fraser Wrote:

 mandel wrote:
 It probably is a noob question,
 but aren't array lengths just hidden size_t values
 that are passed around?
 Why do we need to allocate space for them, too?

 void foo()
 {
   size_t length;
   char* ptr; //allocated memory of 2^n
   //.. the same as..?
   char[] data;
 }

The extra space allocated isn't for the length (in fact, it's just a byte I think); it's to make checking for array bounds errors possible (since there's a byte of space that, if accessed, indicates an overflow). It might be used for something else, too.

But I can't think right now how it could be used for array bounds checking. Well, I guess there's some ng post about this somewhere. But the page allocation overhead looks ugly for a language like D.

Think of it this way:

    int[] array1 = new int[5];
    int[] array2 = new int[5];

Imagine that array1 and array2 are now sequential in memory *AND* there is no extra byte separating them. Now I create the valid array slices:

    int[] array3 = array1[$..$];
    int[] array4 = array2[0..0];

Note that both of these arrays are bit-for-bit identical (both have 0 length and the same ptr value). Which one points to which piece of memory? How is the GC to decide which memory gets collected? These are the types of problems that the extra byte helps with. I personally think there exists a way to fix this efficiently without adding the extra byte, but I can't think of one :)

Oh, and also, the size_t length is not stored in the allocated memory. It's stored in the array structure, usually on the stack or inside a class instance.

This is true in D 1.0. However, there has been talk that arrays in D 2.0 would change from:

    struct Array
    {
        size_t length;
        byte* ptr;
    }

to:

    struct Array
    {
        byte* ptr;
        byte* end;
    }

Which would make every array reference always point to itself and to the block immediately following it in memory, if no padding is done.

Sean
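
To see why the second layout makes the adjacency problem worse, here is a rough sketch of both representations side by side. The names LenPtr, PtrEnd and toPtrEnd are hypothetical, chosen for this illustration only, and the code assumes a current D2 compiler:

```d
import std.stdio;

// Layout described in the post for D 1.0: length plus start pointer.
struct LenPtr
{
    size_t length;
    byte*  ptr;
}

// Layout that was being discussed: start pointer and one-past-the-end pointer.
struct PtrEnd
{
    byte* ptr;
    byte* end;

    // The length is no longer stored; it is derived from the two pointers.
    size_t length() const { return cast(size_t)(end - ptr); }
}

// Converts the first representation into the second.
PtrEnd toPtrEnd(LenPtr a)
{
    return PtrEnd(a.ptr, a.ptr + a.length);
}

void main()
{
    byte[] buf = new byte[8];
    auto p = toPtrEnd(LenPtr(buf.length, buf.ptr));
    writeln(p.length);
    // p.end is the address right after the last element of buf.
    writeln(p.end == buf.ptr + buf.length);
}
```

With the pointer pair, every non-empty array holds an end pointer that lands on whatever follows its block, so a conservative GC would see it as a reference to the neighbouring allocation unless some padding sits in between.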
Dec 03 2007
prev sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
mandel wrote:
 Steven Schveighoffer wrote:
 [..]
 Now I create the valid array slices:

 int[] array3 = array1[$..$];
 int[] array4 = array2[0..0];

 Note that both of these arrays are bit-for-bit identical (both have 0
 length and the same ptr value).  Which one points to which piece of
 memory?  How is the GC to decide which memory gets collected?

The first possible solution that comes to my mind seeing this is to make array1[0..0] and array1[$..$] equal. array1[$..$] could point to the beginning of the array. Since the slice length is zero, it shouldn't matter, should it?

Appending to an array slice (empty or not) that starts at the start of an allocated block appends in place rather than allocating a new array. This is the reason while(x) a ~= b; can be reasonably efficient. So appending to the [$..$] array would (without padding) mean that you corrupt the following array. The upcoming D2 T[new] (hopefully T[*] :) ) array type will probably make that a non-issue though.
 Second thought, why not ignore empty slices at all by telling
 the GC that the pointers doesn't hold any data.

Except for the fact that having an empty slice at the start of an allocated block is needed for appending to a preallocated block in current D, the reason is that the current GC doesn't have that fine-grained information. It currently only knows "this block might contain pointers" and "this block doesn't contain pointers", and in the former case, treats everything properly aligned as a potential pointer.

-- Oskar
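
The efficiency of the append loop can be observed by counting how often the data pointer moves. A minimal sketch, assuming a current D2 compiler; countMoves is a hypothetical helper, and the exact count it reports depends on the runtime's growth policy:

```d
import std.stdio;

// Appends n ints one at a time and counts how often the data pointer
// changes, i.e. how often the runtime had to move the array instead of
// growing it in place.
size_t countMoves(size_t n)
{
    int[] a;
    size_t moves = 0;
    foreach (i; 0 .. n)
    {
        auto before = a.ptr;
        a ~= cast(int) i;
        if (a.ptr !is before)
            ++moves;
    }
    return moves;
}

void main()
{
    writeln("pointer moves for 10_000 appends: ", countMoves(10_000));
}
```

If every append allocated a fresh array, the count would equal the number of appends. In-place growth is what keeps it far lower.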
Dec 03 2007
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Oskar Linde" wrote
 mandel wrote:
 Steven Schveighoffer wrote:
 [..]
 Now I create the valid array slices:

 int[] array3 = array1[$..$];
 int[] array4 = array2[0..0];

 Note that both of these arrays are bit-for-bit identical (both have 0
 length and the same ptr value).  Which one points to which piece of
 memory?  How is the GC to decide which memory gets collected?

The first possible solution that comes to my mind seeing this is to make array1[0..0] and array1[$..$] equal. array1[$..$] could point to the beginning of the array. Since the slice length is zero, it shouldn't matter, should it?

Appending to an array slice (empty or not) that starts at the start of an allocated block appends in place rather than allocating a new array. This is the reason while(x) a ~= b; can be reasonably efficient.

Hm... I think you are slightly incorrect. I think the array is appended to in place ONLY if the data after the slice is unallocated. In this case, it would be allocated, so the array would be re-allocated elsewhere.
 So appending to the [$..$] array would (without padding) mean that you 
 corrupt the following array.

I think this is incorrect for the reasons I stated above. An allocated block should never be re-assigned to another array.

Maybe I am wrong, but I think mandel might have a possible solution to this problem. If you slice an empty array (or even allocate an empty array), set the pointer to null. No reason to allocate an empty array, and no reason you need to keep memory around for it. If you append to it, it's going to be like appending to an init array anyways. That would also make null comparisons more consistent like:

    int[] array1 = array2[0..0];

    if(array1 is null) // evaluates to true!
    ...

-Steve
Dec 04 2007
next sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Steven Schveighoffer wrote:
 "Oskar Linde" wrote
 mandel wrote:
 Steven Schveighoffer wrote:
 [..]
 Now I create the valid array slices:

 int[] array3 = array1[$..$];
 int[] array4 = array2[0..0];

 Note that both of these arrays are bit-for-bit identical (both have 0
 length and the same ptr value).  Which one points to which piece of
 memory?  How is the GC to decide which memory gets collected?

The first possible solution that comes to my mind seeing this is to make array1[0..0] and array1[$..$] equal. array1[$..$] could point to the beginning of the array. Since the slice length is zero, it shouldn't matter, should it?

Appending to an array slice (empty or not) that starts at the start of an allocated block appends in place rather than allocating a new array. This is the reason while(x) a ~= b; can be reasonably efficient.

Hm... I think you are slightly incorrect. I think the array is appended to in place ONLY if the data after the slice is unallocated. In this case, it would be allocated, so the array would be re-allocated elsewhere.

Try this:

    char[] ab = "ab".dup;
    char[] a = ab[0..1];
    a ~= "c";
    writefln("ab = ",ab);
 So appending to the [$..$] array would (without padding) mean that you 
 corrupt the following array.

I think this is incorrect for the reasons I stated above. An allocated block should never be re-assigned to another array.

See my example above. The allocated block is deduced from the slice .ptr. If the pointer points at the start of another array, DMD would have no way of knowing it isn't a slice of that other array.
 Maybe I am wrong, but I think mandel might have a possible solution to this 
 problem.  If you slice an empty array (or even allocate an empty array), set 
 the pointer to null.  No reason to allocate an empty array, and no reason 
 you need to keep memory around for it.  If you append to it, it's going to 
 be like appending to an init array anyways.  That would also make null 
 comparisons more consistent like:
 
 int[] array1 = array2[0..0];
 
 if(array1 is null) // evaluates to true!
 ...

There are several cases where it is useful to retain the pointer when the array is of zero length, for example a zero length regexp sub-expression match. It used to be the case that setting a slice length to 0 (via the length property) made the pointer null as well. That changed a while ago. -- Oskar
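
The same experiment can be written as a function that returns the original array, so the result can be inspected. This is a sketch under stated assumptions: it targets a current D2 compiler, and appendToFrontSlice is a hypothetical name. On the D1 compilers discussed in this thread the original array reportedly ends up as "ac". As far as I know, current D2 runtimes track how much of each block is in use and reallocate on this append, leaving the original untouched.

```d
import std.stdio;

// Takes a one-element slice from the front of a two-element array,
// appends to the slice, and returns the original array for inspection.
char[] appendToFrontSlice()
{
    char[] ab = "ab".dup;
    char[] a = ab[0 .. 1];
    a ~= "c";   // may grow in place or reallocate, depending on the runtime
    return ab;
}

void main()
{
    writeln("ab = ", appendToFrontSlice());
}
```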
Dec 04 2007
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Oskar Linde" wrote
 Try this:

         char[] ab = "ab".dup;
         char[] a = ab[0..1];
         a ~= "c";
         writefln("ab = ",ab);

Outputs "ac".

So this appears to be a bug then, because the spec says:

"Concatenation always creates a copy of its operands, even if one of the operands is a 0 length array"

and then for the append operator example:

"a ~= b; // a becomes the concatenation of a and b"

Changing the line to a = a ~ "c" changes the output of the program to "ab".

Another reason why this seems to be a bug and NOT a feature:

    string ab = "ab".idup;
    string a = ab[0..1];
    a ~= "c";
    writefln("ab = ",ab); // also outputs "ac"

This changes an invariant string without compiler complaint! Note that Tango has this problem too.
 So appending to the [$..$] array would (without padding) mean that you 
 corrupt the following array.

I think this is incorrect for the reasons I stated above. An allocated block should never be re-assigned to another array.

See my example above. The allocated block is deduced from the slice .ptr. If the pointer points at the start of another array, DMD would have no way of knowing it isn't a slice of that other array.

I think your example exposes a bug, and does not agree with what the spec says. There seems to be a silent agreement among everyone that D should behave that way, but I can't find anything in the spec that states it should. Is this something that is planned to be fixed or at least described correctly in the spec?

If someone desires this behavior, I would say that it's possible to keep a reference to the entire array and use the copy operator, i.e.:

    ab[1..$] = "c";

Perhaps there could be another way to extend the slice if more buffer space exists? With the caveat that you know that if you have other references to that data, they could be changed too?

I can see the usefulness of using an array as a buffer which keeps its allocated space as it shrinks, but this is not worth having x ~= y not mean the same thing as x = x ~ y. The current meaning is too error prone in my opinion.

-Steve
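
The two alternatives mentioned here, binary ~ and an explicit slice copy, can be sketched as follows. This assumes a current D2 compiler; concatAllocates and overwriteTail are hypothetical helpers written for this illustration:

```d
import std.stdio;

// Binary ~ builds a new array, so for non-empty operands the result
// does not start at the left operand's address.
bool concatAllocates(const(char)[] a, const(char)[] b)
{
    const(char)[] joined = a ~ b;
    return joined.ptr !is a.ptr;
}

// Explicit in-place overwrite through a reference to the whole array.
// The lengths of buf[from .. $] and src must match exactly.
void overwriteTail(char[] buf, size_t from, const(char)[] src)
{
    buf[from .. $] = src[];
}

void main()
{
    writeln(concatAllocates("a", "c"));

    char[] ab = "ab".dup;
    overwriteTail(ab, 1, "c");
    writeln(ab);
}
```

The slice copy makes the overwrite visible at the call site, which is the property the post is asking for: nothing is modified unless the code names the memory it is writing to.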
Dec 04 2007
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Steven Schveighoffer wrote:
 "Oskar Linde" wrote
 Try this:

         char[] ab = "ab".dup;
         char[] a = ab[0..1];
         a ~= "c";
         writefln("ab = ",ab);

Outputs "ac" So this appears to be a bug then. Because from the spec: "Concatenation always creates a copy of its operands, even if one of the operands is a 0 length array"

Then the spec is wrong. The current behavior is very deliberate, insofar as the code is concerned. Look at internal/gc/gc.d. It is also deliberate for interior slices to always reallocate on an append. But the runtime has no way to know whether something pointing to the head of a block is a slice or is the original array. I've never actually found this to be a problem in practice, and I'll admit to having used the slice terminology from time to time, because it's more succinct than resizing using the length property.
 So appending to the [$..$] array would (without padding) mean that you 
 corrupt the following array.

I think this is incorrect for the reasons I stated above. An allocated block should never be re-assigned to another array.

If the pointer points at the start of another array, DMD would have no way of knowing it isn't a slice of that other array.

I think your example exposes a bug, and does not agree with what the spec says.

The spec also says that D 1.0 has inheritable contracts, and maybe we will one day, but it's not even on the radar at the moment. For better or worse, I've learned not to put much stock in what the spec says about some things.
 There seems to be a silent agreement among everyone that D should behave 
 that way, but I can't find anything in the spec that states it should.  Is 
 this something that is planned to be fixed or at least described correctly 
 in the spec?
 
 If someone desires this behavior, I would say that it's possible to keep a 
 reference to the entire array and use the copy operator.  i.e.:
 
 ab[1..$] = "c";
 
 perhaps there could be another way to extend the slice if more buffer space 
 exists?

It would be easy to allow all slices to be extended in place, even the interior ones. But going the other direction would be difficult. The proposed T[new] syntax might help in that direction, but I hate it. Sean
Dec 04 2007
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
BTW, here is a list of tango classes that can corrupt data that was not 
passed to them.  These should probably be fixed:

tango.net.ftp.FtpClient
tango.net.cluster.NetworkCall
tango.util.log.Hierarchy
tango.stdc.stringz

I haven't looked through phobos, but I'm sure there are instances which have 
this problem as well.

-Steve 
Dec 05 2007
parent Sean Kelly <sean f4.ca> writes:
Steven Schveighoffer wrote:
 BTW, here is a list of tango classes that can corrupt data that was not 
 passed to them.  These should probably be fixed:
 
 tango.net.ftp.FtpClient
 tango.net.cluster.NetworkCall
 tango.util.log.Hierarchy
 tango.stdc.stringz
 
 I haven't looked through phobos, but I'm sure there are instances which have 
 this problem as well.

Thanks, I'll look into these. Sean
Dec 05 2007
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Steven Schveighoffer" wrote
 Another reason why this seems to be a bug and NOT a feature:

        string ab = "ab".idup;
        string a = ab[0..1];
        a ~= "c";
        writefln("ab = ",ab); // also outputs "ac"

 This changes an invariant string without compiler complaint!

more bugs :)

    import std.stdio;

    struct X
    {
        char[5] myArray;
        int x;
    }

    void main()
    {
        X[] x = new X[2];
        x[0].myArray[] = "hello";
        char[] myslice = x[0].myArray[0..3];
        writefln("%x %x %x", &x[0].x, &x[0].myArray[0], &myslice[0]);
        myslice ~= "hithere";
        writefln("%x %x %x", &x[0].x, &x[0].myArray[0], &myslice[0]);
        writefln("%s %d", x[0].myArray, x[0].x);
    }

output:

    868FE8 868FE0 868FE0
    868FE8 868FE0 868FE0
    helhi 25970

-Steve
Dec 04 2007
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Steven Schveighoffer wrote:
 "Steven Schveighoffer" wrote
 Another reason why this seems to be a bug and NOT a feature:

        string ab = "ab".idup;
        string a = ab[0..1];
        a ~= "c";
        writefln("ab = ",ab); // also outputs "ac"

 This changes an invariant string without compiler complaint!

more bugs :)

This is expected behavior. Sean
Dec 04 2007
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Sean Kelly" wrote
 Steven Schveighoffer wrote:
 "Steven Schveighoffer" wrote
 Another reason why this seems to be a bug and NOT a feature:

        string ab = "ab".idup;
        string a = ab[0..1];
        a ~= "c";
        writefln("ab = ",ab); // also outputs "ac"

 This changes an invariant string without compiler complaint!

more bugs :)

This is expected behavior.

Behavior by design, perhaps. Expected, I should hope not. I would never expect to be able to have one variable overwrite another without obvious casting. And why should it be 'expected behavior' for the GC to assume that because an array is at the beginning of a memory block, it is free to use any memory in that block? I think I've proven that there are cases where it should not assume that.

I'm not saying there is a bug in the compiler implementation, or that the docs need to be changed to reflect the compiler behavior. I'm saying the design here is flat out wrong, and the fix needs to be reflected in the compiler. My recommendation would be to make the ~= behave exactly as the spec says, that it always makes a copy of its arguments. If you need buffer-like behavior for performance, write a new type.

Isn't one of Walter's goals to prevent silent runtime errors? IMO, memory corruption errors are the worst kind of silent errors.

-Steve
Dec 04 2007
parent reply Regan Heath <regan netmail.co.nz> writes:
Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 Steven Schveighoffer wrote:
 "Steven Schveighoffer" wrote
 Another reason why this seems to be a bug and NOT a feature:

        string ab = "ab".idup;
        string a = ab[0..1];
        a ~= "c";
        writefln("ab = ",ab); // also outputs "ac"

 This changes an invariant string without compiler complaint!




In this post I'm commenting on the example shown above, not the 2nd one (which to be honest is much more worrying). I am a bit confused as to which example Sean was saying was "expected behaviour".
 Behavior by design, perhaps.  Expected, I should hope not.  I would never 
 expect to be able to have one variable overwrite another without obvious 
 casting.  

Both variables above are references to the same data. You're using one variable to change that data, therefore the other variable, which still refers to the same data, sees the changes. If the concatenation operation had to reallocate the memory it would produce a copy, and you wouldn't see the changes. So, this behaviour is non-deterministic, however...
 And why should it be 'expected behavior' for the GC to assume that
 because an array is at the beginning of a memory block, it is free to use 
 any memory in that block?  I think I've proven that there are cases where it 
 should not assume that.

The assumption fits with D's (semi-)official "copy on write" policy. If you want to write to memory and you cannot be sure you are the only reference then you should copy the data before writing. Following this guideline makes the behaviour deterministic, and...
 I'm not saying there is a bug in the compiler implementation, or that the 
 docs need to be changed to reflect the compiler behavior.  I'm saying the 
 design here is flat out wrong, and needs to be reflected in the compiler. 
 My recommendation would be to make the ~= behave exactly as the spec says, 
 that it always makes a copy of its arguments.  If you need buffer-like 
 behavior for performance, write a new type.

The current behaviour allows you to skip the copy step if you _know_ you hold the only reference to the data; it's putting the choice/power in the programmer's hands. As always, power can be a dangerous thing if misused :)
 Isn't one of Walter's goals to prevent silent runtime errors?  IMO, memory 
 corruption errors are the worst kind of silent errors.

The example shown above is not corrupting any memory. The 2nd one (not shown above) seems to be and it worries me much more. Regan
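
The "copy on write" guideline described in this post amounts to duplicating before modifying whenever other references might exist. A minimal sketch, assuming a current D2 compiler; appendCopy is a hypothetical helper, not a library function:

```d
import std.stdio;

// Copy-on-write style append: never writes through the caller's slice,
// always returns a freshly allocated array with suffix added.
char[] appendCopy(const(char)[] original, const(char)[] suffix)
{
    char[] copy = original.dup;   // take a private copy first
    copy ~= suffix;               // growing the private copy cannot touch the original
    return copy;
}

void main()
{
    char[] ab = "ab".dup;
    char[] ac = appendCopy(ab[0 .. 1], "c");
    writeln(ab, " ", ac);
}
```

The cost is one extra allocation per call, which is the trade-off being argued over in the thread: the in-place ~= avoids that allocation, but only safely when the caller holds the sole reference.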
Dec 05 2007
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Regan Heath" wrote
 Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 Steven Schveighoffer wrote:
 "Steven Schveighoffer" wrote
 Another reason why this seems to be a bug and NOT a feature:

        string ab = "ab".idup;
        string a = ab[0..1];
        a ~= "c";
        writefln("ab = ",ab); // also outputs "ac"

 This changes an invariant string without compiler complaint!




In this post I'm commenting on the example shown above, not the 2nd one (which to be honest is much more worrying). I am a bit confused as to which example Sean was saying was "expected behaviour".
 Behavior by design, perhaps.  Expected, I should hope not.  I would never 
 expect to be able to have one variable overwrite another without obvious 
 casting.

Both variables above are references to the same data. You're using one variable to change that data, therefore the other variable, which still refers to the same data, sees the changes. If the concatenation operation had to reallocate the memory it would produce a copy, and you wouldn't see the changes. So, this behaviour is non-deterministic, however...

The problem is that invariant data is changing. This is a no-no for pure functions which Walter has planned. If invariant data can change without violating the rules of the spec, then the compiler implementation or design is flawed. I think the design is what is flawed.

I have several problems with this concat operator issue.

First, x ~= y does not produce the same behavior as x = x ~ y. This is a fundamental flaw in the language in my opinion. Any operator of the op= form is supposed to mean the same as x = x op y. This is consistent throughout all of D, except in this case.

Second, there is the issue of the spec. The spec clearly states that concatenation should result in a copy of both sides. Obviously, this isn't true in all cases. The spec should be changed for both D 1.x and 2.x IMMEDIATELY to prevent unsuspecting coders from using ~= when what they really want is just ~.

Third, I have not seen this T[new] operator described anywhere, but I am concerned that D 1.0 will not be updated. This leaves all coders who are not ready to switch to D 2 at risk. But from the inferred behavior of T[new], I'm expecting that this will probably fix the problem.

-Steve
Dec 05 2007
next sibling parent reply Regan Heath <regan netmail.co.nz> writes:
Steven Schveighoffer wrote:
 "Regan Heath" wrote
 Both variables above are references to the same data.  You're using one 
 variable to change that data, therefore the other variable which still 
 refers to the same data, sees the changes.

 If the concatenation operation had to reallocate the memory it would 
 produce a copy, and you wouldn't see the changes.

 So, this behaviour is non-deterministic, however...

The problem is that invariant data is changing. This is a no-no for pure functions which Walter has planned. If invariant data can change without violating the rules of the spec, then the compiler implementation or design is flawed. I think the design is what is flawed.

That is another issue which I didn't even address. Assuming 'string' means 'invariant(char)' and assuming that means the char values cannot change (I say assuming because I haven't had the chance to really internalise the new const yet) then I reckon the implementation of invariant is simply broken/buggy.
 I have several problems with this concat operator issue.
 
 First, that x ~= y does not effect the same behavior as x = x ~ y.  This is 
 a fundamental flaw in the language in my opinion.  any operator of the op= 
 form is supposed to mean the same as x = x op y.  This is consistent 
 throughout all of D, except in this case.

The problem stems from the fact that x ~= y always assigns the result to x, whereas x ~ y can potentially be assigned to something else. This means the latter must create a new/temporary object to store the result. In the case of arrays this effectively means that x ~ y always creates a new array which is a copy of the old ones. But x ~= y need not create a new array as it can append to the existing one. The ~= form therefore allows an optimisation which is beneficial. Not allowing people to have both methods at their disposal would likely cause an outcry.
 Second, there is the issue of the spec.  The spec clearly states that 
 concatenation should result in a copy of both sides.  

1. The website cannot be trusted completely and is often behind the compiler when it comes to the spec.

2. It could be argued that "concatenation" is the x ~ y form and not the ~= form, which is called "append". From the website spec:

"The binary operator ~ is the cat operator. It is used to concatenate arrays"

"Similarly, the ~= operator means append"

"Concatenation always creates a copy of its operands, even if one of the operands is a 0 length array"

I'm probably splitting hairs here and I doubt there is much point arguing it - I just wanted to point out another way of reading the spec.
 Obviously, this isn't
 true in all cases.  The spec should be changed for both D 1.x and 2.x 
 IMMEDIATELY to prevent unsuspecting coders from using ~= when what they 
 really want is just ~.

 Third, I have not seen this T[new] operator described anywhere, but I am 
 concerned that D 1.0 will not be updated.  This leaves all coders who are 
 not ready to switch to D 2 at risk.  But from the inferred behavior of 
 T[new], I'm expecting that this will probably fix the problem.

Aside from the apparent invariant bug the only case which causes me a slight worry is the case involving a struct. The only solution I can imagine would be to somehow determine the memory was originally allocated to a 'struct' and therefore reallocation for an 'array' must cause a copy. I'm not sure what information the GC keeps on allocated blocks. I believe there is a pointers/nopointers flag and that could perhaps form the basis of a fairly crude test (as struct contains pointers and char[] does not).

Even if nothing can be done to detect this case I'm not sure it's a huge issue. After all, it only affects people using static arrays as the first member of a struct which they take a slice of and then modify (concatenate) without performing "copy on write" - which is a no-no in D anyway.

Regan
Dec 05 2007
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Regan Heath wrote:
 Steven Schveighoffer wrote:
 "Regan Heath" wrote
 Both variables above are references to the same data.  You're using 
 one variable to change that data, therefore the other variable which 
 still refers to the same data, sees the changes.

 If the concatenation operation had to reallocate the memory it would 
 produce a copy, and you wouldn't see the changes.

 So, this behaviour is non-deterministic, however...

The problem is that invariant data is changing. This is a no-no for pure functions which Walter has planned. If invariant data can change without violating the rules of the spec, then the compiler implementation or design is flawed. I think the design is what is flawed.

That is another issue which I didn't even address. Assuming 'string' means 'invariant(char)' and assuming that means the char values cannot change (I say assuming because I haven't had the chance to really internalise the new const yet) then I reckon the implementation of invariant is simply broken/buggy.

I don't know that it's broken so much as potentially misleading. That example never actually changed any of the data in the string, it simply appended additional data to the string. Thus invariance of the data was preserved.
 Third, I have not seen this T[new] operator described anywhere, but I 
 am concerned that D 1.0 will not be updated.  This leaves all coders 
 who are not ready to switch to D 2 at risk.  But from the inferred 
 behavior of T[new], I'm expecting that this will probably fix the 
 problem.

Aside from the apparent invariant bug the only case which causes me a slight worry is the case involving a struct. The only solution I can imagine would be to somehow determine the memory was originally allocated to a 'struct' and therefore reallocation for an 'array' must cause a copy. I'm not sure what information the GC keeps on allocated blocks, I believe there is a pointers/nopointers flag and that could form the basis of a fairly crude test perhaps (as struct contains pointers and char[] does not);

That's pretty much it. And if the GC were to retain additional type info it would be tailored to finding pointers for collection purposes rather than determining whether one section of a block is a static array or an int.
 Even if nothing can be done to detect this case I'm not sure it's a huge 
 issue, after all it only affects people using static arrays as the first 
 member of a struct which they take a slice of and then modify 
 (concatenate) without performing "copy on write" - which is a no no in D 
 anyway.

Right. Sean
Dec 05 2007
parent reply Regan Heath <regan netmail.co.nz> writes:
Sean Kelly wrote:
 Regan Heath wrote:
 Steven Schveighoffer wrote:
 The problem is that invariant data is changing.  This is a no-no for 
 pure
 functions which Walter has planned.  If invariant data can change 
 without violating the rules of the spec, then the compiler 
 implementation or design is flawed.  I think the design is what is 
 flawed.

That is another issue which I didn't even address. Assuming 'string' means 'invariant(char)' and assuming that means the char values cannot change (I say assuming because I haven't had the chance to really internalise the new const yet) then I reckon the implementation of invariant is simply broken/buggy.

I don't know that it's broken so much as potentially misleading. That example never actually changed any of the data in the string, it simply appended additional data to the string. Thus invariance of the data was preserved.

[example pasted again for clarity]
 string ab = "ab".idup;
 string a = ab[0..1];
 a ~= "c";
 writefln("ab = ",ab); // also outputs "ac"

True for 'a' but when appending to 'a' it writes over the memory which 'ab' guarantees is invariant.

So, in the general case any slice of invariant data which is shorter than the original invariant data can be used to overwrite the original invariant data after the slice.

Perhaps ~= should be disabled for invariant arrays.

Regan
Dec 05 2007
next sibling parent Sean Kelly <sean f4.ca> writes:
Regan Heath wrote:
 Sean Kelly wrote:
 Regan Heath wrote:
 Steven Schveighoffer wrote:
 The problem is that invariant data is changing.  This is a no-no for 
 pure
 functions which Walter has planned.  If invariant data can change 
 without violating the rules of the spec, then the compiler 
 implementation or design is flawed.  I think the design is what is 
 flawed.

That is another issue which I didn't even address. Assuming 'string' means 'invariant(char)' and assuming that means the char values cannot change (I say assuming because I haven't had the chance to really internalise the new const yet) then I reckon the implementation of invariant is simply broken/buggy.

I don't know that it's broken so much as potentially misleading. That example never actually changed any of the data in the string, it simply appended additional data to the string. Thus invariance of the data was preserved.

 [example pasted again for clarity]

  > string ab = "ab".idup;
  > string a = ab[0..1];
  > a ~= "c";
  > writefln("ab = ",ab); // also outputs "ac"

 True for 'a' but when appending to 'a' it writes over the memory which 
 'ab' guarantees is invariant.

Oops, you're right.
 So, in the general case any slice of invariant data which is shorter 
 than the original invariant data can be used to overwrite the original 
 invariant data after the slice.
 
 Perhaps ~= should be disabled for invariant arrays.

Or perhaps it should always reallocate. I'd originally thought it actually did this based on something Walter said, but I misunderstood. Sean
Dec 05 2007
prev sibling parent reply Derek Parnell <derek psych.ward> writes:
On Wed, 05 Dec 2007 16:18:46 +0000, Regan Heath wrote:

 [example pasted again for clarity]
 
  > string ab = "ab".idup;
  > string a = ab[0..1];
  > a ~= "c";
  > writefln("ab = ",ab); // also outputs "ac"

However, this is fine ...

  string ab = "ab";
  string a = ab[0..1];
  a ~= "c";
  writefln("ab = ",ab); // outputs "ab"
  writefln("a  = ",a);  // outputs "ac"

So it seems that the '.idup' property is affecting things.

-- 
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell
Dec 05 2007
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Derek Parnell" wrote
 On Wed, 05 Dec 2007 16:18:46 +0000, Regan Heath wrote:

 [example pasted again for clarity]

  > string ab = "ab".idup;
  > string a = ab[0..1];
  > a ~= "c";
  > writefln("ab = ",ab); // also outputs "ac"

 However, this is fine ...

  string ab = "ab";
  string a = ab[0..1];
  a ~= "c";
  writefln("ab = ",ab); // outputs "ab"
  writefln("a  = ",a);  // outputs "ac"

 So it seems that the '.idup' property is affecting things.

Yes, I noticed that too. However, it's simply the non-deterministic behavior of the ~= operator that is causing this. For literal strings, I suspect they are not allocated by the GC, and so the GC can't extend them, so the normal behavior kicks in.

But idup is supposed to give me an invariant array. The code is still changing invariant data...

-Steve
Dec 05 2007
parent Regan Heath <regan netmail.co.nz> writes:
Steven Schveighoffer wrote:
 "Derek Parnell" wrote
 On Wed, 05 Dec 2007 16:18:46 +0000, Regan Heath wrote:

 [example pasted again for clarity]

  > string ab = "ab".idup;
  > string a = ab[0..1];
  > a ~= "c";
  > writefln("ab = ",ab); // also outputs "ac"

  string ab = "ab";
  string a = ab[0..1];
  a ~= "c";
  writefln("ab = ",ab); // outputs "ab"
  writefln("a  = ",a);  // outputs "ac"

 So it seems that the '.idup' property is affecting things.

Yes, I noticed that too. However, it's simply the non-deterministic behavior of the ~= operator that is causing this. For literal strings, I suspect they are not allocated by the GC, and so the GC can't extend them, so the normal behavior kicks in. But idup is supposed to give me an invariant array. The code is still changing invariant data...

To me, all the behaviour is "normal" ;) I think you're right about the reason it copies in this case.

I wonder if the solution is for the GC to keep a separate list of memory blocks which are invariant... then on reallocate it simply ignores this list - resulting in the same behavior as the string literal case.

Regan
Dec 06 2007
prev sibling parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Derek Parnell wrote:
 However, this is fine ...
 
  string ab = "ab";
  string a = ab[0..1];
  a ~= "c";
  writefln("ab = ",ab); // outputs "ab"
  writefln("a  = ",a);  // outputs "ac"
 
 So it seems that the '.idup' property is affecting things.

It's probably just a side effect of the fact that string literals are immutable. The compiler knows that it has to reallocate when appending to it, I guess?

import std.stdio;

void main() {
	int[] ab = [0, 1];
	int[] a = ab[0..1];

	a ~= 2;

	writefln(ab);
	writefln(a);
}

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Dec 05 2007
parent Regan Heath <regan netmail.co.nz> writes:
Matti Niemenmaa wrote:
 Derek Parnell wrote:
 However, this is fine ...

  string ab = "ab";
  string a = ab[0..1];
  a ~= "c";
  writefln("ab = ",ab); // outputs "ab"
  writefln("a  = ",a);  // outputs "ac"

 So it seems that the '.idup' property is affecting things.

It's probably just a side effect of the fact that string literals are immutable. The compiler knows that it has to reallocate when appending to it, I guess?

I suspect you're right. I think the reason is that a string literal is not allocated in the same way as the result from idup and maybe does not appear in the GC's list of memory blocks. So, if the GC doesn't know about it, the GC does not reallocate but instead copies. Regan
Dec 06 2007
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Regan Heath" wrote
 2. It could be argued that "concatenation" is the x ~ y form and not the 
 ~= form, which is called "append".  From the website spec:

 "The binary operator ~ is the cat operator. It is used to concatenate 
 arrays"

 "Similarly, the ~= operator means append"

 "Concatenation always creates a copy of its operands, even if one of the 
 operands is a 0 length array"

Look at the example for append:

"a ~= b; // a becomes the concatenation of a and b"

This is the only explanation of what "append" does. Yes, we are splitting hairs, but they are important hairs to split :) Having the spec be accurate is important not only for compiler implementors (which right now doesn't matter much but might in the future) but also for developers using D.

Just a simple explanation would do:

append may or may not re-use the memory that the original array uses. Therefore you should not use the append operator unless you know the array to be appended to is a dynamic array and not a slice of a dynamic array. If this isn't the case, memory corruption can occur:

(paste Oskar's example here)

-Steve
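
For anyone who wants the copying behaviour the spec describes for concatenation, here is a minimal sketch of two ways to force it: use binary ~, which always builds a new array, or .dup the slice before appending to it. It assumes a current D2 compiler (hence writeln rather than the old writefln calls used in this thread), and the helper names appendViaCat and appendViaDup are made up for illustration.

```d
import std.stdio;

// Hypothetical helper: binary ~ always builds a new array,
// so the block that 'slice' points into is never written to.
char[] appendViaCat(char[] slice, char tail)
{
    return slice ~ tail;
}

// Hypothetical helper: copy the slice first, then append to the copy.
char[] appendViaDup(char[] slice, char tail)
{
    char[] copy = slice.dup;
    copy ~= tail;
    return copy;
}

void main()
{
    char[] ab = "ab".dup;
    char[] a = ab[0 .. 1];

    char[] viaCat = appendViaCat(a, 'c');
    char[] viaDup = appendViaDup(a, 'c');

    writeln(ab);     // the original block is untouched in both cases
    writeln(viaCat);
    writeln(viaDup);
}
```

Either form costs an allocation per call, which is exactly the trade-off against ~= being argued about here.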
Dec 05 2007
prev sibling parent Sean Kelly <sean f4.ca> writes:
Steven Schveighoffer wrote:
 "Regan Heath" wrote
 Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 Steven Schveighoffer wrote:
 "Steven Schveighoffer" wrote
 Another reason why this seems to be a bug and NOT a feature:

        string ab = "ab".idup;
        string a = ab[0..1];
        a ~= "c";
        writefln("ab = ",ab); // also outputs "ac"

 This changes an invariant string without compiler complaint!




(which to be honest is much more worrying). I am a bit confused as to which example Sean was saying was "expected behaviour".
 Behavior by design, perhaps.  Expected, I should hope not.  I would never 
 expect to be able to have one variable overwrite another without obvious 
 casting.

Both variables above are references to the same data. You're using one variable to change that data, therefore the other variable which still refers to the same data, sees the changes. If the concatenation operation had to reallocate the memory it would produce a copy, and you wouldn't see the changes. So, this behaviour is non deterministic, however...

The problem is that invariant data is changing. This is a no-no for pure functions which Walter has planned. If invariant data can change without violating the rules of the spec, then the compiler implementation or design is flawed. I think the design is what is flawed.

One could argue that invariant data is changing because of a programmer error, but you have a point.
 I have several problems with this concat operator issue.
 
 First, that x ~= y does not effect the same behavior as x = x ~ y.  This is 
 a fundamental flaw in the language in my opinion.  any operator of the op= 
 form is supposed to mean the same as x = x op y.  This is consistent 
 throughout all of D, except in this case.
 
 Second, there is the issue of the spec.  The spec clearly states that 
 concatenation should result in a copy of both sides.  Obviously, this isn't 
 true in all cases.  The spec should be changed for both D 1.x and 2.x 
 IMMEDIATELY to prevent unsuspecting coders from using ~= when what they 
 really want is just ~.

Or the runtime could be changed to always copy. However, it would absolutely murder application performance for something like this:

    char[] buf;
    for( int i = 0; i < 1_000_000; ++i )
        buf ~= 'a';

And looping on an append is a pretty typical use case, in my experience.
 Third, I have not seen this T[new] operator described anywhere, but I am 
 concerned that D 1.0 will not be updated.  This leaves all coders who are 
 not ready to switch to D 2 at risk.  But from the inferred behavior of 
 T[new], I'm expecting that this will probably fix the problem.

The T[new] syntax basically said that resizable arrays would be declared as T[new] and non-resizable slices would be declared as T[]. My major problem with this is that it would change the way normal arrays are declared, and break tons of code in the process. Sean
Dec 05 2007
prev sibling parent Sean Kelly <sean f4.ca> writes:
Regan Heath wrote:
 
 The example shown above is not corrupting any memory.  The 2nd one (not 
 shown above) seems to be and it worries me much more.

The same "copy on write" issue applies to each case, but I agree that the behavior of the second is certainly less appealing. And you're right that it can and will corrupt memory if used in this manner. A pointer to the head of the struct is equal to a pointer to the head of the array, so right now the runtime is assuming that the entire block belongs to the array, which is wrong. Unfortunately, there is little that can be done about this mechanically. When a slice of the struct array is taken, type information is lost, so the compiler doesn't even know there's a struct involved, and so would be unable to supply additional type information to the runtime. Sean
Dec 05 2007
prev sibling parent reply Regan Heath <regan netmail.co.nz> writes:
Steven Schveighoffer wrote:
 import std.stdio;
 
 struct X
 {
         char[5] myArray;
         int x;
 }
 
 void main()
 {
         X[] x = new X[2];
         x[0].myArray[] = "hello";
         char[] myslice = x[0].myArray[0..3];
         writefln("%x %x %x", &x[0].x, &x[0].myArray[0], &myslice[0]);
         myslice ~= "hithere";
         writefln("%x %x %x", &x[0].x, &x[0].myArray[0], &myslice[0]);
         writefln("%s %d", x[0].myArray, x[0].x);
 }
 
 output:
 
 868FE8 868FE0 868FE0
 868FE8 868FE0 868FE0
 helhi 25970

This one worries me. I believe the problem is caused by the memory address of myArray[0] being the same as the memory address of the struct. Is this what you realised Sean... I may be a bit slow on the uptake here :)

When the slice needs to reallocate the GC checks this address and finds enough space following the struct (or perhaps it has allocated on a power of two boundary and already has enough) and it allows the concatenation to write to that memory.

The problem is that it doesn't realise the memory was allocated to a struct, and is being reallocated by an array slice. So, the array concatenation overwrites the memory occupied by the int 'x'. Ick.

I would have expected a static array to be un-reallocatable, so any concatenation performed on a slice of one to cause a copy to be made. But of course all that information is lost at the place where the reallocation is done, it's simply a memory address with a certain amount of memory associated with it.

R
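
The "helhi 25970" output can be checked by hand. A rough sketch, assuming a little-endian machine with a 4-byte aligned int (which matches the 8-byte gap between the printed addresses) and a current D2 compiler; lowTwoBytes is a made-up helper for illustration. The slice is 3 chars long, so "hithere" lands at offsets 3 to 9 of the struct: 'h' and 'i' finish off myArray, 't', 'h', 'e' fall into the padding, and 'r', 'e' become the two low bytes of x.

```d
import std.stdio;

struct X
{
    char[5] myArray;
    int x;
}

// Hypothetical helper: the value an int holds when its two lowest bytes
// are overwritten with 'lo' then 'hi' (little-endian) and the upper
// bytes stay zero.
int lowTwoBytes(char lo, char hi)
{
    return lo | (hi << 8);
}

void main()
{
    X[] xs = new X[2];

    // The first member starts at offset 0, so a pointer to it is
    // indistinguishable from a pointer to the struct itself.
    assert(cast(void*) &xs[0] is cast(void*) xs[0].myArray.ptr);

    writeln(X.myArray.offsetof);
    writeln(X.x.offsetof);

    // 'r' is 0x72 and 'e' is 0x65, giving 0x6572
    writeln(lowTwoBytes('r', 'e')); // prints 25970
}
```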
Dec 05 2007
parent Sean Kelly <sean f4.ca> writes:
Regan Heath wrote:
 Steven Schveighoffer wrote:
 import std.stdio;

 struct X
 {
         char[5] myArray;
         int x;
 }

 void main()
 {
         X[] x = new X[2];
         x[0].myArray[] = "hello";
         char[] myslice = x[0].myArray[0..3];
         writefln("%x %x %x", &x[0].x, &x[0].myArray[0], &myslice[0]);
         myslice ~= "hithere";
         writefln("%x %x %x", &x[0].x, &x[0].myArray[0], &myslice[0]);
         writefln("%s %d", x[0].myArray, x[0].x);
 }

 output:

 868FE8 868FE0 868FE0
 868FE8 868FE0 868FE0
 helhi 25970

This one worries me. I believe the problem is caused by the memory address of myArray[0] being the same as the memory address of the struct. Is this what you realised Sean... I may be a bit slow on the uptake here :)

Yes :-) Sean
Dec 05 2007
prev sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 12/5/07, Sean Kelly <sean f4.ca> wrote:
 Or the runtime could be changed to always copy.  However, it would
 absolutely murder application performance for something like this:

 char[] buf;
 for( int i = 0; i < 1_000_000; ++i )
      buf ~= 'a';

It would, /unless/ we had a vector type (a C++ std::vector, not a math vector). Java has something similar, I believe - immutable strings, but also a StringBuffer type. (I could be wrong about the details).

Anyway, the point is, you'd just rewrite the above loop as:

    Vector!(char) buf;
    for( int i = 0; i < 1_000_000; ++i )
        buf ~= 'a';
    //
    // and when you're done
    //
    return buf.toArray();

That sort of thing.
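
A sketch of the same idea using std.array.appender, which plays roughly the role of the Vector described above. Note the assumption: Appender comes from later D2 versions of Phobos and was not available when this thread was written. The function name fill is made up for illustration.

```d
import std.array : appender;
import std.stdio;

// Hypothetical helper: build an array of n copies of 'a' through an
// Appender, which owns its buffer and grows it in amortised steps.
char[] fill(size_t n)
{
    auto buf = appender!(char[])();
    buf.reserve(n);          // optional: one allocation up front
    foreach (i; 0 .. n)
        buf.put('a');
    return buf.data;         // hand back the accumulated array
}

void main()
{
    writeln(fill(1_000_000).length);
}
```

Because the Appender never aliases someone else's slice, the loop keeps its amortised cost without the overwrite problem discussed in this thread.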
Dec 05 2007
prev sibling parent mandel <oh no.es> writes:
Steven Schveighoffer wrote:
[..]
 Now I create the valid array slices:
 
 int[] array3 = array1[$..$];
 int[] array4 = array2[0..0];
 
 Note that both of these arrays are bit-for-bit identical (both have 0
 length and the same ptr value).  Which one points to which piece of
 memory?  How is the GC to decide which memory gets collected?

The first possible solution that comes to my mind seeing this is to make array1[0..0] and array1[$..$] equal. array1[$..$] could point to the beginning of the array. Since the slice length is zero, it shouldn't matter, should it?

On second thought, why not ignore empty slices altogether by telling the GC that such pointers don't hold any data?

Anyway, I guess there are some things I missed. ;-)

[..]
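
To make the two kinds of empty slice concrete, a small sketch assuming a current D2 compiler; emptyHead and emptyTail are made-up names. It only shows where the pointers end up and says nothing about how the GC treats them.

```d
import std.stdio;

// Hypothetical helpers returning the two empty slices under discussion.
int[] emptyHead(int[] a) { return a[0 .. 0]; }
int[] emptyTail(int[] a) { return a[$ .. $]; }

void main()
{
    int[] array1 = new int[4];

    int[] head = emptyHead(array1);
    int[] tail = emptyTail(array1);

    // Both slices are empty...
    assert(head.length == 0 && tail.length == 0);

    // ...but head points at the first element, while tail points one
    // past the last element, i.e. just outside the array's own data.
    assert(head.ptr is array1.ptr);
    assert(tail.ptr is array1.ptr + array1.length);

    writeln(tail.ptr - head.ptr); // distance in elements
}
```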
 I hope this helps your understanding of the issue.

Dec 03 2007