www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Change representation of dynamic arrays?

reply Walter Bright <newshound1 digitalmars.com> writes:
Currently, arrays are represented under the hood as:

	size_t lengthOfArray;
	void* ptrToStartOfArray;

Which works out reasonably well. The problem is if you want to use array 
types as the basis of iterators, and you want to step through the array. 
There's no escaping it being two operations:

	decrement the length
	increment the pointer

This puts a brick in any fast implementation of iterators. To fix that, 
we can change the representation to:

	void* ptrToStartOfArray;
	void* ptrPastEndOfArray;

Then there's just one increment. Some tests show this can improve loop 
performance by up to 70%.

So, what does this not break?

1) Doesn't break array.ptr, this will still work.
2) Doesn't break array.length as rvalue, as this is rewritten by the 
compiler as (array.end - array.start).
3) Doesn't break array.length as an lvalue, as that is handled by the 
runtime library anyway.
4) Won't break anything on D 1.0, as it wouldn't get this change.
5) Won't break array slices, or any of that stuff we love about D arrays.

What does this break?

1) Passing dynamic arrays to printf as in:

	printf("my string is %*.s\n", str);

which relied on the under-the-hood representation. This doesn't work on 
some architectures anyway, and is thoroughly obsolete. One could quickly 
fix such code by writing it as:

	printf("my string is %*.s\n", str.length, str.ptr);

2) It breaks the internal library support code, but that's my problem.

3) It breaks binary compatibility with libraries already compiled. But 
we expect to break binary compatibility with D 2.0.

4) It breaks things like cast(ulong)str, if one was crazy enough to do 
that anyway.

5) It breaks anything that tries to look at the underlying 
representation of dynamic arrays - but such code should be rewritten to 
use .ptr and .length anyway, or slice notation.

So, what do you think?
Oct 19 2007
next sibling parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
Walter Bright wrote:
 [stuff]

I'm hesitant to break something like this which is basically invisible to the end user. Then again, a 70% speed increase is not something to turn one's nose up at! And, as you said, you D 2.0 was forked so that changes like this could be made. You say you've measured the performance of loops using this, but what about slicing code? I know that I use a lot of [lower..$-something] style code; how much slower does this become? Once I'm not so busy, I might try to work out the proportion of foreach to $ slicing in my code. :P But, if the gains are there, and there's nothing critical that it breaks that you haven't thought of, I can't see any reason not to do it. -- Daniel P.S. I assume that array.ptr = array.start, and we'll have access to array.end, yes?
Oct 19 2007
next sibling parent reply 0ffh <spam frankhirsch.net> writes:
Daniel Keep wrote:
 You say you've measured the performance of loops using this, but what
 about slicing code?  I know that I use a lot of [lower..$-something]
 style code; how much slower does this become?  Once I'm not so busy, I
 might try to work out the proportion of foreach to $ slicing in my code. :P

I guess we go from changing one pointer and a size_t to changing two pointers, I can't see a lot of space for that being slower than now. Consider: wchar[] str="this is a test"; int s=10; // slice start int e=14; // slice end tst=s[s..e]; Old array: tst.ptr=str.ptr+s*wchar.sizeof; tst.len=e-s; New array: tst.start=str.start+s*wchar.sizeof; tst.end=str.start+e*wchar.sizeof; Regards, Frank
Oct 19 2007
parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
0ffh wrote:
 Daniel Keep wrote:
 You say you've measured the performance of loops using this, but what
 about slicing code?  I know that I use a lot of [lower..$-something]
 style code; how much slower does this become?  Once I'm not so busy, I
 might try to work out the proportion of foreach to $ slicing in my
 code. :P

I guess we go from changing one pointer and a size_t to changing two pointers, I can't see a lot of space for that being slower than now. Consider: wchar[] str="this is a test"; int s=10; // slice start int e=14; // slice end tst=s[s..e]; Old array: tst.ptr=str.ptr+s*wchar.sizeof; tst.len=e-s; New array: tst.start=str.start+s*wchar.sizeof; tst.end=str.start+e*wchar.sizeof; Regards, Frank

Well, that's definitely more work to compute end, although in most cases, the multiply could be reduced to a bit shift. I was talking about cases where we slice using the length of the array, since now we have to actually go and compute the length instead of simply looking it up. e = 2; tst = str[s .. $-e]; Old array: tst.ptr = str.ptr + s*wchar.sizeof; tst.len = (str.length-e) - s; New array: tst.start = str.start + s*wchar.sizeof; //tst.end = str.start + (str.length-e)*wchar.sizeof; tst.end = str.start + ((str.end-str.start)-e)*wchar.sizeof; So we've gone from two subtractions to an addition, two subtractions and a multiply; that's double the number of operations, one of which is an expensive multiply. If the primary justification for this change is eliminating a decrement operation for iteration, I think it's very relevant to look at how it's going to affect other code! -- Daniel
Oct 20 2007
parent Daniel Keep <daniel.keep.lists gmail.com> writes:
Janice Caron wrote:
 On 10/20/07, Janice Caron <caron800 googlemail.com> wrote:
 On 10/20/07, Daniel Keep <daniel.keep.lists gmail.com> wrote:
 tst.end   = str.start + ((str.end-str.start)-e)*wchar.sizeof;


See, the $ wouldn't expand to (str.end-str.start) as you implied, it

Ah yes, silly me for quoting something Walter wrote without thinking about it first. :P
 would expand to ((str.end-str.start)/wchar.sizeof). Obviously the two
 str.starts cancel each other out, so you can adjust the end pointer
 (relative to $) without looking at the start pointer at all.

Assuming the compiler is smart enough to do algebraic elimination. I think I'll also have to take a look at what code the compiler generates in these cases. -- Daniel
Oct 20 2007
prev sibling next sibling parent David Brown <dlang davidb.org> writes:
On Sat, Oct 20, 2007 at 01:11:44PM +1000, Daniel Keep wrote:

You say you've measured the performance of loops using this, but what
about slicing code?  I know that I use a lot of [lower..$-something]
style code; how much slower does this become?  Once I'm not so busy, I
might try to work out the proportion of foreach to $ slicing in my code. :P

I wouldn't expect the performance to change in this case. You're doing an addition to adjust the start pointer, and a subtraction to adjust the length. The new representation would do the same addition and subtraction. In fact, it looks like it would even generate the same code. David
Oct 19 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Daniel Keep <daniel.keep.lists gmail.com> wrote:
 tst.end   = str.start + ((str.end-str.start)-e)*wchar.sizeof;

tst.end = str.end - e*wchar.sizeof;
Oct 20 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Janice Caron <caron800 googlemail.com> wrote:
 On 10/20/07, Daniel Keep <daniel.keep.lists gmail.com> wrote:
 tst.end   = str.start + ((str.end-str.start)-e)*wchar.sizeof;

tst.end = str.end - e*wchar.sizeof;

See, the $ wouldn't expand to (str.end-str.start) as you implied, it would expand to ((str.end-str.start)/wchar.sizeof). Obviously the two str.starts cancel each other out, so you can adjust the end pointer (relative to $) without looking at the start pointer at all.
Oct 20 2007
prev sibling next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Daniel Keep <daniel.keep.lists gmail.com> wrote:
 but what
 about slicing code?  I know that I use a lot of [lower..$-something]

While this may be optimised for simple arrays, it poses much more of a problem for classes and structs which override opIndex() and opSlice(). The problem is that indeces in those functions are always specified as being relative to the beginning - there is no way to specify that an index should be considered relative to the end - so when you do a[0..$-1] in such a struct, the compiler has to convert $-1 into (a.length()-1). Then, inside opSlice(), to turn it back to a pointer you'd have to do (start+index). I know adds and subtracts are not that expensive, but given that this change is being talked about /because/ we want to get rid of a subtract, it is worth thinking about this. Presumably this will all sort itself out if we get iterators? When that happens, slice expressions could be rewritten by the compiler as iterator operations, which in turn would call opDereference() or opRange(), taking iterator parameters. Is that likely to happen, or will user-defined types always take the slow route?
Oct 20 2007
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Janice Caron wrote:
 On 10/20/07, Daniel Keep <daniel.keep.lists gmail.com> wrote:
 but what
 about slicing code?  I know that I use a lot of [lower..$-something]

While this may be optimised for simple arrays, it poses much more of a problem for classes and structs which override opIndex() and opSlice(). The problem is that indeces in those functions are always specified as being relative to the beginning - there is no way to specify that an index should be considered relative to the end - so when you do a[0..$-1] in such a struct, the compiler has to convert $-1 into (a.length()-1). Then, inside opSlice(), to turn it back to a pointer you'd have to do (start+index). I know adds and subtracts are not that expensive, but given that this change is being talked about /because/ we want to get rid of a subtract, it is worth thinking about this. Presumably this will all sort itself out if we get iterators? When that happens, slice expressions could be rewritten by the compiler as iterator operations, which in turn would call opDereference() or opRange(), taking iterator parameters. Is that likely to happen, or will user-defined types always take the slow route?

If you have a user-defined struct, you get to decide how it is implemented. How the built-in arrays are done under the hood shouldn't be relevant.
Oct 20 2007
parent Bill Baxter <dnewsgroup billbaxter.com> writes:
Walter Bright wrote:
 Janice Caron wrote:
 On 10/20/07, Daniel Keep <daniel.keep.lists gmail.com> wrote:
 but what
 about slicing code?  I know that I use a lot of [lower..$-something]

While this may be optimised for simple arrays, it poses much more of a problem for classes and structs which override opIndex() and opSlice(). The problem is that indeces in those functions are always specified as being relative to the beginning - there is no way to specify that an index should be considered relative to the end - so when you do a[0..$-1] in such a struct, the compiler has to convert $-1 into (a.length()-1). Then, inside opSlice(), to turn it back to a pointer you'd have to do (start+index).


'$' doesn't work in user defined types. So the rest of your comment is only relevant to some hypothetical future D where $ is allowed for user types. Andrei strongly favored making it mean .length everywhere if I recall correctly, so if anything happens with it that's probably what it will be, and your comment will be relevant.
 I know adds and subtracts are not that expensive, but given that this
 change is being talked about /because/ we want to get rid of a
 subtract, it is worth thinking about this.

 Presumably this will all sort itself out if we get iterators? When
 that happens, slice expressions could be rewritten by the compiler as
 iterator operations, which in turn would call opDereference() or
 opRange(), taking iterator parameters. Is that likely to happen, or
 will user-defined types always take the slow route?

If you have a user-defined struct, you get to decide how it is implemented. How the built-in arrays are done under the hood shouldn't be relevant.

It becomes relevant if $ becomes a synonym for .length as was previously proposed. In that case you don't get to decide how $ is implemented. It has to compute the length, even if that's O(N) and not really needed for computing the $-1'th node. Or if all you're trying to do is make a light wrapper around a built-in array (to, say, add an extra data member to it), then it means that your opIndex could be significantly less efficient than the built-in one. --bb
Oct 20 2007
prev sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Bill Baxter <dnewsgroup billbaxter.com> wrote:
 It becomes relevant if $ becomes a synonym for .length as was previously
 proposed.  In that case you don't get to decide how $ is implemented.
 It has to compute the length, even if that's O(N) and not really needed
 for computing the $-1'th node.

 Or if all you're trying to do is make a light wrapper around a built-in
 array (to, say, add an extra data member to it), then it means that your
 opIndex could be significantly less efficient than the built-in one.

I guess what I was thinking is... suppose I have an arbitrary, user-defined collection class: MyCollection!(int) c; and I want to dereference the last element. The source code might be: int n = c[$-1]; Now, currently, (or if $ gets defined as length), that would get turned into: int n = c.opIndex(c.length()-1); which /might/ be computationally intensive, depending on what MyCollection does with length() and opIndex(). The thing is, Walter seemed to imply that iterators were on their way, so what I was thinking was, if int n = c[$-1]; could instead be translated into int n = *(c.end()-1); then the computational overhead would vanish. I think we'd need something like opDereference() to overload the unary *. But, see, you'd never actually have to /write/ the *, because you'd still write "c[$-1]" - it's just what's going on under the hood. Similarly, I'm thinking the expression c[a,$-b] could turn into opRange(c.begin()+a, c.end()-b), instead of c.opIndex(a, c.length()-1). But it's all just pie in the sky right now, because I've got absolutely no idea what, if anything, Walter plans to do with collections and iterators.
Oct 20 2007
next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Janice,

 On 10/20/07, Bill Baxter <dnewsgroup billbaxter.com> wrote:
 
 It becomes relevant if $ becomes a synonym for .length as was
 previously
 proposed.  In that case you don't get to decide how $ is implemented.
 It has to compute the length, even if that's O(N) and not really
 needed
 for computing the $-1'th node.
 Or if all you're trying to do is make a light wrapper around a
 built-in array (to, say, add an extra data member to it), then it
 means that your opIndex could be significantly less efficient than
 the built-in one.
 

user-defined collection class: MyCollection!(int) c; and I want to dereference the last element. The source code might be: int n = c[$-1]; Now, currently, (or if $ gets defined as length), that would get turned into: int n = c.opIndex(c.length()-1);

if the return of .length is left unrestricted, then you can play games with having it return a proxy type that another opIndex then uses.
Oct 20 2007
parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
BCS wrote:

 if the return of .length is left unrestricted, then you can play games 
 with having it return a proxy type that another opIndex then uses.

Yes, it doesn't have to be more complicated than so. That would handle the multidimensional case as well. Possibly let $ -> opDollar() instead of length(). Then, all we need is to make a..b turn into a special range type and depreciate opSlice. The remaining problem would be that while .length always explicitly refers to an object. $ only does that implicitly within index operations. A global _dollar() would solve that and allow: auto a = 1 .. $-1; auto b = $-5 .. $; auto c = myObj[a,b]; But that might be too much to ask for. -- Oskar
Oct 21 2007
prev sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/21/07, Oskar Linde <oskar.lindeREM ovegmail.com> wrote:
 But that might be too much to ask for.

I think those are /excellent/ ideas. So I assume opDollar would return some proxy type which represents "offset from end", and looks something like struct Dollar { int offset = 0; static Dollar opCall(int n) { Dollar d; d.offset=n; return d; } Dollar opAdd(int n) { return Dollar(offset+n); } Dollar opAdd_r(int n) { return Dollar(offset+n); } Dollar opSub(int n) { return Dollar(offset-n); } } and x..y returns a Slice!(typeof(x),typeof(y)) where struct Slice(T,U) { T x; U y; } (no need to implement opDotDot - that can happen automatically), and then opIndex knows to expect either an integer, a Dollar, or Slice whose elements are integers or Dollars. (More exotic uses of course can be imagined, but that would be enough to emulate current behavior). I like it!
Oct 21 2007
prev sibling next sibling parent reply 0ffh <spam frankhirsch.net> writes:
Walter Bright wrote:
 [see news://news.digitalmars.com:119/ffbr56$r40$1 digitalmars.com]
 What does this break?
 1) Passing dynamic arrays to printf as in:
     printf("my string is %*.s\n", str);

Hmmm, currently I use printf("foo %s\n",std.string.toStringz(bar)) anyway, never tried what you're doing there. Did I miss something?
 2) It breaks the internal library support code, but that's my problem.

Si.
 3) It breaks binary compatibility with libraries already compiled. But 
 we expect to break binary compatibility with D 2.0.

No sweat.
 4) It breaks things like cast(ulong)str, if one was crazy enough to do 
 that anyway.

Never needed it... =)
 5) It breaks anything that tries to look at the underlying 
 representation of dynamic arrays - but such code should be rewritten to 
 use .ptr and .length anyway, or slice notation.

Been a good boy and used .ptr and .length, so I'm fine with it... grin!
 So, what do you think?

I'm all for it! Regards, Frank
Oct 19 2007
parent reply Walter Bright <newshound1 digitalmars.com> writes:
0ffh wrote:
 Walter Bright wrote:
 [see news://news.digitalmars.com:119/ffbr56$r40$1 digitalmars.com]
 What does this break?
 1) Passing dynamic arrays to printf as in:
     printf("my string is %*.s\n", str);

Hmmm, currently I use printf("foo %s\n",std.string.toStringz(bar)) anyway, never tried what you're doing there. Did I miss something?

printf("my string is %*.s\n", str.length, str.ptr); will be a little faster, because it doesn't need to allocate memory.
Oct 19 2007
parent reply =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Walter Bright wrote:

 1) Passing dynamic arrays to printf as in:
     printf("my string is %*.s\n", str);

Hmmm, currently I use printf("foo %s\n",std.string.toStringz(bar)) anyway, never tried what you're doing there. Did I miss something?

printf("my string is %*.s\n", str.length, str.ptr); will be a little faster, because it doesn't need to allocate memory.

When I try this in C, it gives a warning about the "length" argument: field precision should have type 'int', but argument 2 has type 'size_t' Wouldn't this warning apply to C too, when using the printf function ? I'm thinking it might need a cast(int) to work properly on 64-bit... ? --anders
Oct 20 2007
parent Walter Bright <newshound1 digitalmars.com> writes:
Anders F Björklund wrote:
 Walter Bright wrote:
 
 1) Passing dynamic arrays to printf as in:
     printf("my string is %*.s\n", str);

Hmmm, currently I use printf("foo %s\n",std.string.toStringz(bar)) anyway, never tried what you're doing there. Did I miss something?

printf("my string is %*.s\n", str.length, str.ptr); will be a little faster, because it doesn't need to allocate memory.

When I try this in C, it gives a warning about the "length" argument: field precision should have type 'int', but argument 2 has type 'size_t' Wouldn't this warning apply to C too, when using the printf function ? I'm thinking it might need a cast(int) to work properly on 64-bit... ?

I think I'm not going to think about it and just deprecate it <g>.
Oct 20 2007
prev sibling next sibling parent reply "Chris Miller" <chris dprogramming.com> writes:
On Fri, 19 Oct 2007 23:03:02 -0400, Walter Bright  
<newshound1 digitalmars.com> wrote:

 Currently, arrays are represented under the hood as:

 	size_t lengthOfArray;
 	void* ptrToStartOfArray;

 Which works out reasonably well. The problem is if you want to use array  
 types as the basis of iterators, and you want to step through the array.  
 There's no escaping it being two operations:

 	decrement the length
 	increment the pointer

 This puts a brick in any fast implementation of iterators. To fix that,  
 we can change the representation to:

 	void* ptrToStartOfArray;
 	void* ptrPastEndOfArray;

 Then there's just one increment. Some tests show this can improve loop  
 performance by up to 70%.

It sounds interesting... I wonder what kind of iterators you have in mind. Why couldn't the iterators just hold start and end pointers? It would only need to be calculated when creating an iterator from a non-iterator. Or perhaps you want a slice to be an iterator itself, so you could just do slice++ and it slices out the first element.
Oct 19 2007
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Chris Miller wrote:
 I wonder what kind of iterators you have in mind. Why couldn't the 
 iterators just hold start and end pointers? It would only need to be 
 calculated when creating an iterator from a non-iterator. Or perhaps you 
 want a slice to be an iterator itself, so you could just do slice++ and 
 it slices out the first element.

I think you answered it!
Oct 19 2007
parent "Chris Miller" <chris dprogramming.com> writes:
On Sat, 20 Oct 2007 02:27:29 -0400, Walter Bright  
<newshound1 digitalmars.com> wrote:

 Chris Miller wrote:
 I wonder what kind of iterators you have in mind. Why couldn't the  
 iterators just hold start and end pointers? It would only need to be  
 calculated when creating an iterator from a non-iterator. Or perhaps  
 you want a slice to be an iterator itself, so you could just do  
 slice++ and it slices out the first element.

I think you answered it!

I'm starting to think the change might not be such a good idea. Imagine all the existing code extensively using length. This change is for one thing, that could probably just as easily be improved by changing how that one thing is implemented, without affecting all this other code. It's not even really about supporting legacy usage; the other methods are probably mostly quite valid still, and iterators won't solve all those cases. Also, doing slice++ doesn't allow slice-- safely (not that all forward iterators are backwards iterators, but an iterator from a slice should be bidirectional). Slice iterators should probably have 3 pointers: beginning, end, and current position. Just build an iterator from a slice, which is a couple extra cycles up front. I really want to hear more about what's going on, instead of guessing how grand things are going to be, or not going to be. How about another newsgroup for possible future plans, like digitalmars.D.alpha
Oct 20 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Chris Miller <chris dprogramming.com> wrote:
 Or perhaps you
 want a slice to be an iterator itself, so you could just do slice++ and it
 slices out the first element.

It would be very counterintuitive if ("world"++ == "orld"), so I don't think I'd like that. In C++ terms, an interator is often just a pointer (plus some operator overloads), and an array of [ begin, end ] iterators is called a range. Looked at like that, an array slice in the new scheme would be an array of two iterators (i.e. a range), rather than a single iterator. Or so it seems to me.
Oct 20 2007
prev sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Janice Caron <caron800 googlemail.com> wrote:
 On 10/20/07, Chris Miller <chris dprogramming.com> wrote:
 Or perhaps you
 want a slice to be an iterator itself, so you could just do slice++ and it
 slices out the first element.

It would be very counterintuitive if ("world"++ == "orld"), so I don't think I'd like that.

To add to that, iterators in general are not range checked, because they are generalised pointers, not generalised indeces. You could argue that perhaps they /should/ be range checked (at least, in debug builds in D), but to achieve that, you'd need a struct containing not two, but /three/ pointers - one to the start of the slice, one to the end of the slice, and one to the thing pointed at. Otherwise you'd never be able to do --p.
Oct 20 2007
prev sibling next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Walter Bright <newshound1 digitalmars.com> wrote:
 1) Passing dynamic arrays to printf as in:

         printf("my string is %*.s\n", str);

This will break some of my old code, but none of my new code. What you could do to fix that is to break it even more (ironically). Make "printf" a reserved word, or an alias for writef, or /something/ which will cause any use of it to become a compile error.
 So, what do you think?

I had always assumed that, from the start, you designed it as { ptr, length } deliberately in order to help the garbage collector. If you change it, then you will be maintaining a pointer (the end pointer) to something beyond the end of the array. For example, suppose I do: int[] s = new int[4]; int[] t = new int[1000]; now t.begin might end up being equal to s.end. But now when we do t = null; or t goes out of scope or we otherwise assign t, suddenly that array can no longer be freed, because the end pointer of s points to it. (Not that I particularly care. Just pointing it out. Go for it).
Oct 19 2007
next sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Janice Caron wrote:

 I had always assumed that, from the start, you designed it as { ptr,
 length } deliberately in order to help the garbage collector.
 
 If you change it, then you will be maintaining a pointer (the end
 pointer) to something beyond the end of the array. For example,
 suppose I do:
 
     int[] s = new int[4];
     int[] t = new int[1000];
 
 now t.begin might end up being equal to s.end. 

All array allocations in D allocate an extra byte just to prevent this. But now when we do
 
     t = null;
 
 or t goes out of scope or we otherwise assign t, suddenly that array
 can no longer be freed, because the end pointer of s points to it.

-- Oskar
Oct 19 2007
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Janice Caron wrote:
 On 10/20/07, Walter Bright <newshound1 digitalmars.com> wrote:
 1) Passing dynamic arrays to printf as in:

         printf("my string is %*.s\n", str);

This will break some of my old code, but none of my new code. What you could do to fix that is to break it even more (ironically). Make "printf" a reserved word, or an alias for writef, or /something/ which will cause any use of it to become a compile error.

Just add 'deprecate' <g>.
 
 
 So, what do you think?

I had always assumed that, from the start, you designed it as { ptr, length } deliberately in order to help the garbage collector. If you change it, then you will be maintaining a pointer (the end pointer) to something beyond the end of the array. For example, suppose I do: int[] s = new int[4]; int[] t = new int[1000]; now t.begin might end up being equal to s.end. But now when we do t = null; or t goes out of scope or we otherwise assign t, suddenly that array can no longer be freed, because the end pointer of s points to it.

No prob, t=null already sets both members to 0.
 
 (Not that I particularly care. Just pointing it out. Go for it).

Oct 20 2007
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Janice Caron wrote:
 That said, someone posted that you always allocate an extra byte. If
 that's true, problem solved.

Yes, an extra byte is allocated for just this reason.
Oct 20 2007
parent reply Jascha Wetzel <firstname mainia.de> writes:
Walter Bright wrote:
 Janice Caron wrote:
 That said, someone posted that you always allocate an extra byte. If
 that's true, problem solved.

Yes, an extra byte is allocated for just this reason.

couldn't the end pointer just point to the last element? iteration could use <= the length would be end-start+1 etc.
Oct 20 2007
parent reply Jascha Wetzel <firstname mainia.de> writes:
Janice Caron wrote:
 On 10/20/07, Jascha Wetzel <firstname mainia.de> wrote:
 Walter Bright wrote:
 Janice Caron wrote:
 That said, someone posted that you always allocate an extra byte. If
 that's true, problem solved.



Oooh yuk. How horrible. I'd rather that didn't happen. For one thing, it means that empty arrays (and empty collection ranges generally) would have to have a different representation. No, I like the "standard" way of doing things, whereby the end iterator refences the last+1 element (and therefore cannot be dereferenced).

one can argue, that this way, empty arrays have an unambiguous representation (they are always null). there is enough code like if ( str is null || str == "" ) which wouldn't be necessary. the current semantics deal with null arrays in the same way as with empty arrays (length == 0, appending works alike). therefore, the additional expressiveness one would usually expect (empty array vs. invalid array) isn't quite there. but i still wonder, whether there would be performance issues.
Oct 20 2007
parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Jascha Wetzel wrote:
 Janice Caron wrote:
 On 10/20/07, Jascha Wetzel <firstname mainia.de> wrote:
 Walter Bright wrote:
 Janice Caron wrote:
 That said, someone posted that you always allocate an extra byte. If
 that's true, problem solved.



Oooh yuk. How horrible. I'd rather that didn't happen. For one thing, it means that empty arrays (and empty collection ranges generally) would have to have a different representation. No, I like the "standard" way of doing things, whereby the end iterator refences the last+1 element (and therefore cannot be dereferenced).

one can argue, that this way, empty arrays have an unambiguous representation (they are always null). there is enough code like if ( str is null || str == "" ) which wouldn't be necessary. the current semantics deal with null arrays in the same way as with empty arrays (length == 0, appending works alike).

Not quite true. if(str){} tests only the .ptr part and ignores the length (as I quite painfully discovered recently... Sure wish there were an easy way to find all the places in my code where I used that expression instead of .length==0.). --bb
Oct 20 2007
parent BCS <ao pathlink.com> writes:
Reply to Bill,

 Jascha Wetzel wrote:
 
 Janice Caron wrote:
 
 On 10/20/07, Jascha Wetzel <firstname mainia.de> wrote:
 
 Walter Bright wrote:
 
 Janice Caron wrote:
 
 That said, someone posted that you always allocate an extra byte.
 If that's true, problem solved.
 



thing, it means that empty arrays (and empty collection ranges generally) would have to have a different representation. No, I like the "standard" way of doing things, whereby the end iterator refences the last+1 element (and therefore cannot be dereferenced).

representation (they are always null). there is enough code like if ( str is null || str == "" ) which wouldn't be necessary. the current semantics deal with null arrays in the same way as with empty arrays (length == 0, appending works alike).

length (as I quite painfully discovered recently... Sure wish there were an easy way to find all the places in my code where I used that expression instead of .length==0.). --bb

Ohh. That's another feature for an IDE: Find this pattern in the syntax/semantic tree.
Oct 20 2007
prev sibling parent =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Walter Bright wrote:

 This will break some of my old code, but none of my new code.

 What you could do to fix that is to break it even more (ironically).
 Make "printf" a reserved word, or an alias for writef, or /something/
 which will cause any use of it to become a compile error.

Just add 'deprecate' <g>.

Or maybe just move it from object.d and hello.d, over to std.c.stdio ? :-P --anders
Oct 20 2007
prev sibling next sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Walter Bright wrote:
 Currently, arrays are represented under the hood as:
 
     size_t lengthOfArray;
     void* ptrToStartOfArray;
 
 Which works out reasonably well. The problem is if you want to use array 
 types as the basis of iterators, and you want to step through the array. 
 There's no escaping it being two operations:
 
     decrement the length
     increment the pointer

This has been nagging me too. Doing a = a[1..$]; has never felt very efficient. Will there be a different syntax for such idioms?
 What does this break?
 
 1) Passing dynamic arrays to printf as in:
 
     printf("my string is %*.s\n", str);

I'm glad that breaks. That code was never portable anyway, and has caused headaches porting D code to other platforms.
 4) It breaks things like cast(ulong)str, if one was crazy enough to do 
 that anyway.

This is something 64-bit D also breaks.
 So, what do you think?

It really shows the power of the property syntax that one can transparently change the underlying representation like this. I am all for it (as far as I can see now). Two related things that have been holding me off D2 are waiting for some announced code breaking changes that will have quite some impact on how code is written (not mentioning const): 1. T[new] - Hopefully T[*] (hint ;) ) - Hopefully a reference type 2. Static arrays becoming value types - Being able to return them from functions - No longer having to litter the code with incompatible wrapper structs Are the ETA for those still far away? -- Oskar
Oct 19 2007
next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Oskar Linde <oskar.lindeREM ovegmail.com> wrote:
 1) Passing dynamic arrays to printf as in:

     printf("my string is %*.s\n", str);

I'm glad that breaks. That code was never portable anyway

I'd be happy for it to break if the break could be detected at compile time (or even at runtime with an assert). I'm not terrifically happy that source code that previously generated code that ran just fine would now generate code that will simply crash with no explanation. You have to remember that there was once a time when writef() and family did not exist, and that printf("%.*s") was the /recommended/ way of doing things, documented as such on the Digital Mars site. So you cannot blame anyone for using that idiom, say, three or four years ago. As for not portable, I'm surprised at that. I thought the C printf() function was fairly standardised? Ah well, it doesn't matter. I'm sure everyone here agrees that D folk should not use printf anyway, now that we've got better things. I'd be all in favor making any use of printf a compile error - say by making a global function which takes no parameters (forcing coders to explicity qualify the function name if they wanted to C printf). Note that this reasoning also applies to vprintf, fprintf, vfprintf, sprintf, snprintf, etc., etc...
Oct 20 2007
next sibling parent Daniel Keep <daniel.keep.lists gmail.com> writes:
Janice Caron wrote:
 On 10/20/07, Oskar Linde <oskar.lindeREM ovegmail.com> wrote:
 1) Passing dynamic arrays to printf as in:

     printf("my string is %*.s\n", str);


I'd be happy for it to break if the break could be detected at compile time (or even at runtime with an assert). I'm not terrifically happy that source code that previously generated code that ran just fine would now generate code that will simply crash with no explanation.

You're calling a function that completely bypasses the type system, and you want compile-time checking on it? :S If you're going to play with fire (type-unsafe variadic C functions), you shouldn't be surprised when you get third-degree burns. ;)
 You have to remember that there was once a time when writef() and
 family did not exist, and that printf("%.*s") was the /recommended/
 way of doing things, documented as such on the Digital Mars site. So
 you cannot blame anyone for using that idiom, say, three or four years
 ago.

This is true, but remember that this is for D 2.0. If someone honestly expects to be able to take code written three years ago when the language was still in flux (and which, mind you, likely won't even compile with the current 1.0 compiler) and compile it on 2.0 without issues is, in my opinion, delusional.
 I'd be all in favor making any use of printf a compile error - say by
 making a global function which takes no parameters (forcing coders to
 explicity qualify the function name if they wanted to C printf). Note
 that this reasoning also applies to vprintf, fprintf, vfprintf,
 sprintf, snprintf, etc., etc...

I think that removing it from object so that it can't be used without explicitly importing it would be sufficient. Maybe sticking a deprecated attribute on the printf declaration would also help. Maybe with a message saying "No, you want writef. Seriously." -- Daniel
Oct 20 2007
prev sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Janice Caron wrote:
 On 10/20/07, Oskar Linde <oskar.lindeREM ovegmail.com> wrote:
 1) Passing dynamic arrays to printf as in:

     printf("my string is %*.s\n", str);


I'd be happy for it to break if the break could be detected at compile time (or even at runtime with an assert). I'm not terrifically happy that source code that previously generated code that ran just fine would now generate code that will simply crash with no explanation.

I guess there could be a warning (or error) passing a D array as a C vararg argument.
 You have to remember that there was once a time when writef() and
 family did not exist, and that printf("%.*s") was the /recommended/
 way of doing things, documented as such on the Digital Mars site. So
 you cannot blame anyone for using that idiom, say, three or four years
 ago.

I am well aware of that, and I dont blame anyone for using the idiom. I merely blame the recommendation. ;)
 As for not portable, I'm surprised at that. I thought the C printf()
 function was fairly standardised? Ah well, it doesn't matter. I'm sure
 everyone here agrees that D folk should not use printf anyway, now
 that we've got better things.

Printf is standardized. That is not the problem. The problem is that the above idiom is a hack relying on a specific stack layout when passing vararg arguments to C functions. -- Oskar
Oct 20 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Daniel Keep <daniel.keep.lists gmail.com> wrote:
 This is true, but remember that this is for D 2.0.  If someone honestly
 expects to be able to take code written three years ago when the
 language was still in flux (and which, mind you, likely won't even
 compile with the current 1.0 compiler) and compile it on 2.0 without
 issues is, in my opinion, delusional.

I would expect it to fail to compile, and I would also expect to be able to use the compiler errors (or in extreme cases, runtime asserts) to tell me what needs to be changed. What I don't expect is no complaints at all, just a runtime, hard to debug, crash. (That's especially important if you're using code you didn't write yourself).
 think that removing it from object so that it can't be used without
explicitly importing it would be sufficient.

 Maybe sticking a deprecated attribute on the printf declaration would
 also help.  Maybe with a message saying "No, you want writef.  Seriously."

Both are excellent ideas.
Oct 20 2007
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Oskar Linde wrote:
 Two related things that have been holding me off D2 are waiting for some 
 announced code breaking changes that will have quite some impact on how 
 code is written (not mentioning const):
 
 1. T[new]
  - Hopefully T[*] (hint ;) )
  - Hopefully a reference type
 
 2. Static arrays becoming value types
  - Being able to return them from functions
  - No longer having to litter the code with incompatible wrapper structs
 
 Are the ETA for those still far away?

Yes.
Oct 20 2007
prev sibling next sibling parent reply =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Walter Bright wrote:

 What does this break?
 
 1) Passing dynamic arrays to printf as in:
 
     printf("my string is %*.s\n", str);

I think you meant to write printf("my string is %.*s\n", str); Which is something of an argument against printf in itself :-)
 which relied on the under-the-hood representation. This doesn't work on 
 some architectures anyway, and is thoroughly obsolete. One could quickly 
 fix such code by writing it as:
 
     printf("my string is %*.s\n", str.length, str.ptr);

Yes, but that change had to be done sooner or later anyway... Same as with the cast from char[] to ulong, dirty hacks both.
 So, what do you think?

Good riddance. --anders
Oct 20 2007
parent Walter Bright <newshound1 digitalmars.com> writes:
Anders F Björklund wrote:
 Walter Bright wrote:
 
 What does this break?

 1) Passing dynamic arrays to printf as in:

     printf("my string is %*.s\n", str);

I think you meant to write printf("my string is %.*s\n", str); Which is something of an argument against printf in itself :-)

Yes.
 
 which relied on the under-the-hood representation. This doesn't work 
 on some architectures anyway, and is thoroughly obsolete. One could 
 quickly fix such code by writing it as:

     printf("my string is %*.s\n", str.length, str.ptr);

Yes, but that change had to be done sooner or later anyway... Same as with the cast from char[] to ulong, dirty hacks both.
 So, what do you think?

Good riddance.

LOL!
Oct 20 2007
prev sibling next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Walter Bright <newshound1 digitalmars.com> wrote:

 2) Doesn't break array.length as rvalue, as this is rewritten by the
 compiler as (array.end - array.start).

That's pointer arithmetic, of course, so looked at in pure assembler terms, we're talking ((array.end - array.start) / element.sizeof) Is there any performance cost to the divide, or is optimised to be a right shift (at least where it can be)?
Oct 20 2007
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Janice Caron wrote:
 On 10/20/07, Walter Bright <newshound1 digitalmars.com> wrote:
 
 2) Doesn't break array.length as rvalue, as this is rewritten by the
 compiler as (array.end - array.start).

That's pointer arithmetic, of course, so looked at in pure assembler terms, we're talking ((array.end - array.start) / element.sizeof) Is there any performance cost to the divide, or is optimised to be a right shift (at least where it can be)?

Divides are expensive. It's mitigated by: 1) most of the time, it's 1 2) most of the rest of the time, it's a power of 2 and optimized to shift 3) some of the rest of the time, it can be skipped as it divides then multiples by the same amount.
Oct 20 2007
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Janice Caron wrote:
 On 10/20/07, Walter Bright <newshound1 digitalmars.com> wrote:
 
 2) Doesn't break array.length as rvalue, as this is rewritten by the
 compiler as (array.end - array.start).

That's pointer arithmetic, of course, so looked at in pure assembler terms, we're talking ((array.end - array.start) / element.sizeof) Is there any performance cost to the divide, or is optimised to be a right shift (at least where it can be)?

Divides are expensive. It's mitigated by: 1) most of the time, it's 1 2) most of the rest of the time, it's a power of 2 and optimized to shift 3) some of the rest of the time, it can be skipped as it divides then multiples by the same amount.
Oct 20 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Walter Bright <newshound1 digitalmars.com> wrote:
     int[] s = new int[4];
     int[] t = new int[1000];

 now t.begin might end up being equal to s.end. But now when we do

     t = null;

 or t goes out of scope or we otherwise assign t, suddenly that array
 can no longer be freed, because the end pointer of s points to it.

No prob, t=null already sets both members to 0.

Both members of t, yes, but it won't stop s.end from pointing to t's allocated memory. That said, someone posted that you always allocate an extra byte. If that's true, problem solved.
Oct 20 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Walter Bright <newshound1 digitalmars.com> wrote:
 What does this break?

 3) It breaks binary compatibility with libraries already compiled.

It also breaks binary compatibility with object files already compiled. I don't use dsss do I don't know how that will deal, but anyone using a makefile or some other build system needs to make sure that .obj files are dependent, not only on the correstponding .d file, but also on dmd.exe (or platform equivalent), so that the rebuild automatically updates the .obj file when dmd changes. That's not Walter's problem of course - it's just something of which developers need to be aware. Even if your build script does not have this dependency built in, a simple one-time "clean" build will fix all.
Oct 20 2007
prev sibling next sibling parent "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Sat, 20 Oct 2007 12:15:14 +0300, Janice Caron <caron800 googlemail.com>
wrote:

 On 10/20/07, Walter Bright <newshound1 digitalmars.com> wrote:
 What does this break?

 3) It breaks binary compatibility with libraries already compiled.

It also breaks binary compatibility with object files already compiled. I don't use dsss do I don't know how that will deal, but anyone using a makefile or some other build system needs to make sure that .obj files are dependent, not only on the correstponding .d file, but also on dmd.exe (or platform equivalent), so that the rebuild automatically updates the .obj file when dmd changes.

I guess that could be fixed by changing the name mangling for dynamic arrays. Then the users will be forced to rebuild their apps (unless they pass dynamic arrays hidden in void pointers or arrays). -- Best regards, Vladimir mailto:thecybershadow gmail.com
Oct 20 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Jascha Wetzel <firstname mainia.de> wrote:
 Walter Bright wrote:
 Janice Caron wrote:
 That said, someone posted that you always allocate an extra byte. If
 that's true, problem solved.

Yes, an extra byte is allocated for just this reason.

couldn't the end pointer just point to the last element?

Oooh yuk. How horrible. I'd rather that didn't happen. For one thing, it means that empty arrays (and empty collection ranges generally) would have to have a different representation. No, I like the "standard" way of doing things, whereby the end iterator refences the last+1 element (and therefore cannot be dereferenced).
Oct 20 2007
prev sibling next sibling parent Jascha Wetzel <firstname mainia.de> writes:
Walter Bright wrote:
 What does this break?

Since this is a different ABI, the debug symbols need to indicate which version is used. For CodeView i suggest using the language and/or version fields in the S_COMPILE symbol, which aren't used by DMD, atm.
Oct 20 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Jascha Wetzel <firstname mainia.de> wrote:
 one can argue, that this way, empty arrays have an unambiguous
 representation (they are always null).

I will argue below that this is not a good thing. Suppose I write a parser function which (say) parses an integer. It would have a signature like bool parseInt(ref string s, out int n) The idea is that if the function returns true, then the parsed int is returned in n, and s's pointer is advanced past the bytes consumed. However, I don't want s.ptr to be set to null, even if we consume all the bytes. For example, I might want to do a pointer subtraction later on. Or I might want to be able to extract the pointer because s was merely a slice of a larger string. For all sorts of reasons, allowing empty strings to have an address does make sense.
 there is enough code like
 if ( str is null || str == "" )

I always write if (str.length == 0) to test for empty strings (and arrays generally). That works just fine, always.
Oct 20 2007
prev sibling next sibling parent David Brown <dlang davidb.org> writes:
On Sat, Oct 20, 2007 at 01:16:54AM -0700, Walter Bright wrote:

 Divides are expensive. It's mitigated by:
 1) most of the time, it's 1
 2) most of the rest of the time, it's a power of 2 and optimized to shift
 3) some of the rest of the time, it can be skipped as it divides then
 multiples by the same amount.

At least with the gcc backend, divides by constants almost always turn into multiplies of strange constants, or weird tricks with shifts and subtracts/adds, depending on what the compiler things will be better. David
Oct 20 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Jascha Wetzel <firstname mainia.de> wrote:
 Yes, an extra byte is allocated for just this reason.

couldn't the end pointer just point to the last element? iteration could use <=

Ooh - just noticed this. The test is normally (p != end). You can't use <= with iterators in general because, if you did, the iterators would have to implement opCmp(), which would be an O(N) operation in the case of linked lists. != is O(1). With plain old arrays, for (p=start; p<=end; ++p) would fail with empty arrays if the empty array was represented as [ null, null ], because (null <= null) is true, so the for body would get executed once with p == null. (!) So there are /lots/ of reasons why end has to point /beyond/ the last byte.
Oct 20 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
Walter, I think we might want a new property, isEmpty, for arrays (and
collections in general).

See, we can't test for (array == null), because as you know, a
non-null array can still be empty.

The test (array.length == 0) will still work, but since calculating
.length might involve a divide, it's no longer guaranteed to be
optimal. (Also, I can imagine collection in which .length might be
O(N)).

We could test for (array.start == array.end). In fact, that's probably
the best approach, given what you've told us so far. But is there a
guarantee that that would necessarily be the most efficient test for
collections in general? I can imagine collections in which .end might
involve a non-trivial calculation.

.isEmpty would be clean, simple, and obvious.
Oct 20 2007
prev sibling next sibling parent Witold Baryluk <baryluk smp.if.uj.edu.pl> writes:
Dnia Fri, 19 Oct 2007 20:03:02 -0700
Walter Bright <newshound1 digitalmars.com> napisa=B3/a:

 Currently, arrays are represented under the hood as:
=20
 	size_t lengthOfArray;
 	void* ptrToStartOfArray;
=20
 Which works out reasonably well. The problem is if you want to use
 array types as the basis of iterators, and you want to step through
 the array. There's no escaping it being two operations:
=20
 	decrement the length
 	increment the pointer
=20
 This puts a brick in any fast implementation of iterators. To fix
 that, we can change the representation to:
=20
 	void* ptrToStartOfArray;
 	void* ptrPastEndOfArray;
=20
 Then there's just one increment. Some tests show this can improve
 loop performance by up to 70%.

What about garbage collector? And pointer that doesn't points to real data? If this will be fixed somehow and there will be strategy for moving collector to change this obscure second pointer to be proper, for new location of data, then it can be good choice. (especially if this change of iterator really will increse iteration, and wouldn't give regressions). --=20 Witold Baryluk, aleph0
Oct 20 2007
prev sibling next sibling parent David Brown <dlang davidb.org> writes:
On Sat, Oct 20, 2007 at 04:44:06PM +0100, Janice Caron wrote:

The test (array.length == 0) will still work, but since calculating
.length might involve a divide, it's no longer guaranteed to be
optimal. (Also, I can imagine collection in which .length might be
O(N)).

The compiler should never generate an actual division for a compile-time constant. For any constant division, there is an equivalent multiplication that will give the same result. Most modern architectures do multiply quickly. It is more work than strictly necessary to determine if the length is zero. However, (array.length == 0) is a real easy thing for an optimizer to see. It certainly doesn't justify adding a special construct to the language for it. It would be easier to just fix the optimizer if it misses it. David
Oct 20 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, David Brown <dlang davidb.org> wrote:
 On Sat, Oct 20, 2007 at 04:44:06PM +0100, Janice Caron wrote:

 For any constant division, there is an equivalent multiplication
 that will give the same result.

I hadn't thought about that before, but yes, you're right. The rule is, if two numbers a and b are relatively prime (that is to say: gcd(a,b)=1) there is exactly one number c=inv(a) mod b so that c*a mod b=1. So for what you suggest, b would have to be 2^32 (that is, uint.max+1) and a the number you want to divide by. In order to be coprime with b, a may /not/ be even, but that's not a problem because we can just do shifts until it isn't. So, for example, to divide by 7, you can just multiply by invmod(7,0x100000000). To divide by 14, you'd right-shift first, and then divide by 7. Cool! I wonder if D actually does things this way?
 However, (array.length == 0) is a real easy thing for an optimizer to see.
 It certainly doesn't justify adding a special construct to the language for
 it.

I was thinking ahead. I was assuming we'd want the same semantics for all collections, not just arrays. C++ STL has an empty() function as well as a size() function for exactly that reason. For some collections, calculating .length() will be an O(N) operation, so for those collections, an isEmpty() functions would make sense. I agree it's less necessary for a plain old array, but it would help template metaprogramming enormously if something like isEmpty was available, so you wouldn't need to know what kind of collection you were dealing with. That said, it's easy enough to add as a standalone function if it's not built in.
Oct 20 2007
prev sibling next sibling parent Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
 Currently, arrays are represented under the hood as:
 
     size_t lengthOfArray;
     void* ptrToStartOfArray;
 
 Which works out reasonably well. The problem is if you want to use array 
 types as the basis of iterators, and you want to step through the array. 
 There's no escaping it being two operations:
 
     decrement the length
     increment the pointer
 
 This puts a brick in any fast implementation of iterators. To fix that, 
 we can change the representation to:
 
     void* ptrToStartOfArray;
     void* ptrPastEndOfArray;
 
 Then there's just one increment. Some tests show this can improve loop 
 performance by up to 70%.
 
 So, what does this not break?
 
 1) Doesn't break array.ptr, this will still work.
 2) Doesn't break array.length as rvalue, as this is rewritten by the 
 compiler as (array.end - array.start).
 3) Doesn't break array.length as an lvalue, as that is handled by the 
 runtime library anyway.
 4) Won't break anything on D 1.0, as it wouldn't get this change.
 5) Won't break array slices, or any of that stuff we love about D arrays.
 
 What does this break?
 
 1) Passing dynamic arrays to printf as in:
 
     printf("my string is %*.s\n", str);
 
 which relied on the under-the-hood representation. This doesn't work on 
 some architectures anyway, and is thoroughly obsolete. One could quickly 
 fix such code by writing it as:
 
     printf("my string is %*.s\n", str.length, str.ptr);

This isn't a compelling reason to keep the current implementation. As Anders pointed out, it doesn't work on 64-bit anyway.
 2) It breaks the internal library support code, but that's my problem.

Yeah, no big deal here either.
 3) It breaks binary compatibility with libraries already compiled. But 
 we expect to break binary compatibility with D 2.0.

Right.
 4) It breaks things like cast(ulong)str, if one was crazy enough to do 
 that anyway.

Though I know the runtime does this, I've never seen any other code that does :-) And the runtime changes are easily fixed anyway. Heck, GDC already has this implemented.
 5) It breaks anything that tries to look at the underlying 
 representation of dynamic arrays - but such code should be rewritten to 
 use .ptr and .length anyway, or slice notation.

Not a compelling reason to keep the old representation, as this is really an implementation detail. My only potential concern would be the impact on slice operations and such, but addition and subtraction are so fast that this change shouldn't affect anything. Also, such operations are rarely, if ever, performed in tight loops, unlike iterator operations. vote++ Sean
Oct 20 2007
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 10/20/07, Bill Baxter <dnewsgroup billbaxter.com> wrote:
 Not quite true.  if(str){} tests only the .ptr part and ignores the
 length

Ooh, that's nasty. I've never tried that. Seems to me that either (a) if (array) shouldn't work - ie. arrays shouldn't implicitly cast to bool, or else (b) if (array) should be true iff the array is non-empty (which is not the same thing as non-null). I guess currently it tests whether or not the array is null - that is, unassigned or explicitly assigned with null. I'm not sure why that's useful.
Oct 20 2007
prev sibling next sibling parent Knud Soerensen <4tuu4k002 sneakemail.com> writes:
Walter Bright wrote:
 Then there's just one increment. Some tests show this can improve loop
 performance by up to 70%.
 

I say go for it! I think it is important to break code as soon as possible. Waiting to break code will only result in more code being broken at a later time.
Oct 20 2007
prev sibling next sibling parent David Brown <dlang davidb.org> writes:
On Sat, Oct 20, 2007 at 06:12:07PM +0100, Janice Caron wrote:

I wonder if D actually does things this way?

The gcc backend certainly does. I don't know about dmd. I don't have it handy to check, but it wouldn't be hard to disassemble the output and look. David
Oct 20 2007
prev sibling next sibling parent Derek Parnell <derek psych.ward> writes:
On Fri, 19 Oct 2007 20:03:02 -0700, Walter Bright wrote:
 So, what do you think?

My first concern was for the speed of accessing the length, either through .length property or the $ concept. I've examined my usage of .length and about 80% of the time I'm comparing it to zero ... if (X.length == 0) if (X.length > 0) About another 10% is comparing the length of one array to another array ... if (X.length == Y.length) Most of the rest is extracting right hand slices (substrings typically) ... result = X[$ - a .. $ - b] Only occasionally do I set the .length Most of my array iterations use the foreach() construct. So it seems that it would be beneficial change for me. It certainly won't break any of my code because I don't try to access the ABI directly, only through .ptr and .length. My test for an empty array is to compare the length. My test for an unassigned array is to test the .ptr value to null. So I think that a good built-in property would be an .isEmpty concept. -- Derek Parnell Melbourne, Australia skype: derek.j.parnell
Oct 21 2007
prev sibling next sibling parent Don Clugston <dac nospam.com.au> writes:
Walter Bright wrote:
 Currently, arrays are represented under the hood as:
 
     size_t lengthOfArray;
     void* ptrToStartOfArray;
 
 Which works out reasonably well. The problem is if you want to use array 
 types as the basis of iterators, and you want to step through the array. 
 There's no escaping it being two operations:
 
     decrement the length
     increment the pointer
 
 This puts a brick in any fast implementation of iterators. To fix that, 
 we can change the representation to:
 
     void* ptrToStartOfArray;
     void* ptrPastEndOfArray;

This seems to be a more powerful construction than .ptr + .length. If you know the length, it's usually trivial to calculate the end, but the converse is not necessarily true. My first reaction is, does it generalise to multi-dimensional arrays? (I'm hoping we'll still get them eventually, though not necessarily as built-in types). Haven't yet thought through it properly, but I think it's probably _better_. It means that a 'strided slice' (eg, every 4th element) could be implemented without mucking around with the length element. It's easy to test if two such arrays overlap. Many operations can be done entirely at compile time, simply using a cast. votes++; I'm less sure about the use of slices as iterators though.
Oct 21 2007
prev sibling next sibling parent reply Graham St Jack <grahams acres.com.au> writes:
If we are considering changing the internal representation of dynamic 
arrays, maybe we should also consider adding a pointer to "end of 
allocation", making it three pointers in total.

The extra pointer could be used by the concatenation operators to 
dramatically reduce the number of reallocs that occur when doing ~= 
appends to a dynamic array - by only realloc-ing when capacity runs out, 
and allocating more capacity than immediately needed.

I currently have to use a more heavy-weight container class (like a 
Vector) if I don't want to pay the performance cost of repeated 
reallocs, and it seems to me that cranking up the "out of the box" 
performance of dynamic arrays when concatenating would be good.
Oct 21 2007
parent reply David Brown <dlang davidb.org> writes:
On Mon, Oct 22, 2007 at 09:55:12AM +0930, Graham St Jack wrote:

 The extra pointer could be used by the concatenation operators to 
 dramatically reduce the number of reallocs that occur when doing ~= appends 
 to a dynamic array - by only realloc-ing when capacity runs out, and 
 allocating more capacity than immediately needed.

D already makes use of this information in the GC. In fact, allocations are always powers of 2, up until the page size, and then in full pages. Try the following: import std.stdio; import std.gc; void main () { char[] text; for (int i = 0; i < 256; i++) { text ~= 'a'; writefln ("len = %d, alen = %d", text.length, capacity (text.ptr)); } }
Oct 21 2007
parent Graham St Jack <grahams acres.com.au> writes:
David Brown wrote:
 On Mon, Oct 22, 2007 at 09:55:12AM +0930, Graham St Jack wrote:
 
 The extra pointer could be used by the concatenation operators to 
 dramatically reduce the number of reallocs that occur when doing ~= 
 appends to a dynamic array - by only realloc-ing when capacity runs 
 out, and allocating more capacity than immediately needed.

D already makes use of this information in the GC. In fact, allocations are always powers of 2, up until the page size, and then in full pages. Try the following: import std.stdio; import std.gc; void main () { char[] text; for (int i = 0; i < 256; i++) { text ~= 'a'; writefln ("len = %d, alen = %d", text.length, capacity (text.ptr)); } }

Cool - thanks.
Oct 21 2007
prev sibling next sibling parent "Kris" <foo bar.com> writes:
Can anyone explain why the compiler/iterators/foreach would insist on using 
the two operations Walter notes below (decrement the length, increment the 
pointer) instead of first converting to a pointer pair?

Seem like this is a concern that has more than one way to approach it?

- kris



"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:ffbr56$r40$1 digitalmars.com...
 Currently, arrays are represented under the hood as:

 size_t lengthOfArray;
 void* ptrToStartOfArray;

 Which works out reasonably well. The problem is if you want to use array 
 types as the basis of iterators, and you want to step through the array. 
 There's no escaping it being two operations:

 decrement the length
 increment the pointer

 This puts a brick in any fast implementation of iterators. To fix that, we 
 can change the representation to:

 void* ptrToStartOfArray;
 void* ptrPastEndOfArray;

 Then there's just one increment. Some tests show this can improve loop 
 performance by up to 70%.

 So, what does this not break?

 1) Doesn't break array.ptr, this will still work.
 2) Doesn't break array.length as rvalue, as this is rewritten by the 
 compiler as (array.end - array.start).
 3) Doesn't break array.length as an lvalue, as that is handled by the 
 runtime library anyway.
 4) Won't break anything on D 1.0, as it wouldn't get this change.
 5) Won't break array slices, or any of that stuff we love about D arrays.

 What does this break?

 1) Passing dynamic arrays to printf as in:

 printf("my string is %*.s\n", str);

 which relied on the under-the-hood representation. This doesn't work on 
 some architectures anyway, and is thoroughly obsolete. One could quickly 
 fix such code by writing it as:

 printf("my string is %*.s\n", str.length, str.ptr);

 2) It breaks the internal library support code, but that's my problem.

 3) It breaks binary compatibility with libraries already compiled. But we 
 expect to break binary compatibility with D 2.0.

 4) It breaks things like cast(ulong)str, if one was crazy enough to do 
 that anyway.

 5) It breaks anything that tries to look at the underlying representation 
 of dynamic arrays - but such code should be rewritten to use .ptr and 
 .length anyway, or slice notation.

 So, what do you think? 

Oct 21 2007
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Walter Bright" wrote
 2) Doesn't break array.length as rvalue, as this is rewritten by the 
 compiler as (array.end - array.start).

Yes, but checking the length (which is something I do often in my code) now becomes an operation rather than a memory load. So you are swapping a performance hit in one place with one in another.
 What does this break?

 1) Passing dynamic arrays to printf as in:

 printf("my string is %*.s\n", str);

Absolutely could not care less :)
 So, what do you think?

I think this change is unnecessary for iterators, why do we need to change the array representation? An iterator is a pointer in a sequence of values. It does not need to know where in the sequence of values it is, so why would you represent an iterator with 2 pointers? If you make the case that it's for bounds checking, then you really need 3 pointers to check for reverse iterating past the beginning. I'll also point out that you are now adding the cost of copying two values when passing an iterator around instead of one pointer, so using pointers will be faster anyways. Why not have an iterator type, defined as the property of an array, i.e.: char[].iterator is a type, which is aliased to char *. Then you could have properties like: char[].iterator begin() { return ptr} char[].iterator end() { return ptr + length} This is very similar to C++, and makes the most sense to me, and provides the most speed out of anything. You could do something similar with AA's, but the type would have to be a struct probably. It wouldn't even pollute the namespace/require keywords because they are properties of the arrays themselves. -Steve
Oct 22 2007
parent "Kris" <foo bar.com> writes:
Yeah, I'm with Steve on this. There's more than one way to skin this cat, 
and have solid performance all round instead of potentially trading one for 
another.


"Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
news:ffij0u$290f$1 digitalmars.com...
 "Walter Bright" wrote
 2) Doesn't break array.length as rvalue, as this is rewritten by the 
 compiler as (array.end - array.start).

Yes, but checking the length (which is something I do often in my code) now becomes an operation rather than a memory load. So you are swapping a performance hit in one place with one in another.
 What does this break?

 1) Passing dynamic arrays to printf as in:

 printf("my string is %*.s\n", str);

Absolutely could not care less :)
 So, what do you think?

I think this change is unnecessary for iterators, why do we need to change the array representation? An iterator is a pointer in a sequence of values. It does not need to know where in the sequence of values it is, so why would you represent an iterator with 2 pointers? If you make the case that it's for bounds checking, then you really need 3 pointers to check for reverse iterating past the beginning. I'll also point out that you are now adding the cost of copying two values when passing an iterator around instead of one pointer, so using pointers will be faster anyways. Why not have an iterator type, defined as the property of an array, i.e.: char[].iterator is a type, which is aliased to char *. Then you could have properties like: char[].iterator begin() { return ptr} char[].iterator end() { return ptr + length} This is very similar to C++, and makes the most sense to me, and provides the most speed out of anything. You could do something similar with AA's, but the type would have to be a struct probably. It wouldn't even pollute the namespace/require keywords because they are properties of the arrays themselves. -Steve

Oct 22 2007