
digitalmars.D - Array append performance

reply bearophile <bearophileHUGS lycos.com> writes:
I have performed another small benchmark of dynamic array append, similar to one
I have run in the past: a comparison between C++'s push_back and D's ~=, both
performed after a "reserve".

To perform the "reserve" in D I have used std.gc.malloc instead of the simpler:
auto a = new int[n];
a.length = 0;
to actually speed up the D code a little, avoiding a (fast but useless) memory
cleaning.

I have used MinGW 4.2.1 and DMD v.1.034.

Compiling the C++ with:
-O3 -s
and the D with:
-O -release -inline

The following results are with n = 100_000_000 (about 400 MB RAM used):
  D:   7.94 s
  C++: 0.60 s  (13.2X)

Note that if n is 10 or even 100 times smaller the situation doesn't change
much:
n = 10_000_000:
  D:   0.83 s
  C++: 0.09 s

Note2: In the D code the final deallocation of the array is fast enough, and
its allocation is fast too (you can see this by timing just the for loop in the
middle of the program). The really slow part seems to be the loop itself.

Note3: If a reserve isn't used (that is, in both programs the arrays aren't
pre-allocated), the difference in performance becomes smaller, about 8.6X with
n = 100_000_000.

This is the asm generated by g++:
http://codepad.org/hXNDH0se

Bye,
bearophile

---------------------------
C++ code:

#include "stdio.h"
#include <vector>

using namespace std;

int main(int argc, char **argv) {
    int n = argc >= 2 ? atoi(argv[1]) : 1000;

    vector<int> a;
    a.reserve(n);
    for (int i = 0; i < n; ++i)
        a.push_back(i);

    printf("%d\n", a[n-1]);

    return 0;
}

---------------------------
D code:

import std.conv: toInt;
import std.gc: malloc, hasNoPointers;
import std.c.stdio: printf;

int main(string[] args) {
    int n = args.length >= 2 ? toInt(args[1]) : 1000;

    // "reserve": allocate the whole block up front, then start from length 0
    int* aux_ptr = cast(int*)malloc(n * int.sizeof);
    hasNoPointers(aux_ptr);  // the block holds no pointers, so the GC needn't scan it
    int[] a = aux_ptr[0 .. 0];

    for (int i; i < n; i++)
        a ~= i;

    printf("%d\n", a[n - 1]);
    return 0;
}
Aug 22 2008
next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
bearophile wrote:
 I have performed another small benchmark of the dynamic array append, similar
to one I have performed in the past, a comparison between the C++ push_back of
C++ and the ~= of D, both performed after a "reserve".
 
 To perform the "reserve" in D I have used std.gc.malloc instead of the simpler:
 auto a = new int[n];
 a.length = 0;
 to actually speed up the D code a little, avoiding a (fast but useless) memory
cleaning.
 
 I have used MinGW 4.2.1 and DMD v.1.034.
 
 Compiling the C++ with:
 -O3 -s
 and the D with:
 -O -release -inline
 
 The following results are with n = 100_000_000 (about 400 MB RAM used):
   D:   7.94 s
   C++: 0.60 s  (13.2X)
 
 Note that if n is 10 or even 100 times smaller the situation doesn't change
much:
 n = 10_000_000:
   D:   0.83 s
   C++: 0.09 s
Unfortunately, this is expected. On every append the runtime checks to see if there is enough room in the buffer for the new data, which requires a call to gc.capacity(). So best case you're paying for a bunch of math, and worst case you have to acquire the GC mutex as well.

That said, this isn't really an entirely fair comparison, since you're comparing the performance of a structured, local user-level object vs. a minimal, language-level feature. I know that D arrays are supposed to be roughly comparable to std::vector or std::string, and they are from a semantics perspective, but when the only local data they contain is a pointer and a length there's little that can be done to optimize certain behavior.

And you've just happened to pinpoint the greatest performance problem with D arrays :-)

Sean
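For illustration, the per-append check described here amounts to something like the following sketch. It is a rough approximation, not the actual runtime source; it uses the D1 std.gc interface and a simple doubling policy.

import std.gc: malloc, capacity;

// What "a ~= x" roughly has to do today: ask the GC how big the block is,
// then either grow the slice in place or move everything to a bigger block.
int[] append(int[] a, int x)
{
    size_t needed = (a.length + 1) * int.sizeof;
    if (a.ptr && capacity(a.ptr) >= needed)
    {
        a = a.ptr[0 .. a.length + 1];            // room left: grow in place
    }
    else
    {
        int* p = cast(int*) malloc(needed * 2);  // no room: new block, copy over
        p[0 .. a.length] = a[0 .. a.length];
        a = p[0 .. a.length + 1];
    }
    a[a.length - 1] = x;
    return a;
}

The point is that the capacity query sits on the hot path of every single ~=, which is where the time in the benchmark goes.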
Aug 22 2008
parent reply dsimcha <dsimcha yahoo.com> writes:
I just ran into this yesterday porting C++ code that used push_back() heavily to D. Ended up rewriting the inner loops to pre-allocate. The D code is now actually faster than the C++ code, but the inner loops are less readable.

For a whopping overhead of 4 extra bytes per array (on 32-bit), why not include the capacity field in the array itself? I think the problem here is the need to call gc.capacity(), which can require a lock, as Sean said. If you have the capacity field embedded in the array itself, you never need a lock for ~= unless the array actually needs to be realloced or is appended to by more than one thread. I've written classes before that need fast appending and therefore cache the capacity value every time the array is reallocated, and it makes it a ton faster even in single threaded code.

On the other hand, this breaks the ABI for D arrays, but D2 breaks a lot of backwards compatibility anyhow, so maybe this is still a good idea. Does anyone actually write C functions that rely on this ABI and then call them from D? Furthermore, having a capacity field would also allow reserve for builtin arrays.

Other than the 4 bytes of extra overhead, does anyone see any downside to including a capacity field? Also, is there some situation I'm not thinking of where the 4 bytes actually matter more than the speed gains in question?
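As a layout sketch of what is being proposed here (hypothetical, not the actual D array ABI):

// Today a dynamic array reference is just { size_t length; T* ptr; }.
// The proposal adds one more word:
struct FatArray(T)
{
    size_t length;
    T*     ptr;
    size_t capacity;   // elements the block can hold before a reallocation is needed
}
// With this, "a ~= x" only needs to talk to the GC (and take its lock) when
// length >= capacity; every other append is a store and a length increment.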
Aug 22 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
First of all thank you very much Sean Kelly for your clear explanation of the
situation.

dsimcha:
 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
In 2D/nD arrays you repeat such a size_t many times, so it's not just 4/8 bytes. (This particular problem can be solved with rectangular/equal-length dynamic arrays, but it may be too much for a built-in type...)

Are dynamic array appends common? I think they may be used often enough to justify such a memory increase.

I think an experimental approach here may be the best way to answer such a question. I mean creating a branch version of DMD where dynamic arrays have a 3-item long defining struct (12/24 bytes) and using it to benchmark some real programs or benchmarks, to see how the performance in memory/speed of the compiler changes. How much time/work does it require to create a DMD with such a change?

Note that a similar branched DMD can be useful to test the performance of another change Walter was talking about regarding array structs: changing the len-ptr struct of the arrays to a startptr-stopptr struct.

Bye,
bearophile
Aug 22 2008
parent "Robert Jacques" <sandford jhu.edu> writes:
On Fri, 22 Aug 2008 22:23:04 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:
 Note that a similar branched DMD can be useful to test the performance  
 of another change Walter was talking regarding array structs: changing  
 the len-ptr struct of the arrays to a startptr-stopptr struct.
On a similar note, there's also fast strided arrays. (See my blog:
http://dobbscodetalk.com/index.php?option=com_myblog&show=Making-strided-arrays-as-fast-as-native-arrays.html&Itemid=29 )
Aug 22 2008
prev sibling next sibling parent reply Jerry Quinn <jlquinn optonline.net> writes:
dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used push_back() heavily
to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable.  For
a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not include the
 
 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
For starters, you need 8 bytes on 64-bit systems. I suspect a more important reason might be making slice operations more difficult to implement correctly.

That said, I think I still favor it, since I find I need to do iterative appends to arrays fairly often. Otherwise, it noticeably restricts the places I can use built-in arrays.

An alternative might be a convenient simple library wrapper that provides these kinds of dynamic operations, while still allowing easy access to the underlying array. I'm thinking perhaps std.array might include an appender object that you create, append to, and then grab the underlying array from once you're done using it. It's not hard to write, but I'm sure it would be a welcome addition to the library, even if core arrays don't give us this.

BTW, std.array seems to be missing documentation.
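A minimal sketch of the kind of appender described above (hypothetical code, not an existing std.array type): it caches the capacity of its buffer so ~= rarely has to query the GC, and hands the built-in array back when you're done.

import std.gc: malloc, capacity, hasNoPointers;

struct Appender(T)
{
    private T[] buf;       // the slice currently in use
    private size_t cap;    // cached capacity of buf's block, in elements

    void opCatAssign(T x)  // app ~= x
    {
        if (buf.length == cap)
            grow();
        buf = buf.ptr[0 .. buf.length + 1];
        buf[buf.length - 1] = x;
    }

    T[] get()              // grab the underlying array when done
    {
        return buf;
    }

    private void grow()
    {
        size_t newCap = cap ? cap * 2 : 16;
        T* p = cast(T*) malloc(newCap * T.sizeof);
        hasNoPointers(p);  // fine for plain value types like int; skip for types holding references
        p[0 .. buf.length] = buf[0 .. buf.length];
        buf = p[0 .. buf.length];
        cap = capacity(p) / T.sizeof;  // the GC may hand back a bigger block than requested
    }
}

// Usage:
//   Appender!(int) app;
//   for (int i; i < n; i++)
//       app ~= i;
//   int[] a = app.get();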
Aug 22 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Jerry Quinn:
I suspect a more important reason might be making slice operations more
difficult to implement correctly.<
Can you explain why?

Bye,
bearophile
Aug 22 2008
prev sibling parent reply superdan <super dan.org> writes:
Jerry Quinn Wrote:

 dsimcha Wrote:
 
 I just ran into this yesterday porting C++ code that used push_back() heavily
to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable.  For
a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not include the
 
 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be making slice operations more difficult to implement correctly. That said, I think I still favor it, since I find I need to do iterative appends to arrays fairly often. Otherwise, it noticably restricts the places I can use built-in arrays.
sorry to always be the one who puts the moose on the table. y'all who want to make capacity a field in the array are off base. today each slice has length and data. why. because all a slice is *is* length and data. it's a view of a portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. can't be a field just before the actual data in the slice either. for obvious reasons.
 An alternative might be a convenient simple library wrapper that provides
these kinds of dynamic operations, while still allowing easy access to the
underlying array.  I'm thinking prehaps std.array might include an appender
object that you create, append to, and then grab the underlying array from,
once you're done using it.  It's not hard to write, but I'm sure it would be a
welcome addition to the library, even if core arrays don't give us this.
that makes more sense. a sort of in-memory stream thing.
Aug 22 2008
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from superdan (super dan.org)'s article
 sorry to always be the one who puts the moose on the table. y'all who want to
make capacity a field in the array are off base.
 today each slice has length and data. why. because all a slice is *is* length
and data. it's a view of a portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice.

A capacity field would work fine with slicing if, whenever you take a slice into an array, the capacity field of the slice is set to zero implicitly, even though the length is longer than zero. This is in effect how it works now. Try calling std.gc.capacity() on a pointer to a slice. It will always return 0 b/c a slice doesn't point to the start of a block. This means that if you try to increase the size of a slice it will always be reallocated. Simply setting the capacity field of slices to 0 when they are created would replicate this in arrays w/ a capacity field.
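In code, the rule described above might look like this (hypothetical, reusing the FatArray layout sketched earlier in the thread):

// How taking a slice would interact with a capacity field.
FatArray!(int) slice(FatArray!(int) a, size_t lo, size_t hi)
{
    FatArray!(int) s;
    s.ptr      = a.ptr + lo;
    s.length   = hi - lo;
    s.capacity = 0;   // no spare room: appending to the slice always reallocates,
                      // mirroring std.gc.capacity() == 0 for interior pointers today
    return s;
}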
Aug 22 2008
parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 == Quote from superdan (super dan.org)'s article
 sorry to always be the one who puts the moose on the table. y'all who want to
make capacity a field in the array are off base.
 today each slice has length and data. why. because all a slice is *is* length
and data. it's a view of a portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. A capacity field would work fine with slicing if, whenever you take a slice into an array, the capacity field of the slice is set to zero implicitly, even though the length is longer than zero. This is in effect how it works now. Try calling std.gc.capacity() on a pointer to a slice. It will always return 0 b/c a slice doesn't point to the start of a block. This means that if you try to increase the size of a slice it will always be reallocated. Simply setting the capacity field of slices to 0 when they are created would replicate this in arrays w/ a capacity field.
The Tango runtime is a bit different in that it checks to see if the array is a slice explicitly--it seems safer than relying on the GC to always return a zero capacity for an interior pointer. Sean
Aug 23 2008
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from superdan (super dan.org)'s article
 Jerry Quinn Wrote:
 dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used push_back() heavily
to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable.  For
a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not include the

 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be
making slice operations more difficult to implement correctly.
 That said, I think I still favor it, since I find I need to do iterative
appends to arrays fairly often.
Otherwise, it noticably restricts the places I can use built-in arrays.
 sorry to always be the one who puts the moose on the table. y'all who want to
make capacity a field in
the array are off base.
 today each slice has length and data. why. because all a slice is *is* length
and data. it's a view of a
portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice.

I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.

Sean
Aug 23 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Sean Kelly" wrote
 == Quote from superdan (super dan.org)'s article
 Jerry Quinn Wrote:
 dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used push_back() 
 heavily to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code 
 is now
 actually faster than the C++ code, but the inner loops are less 
 readable.  For a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not 
 include the

 Other than the 4 bytes of extra overhead, does anyone see any 
 downside to
 including a capacity field?  Also, is there some situation I'm not 
 thinking of
 where the 4 bytes actually matter more than the speed gains in 
 question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be
making slice operations more difficult to implement correctly.
 That said, I think I still favor it, since I find I need to do 
 iterative appends to arrays fairly often.
Otherwise, it noticably restricts the places I can use built-in arrays.
 sorry to always be the one who puts the moose on the table. y'all who 
 want to make capacity a field in
the array are off base.
 today each slice has length and data. why. because all a slice is *is* 
 length and data. it's a view of a
portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.
Unless the slice is at the beginning of an allocated block... -Steve
Aug 23 2008
parent Sean Kelly <sean invisibleduck.org> writes:
Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 == Quote from superdan (super dan.org)'s article
 Jerry Quinn Wrote:
 dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used push_back() 
 heavily to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code 
 is now
 actually faster than the C++ code, but the inner loops are less 
 readable.  For a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not 
 include the

 Other than the 4 bytes of extra overhead, does anyone see any 
 downside to
 including a capacity field?  Also, is there some situation I'm not 
 thinking of
 where the 4 bytes actually matter more than the speed gains in 
 question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be
making slice operations more difficult to implement correctly.
 That said, I think I still favor it, since I find I need to do 
 iterative appends to arrays fairly often.
Otherwise, it noticably restricts the places I can use built-in arrays.
 sorry to always be the one who puts the moose on the table. y'all who 
 want to make capacity a field in
the array are off base.
 today each slice has length and data. why. because all a slice is *is* 
 length and data. it's a view of a
portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.
Unless the slice is at the beginning of an allocated block...
True enough :-) I suppose this is one argument in favor of a capacity field. Sean
Aug 25 2008
prev sibling parent reply superdan <super dan.org> writes:
Sean Kelly Wrote:

 == Quote from superdan (super dan.org)'s article
 Jerry Quinn Wrote:
 dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used push_back() heavily
to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable.  For
a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not include the

 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be
making slice operations more difficult to implement correctly.
 That said, I think I still favor it, since I find I need to do iterative
appends to arrays fairly often.
Otherwise, it noticably restricts the places I can use built-in arrays.
 sorry to always be the one who puts the moose on the table. y'all who want to
make capacity a field in
the array are off base.
 today each slice has length and data. why. because all a slice is *is* length
and data. it's a view of a
portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.
then we have a field hanging in there that's only valid sometimes. that won't quite work. will it be copyable? like when you copy a slice to another. what happens with the capacity field?

at any rate. i find it amused that aside from u nobody even missed a beat on the discussion on adding the capacity. this is pulp fiction alright.

a: "how about them capacity fields. what if we add them."
b: "would be cool."
c: "would suck goat balls."
d: "capacity don't make sense because this and that and the other."
a, b, c: "yeah, but we're still contemplating the ifs."

wake up people. there is first the problem capacity can't be implemented. u gotta figure out some approximation first. then discuss the gorram ifs.
Aug 24 2008
next sibling parent Christopher Wright <dhasenan gmail.com> writes:
superdan wrote:
 wake up people. there is first the problem capacity can't be implemented. u
gotta figure out some approximation first. then discuss the gorram ifs.
Capacity could be implemented by preallocating extra space at the end of an array, but aside from that, you're right; the garbage collector could use memory between the end of the array and the end of its capacity after the array was created, and there's no way you could update all the references to that array and change the stored capacity. As for slices, the capacity of a slice would be equal to its length, so that isn't an issue.
Aug 24 2008
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
superdan wrote:
 Sean Kelly Wrote:
 
 == Quote from superdan (super dan.org)'s article
 Jerry Quinn Wrote:
 dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used push_back() heavily
to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable.  For
a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not include the

 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be
making slice operations more difficult to implement correctly.
 That said, I think I still favor it, since I find I need to do iterative
appends to arrays fairly often.
Otherwise, it noticably restricts the places I can use built-in arrays.
 sorry to always be the one who puts the moose on the table. y'all who want to
make capacity a field in
the array are off base.
 today each slice has length and data. why. because all a slice is *is* length
and data. it's a view of a
portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.
then we have a field hanging in there that's only valid sometimes. that won't quite work. will it be copyable? like when you copy a slice to another. what happens with the capacity field?
That's the weird bit. The capacity of any copy of a slice or an array should be the length of the copy, so copying couldn't be done via memcpy on the array ref. More simply, you could probably just default the capacity to zero, but I don't like the idea of violating the logical invariant for array refs just to make copying / initialization easier.

And the weirdness of this suggests to me (and to you as well I presume) that perhaps adding a capacity field isn't proper for built-in arrays.

Sean
Aug 25 2008
parent "Denis Koroskin" <2korden gmail.com> writes:
On Mon, 25 Aug 2008 22:25:42 +0400, Sean Kelly <sean invisibleduck.org>  
wrote:

 superdan wrote:
 Sean Kelly Wrote:

 == Quote from superdan (super dan.org)'s article
 Jerry Quinn Wrote:
 dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used  
 push_back() heavily to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code  
 is now
 actually faster than the C++ code, but the inner loops are less  
 readable.  For a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not  
 include the

 Other than the 4 bytes of extra overhead, does anyone see any  
 downside to
 including a capacity field?  Also, is there some situation I'm not  
 thinking of
 where the 4 bytes actually matter more than the speed gains in  
 question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be
making slice operations more difficult to implement correctly.
 That said, I think I still favor it, since I find I need to do  
 iterative appends to arrays fairly often.
Otherwise, it noticably restricts the places I can use built-in arrays.
 sorry to always be the one who puts the moose on the table. y'all who  
 want to make capacity a field in
the array are off base.
 today each slice has length and data. why. because all a slice is  
 *is* length and data. it's a view of a
portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.
then we have a field hanging in there that's only valid sometimes. that won't quite work. will it be copyable? like when you copy a slice to another. what happens with the capacity field?
That's the weird bit. The capacity of any copy of a slice or an array should be the length of the copy, so copying couldn't be done via memcpy on the array ref. More simply, you could probably just default the capacity to zero, but I don't like the idea of violating the logical invariant for array refs just to make copying / initialization easier. And the weirdness of this suggests to me (and to you as well I presume) that perhaps adding a capacity field isn't proper for built-in arrays. Sean
why they both have StringBuilders (which has a capacity field). The same can be applied to other array types, a library ArrayBuilder class is a good solution, I think.
Aug 25 2008
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"dsimcha" wrote
I just ran into this yesterday porting C++ code that used push_back() 
heavily to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable. 
 For a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not include 
 the
 capacity field in the array itself?  I think the problem here is the need 
 to call
 gc.capacity(), which can require a lock, as Sean said.  If you have the 
 capacity
 field embedded in the array itself, you never need a lock for ~= unless 
 the array
 actually needs to be realloced or is appended to by more than one thread. 
 I've
 written classes before that need fast appending and therefore cache the 
 capacity
 value every time the array is reallocated, and it makes it a ton faster 
 even in
 single threaded code.  On the other hand, this breaks the ABI for D 
 arrays, but D2
 breaks a lot of backwards compatibility anyhow, so maybe this is still a 
 good
 idea.  Does anyone actually write C functions that rely on this ABI and 
 then call
 them from D?  Furthermore, having a capacity field would also allow 
 reserve for
 builtin arrays.

 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not 
 thinking of
 where the 4 bytes actually matter more than the speed gains in question?
I see nothing wrong with having a library solution to this. D, and especially D2, is at the point where the syntax of custom structs is really easy to use.

The problem is that many times I don't append to an array or slice; why should I have to accept the cost of an extra 4-8 byte field for every slice and array that isn't going to change size (which I would argue is the more common case)?

The only thing missing right now is opImplicitCast (to go back and forth from a normal array).

-Steve
Aug 23 2008
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
 I see nothing wrong with having a library solution to this.  D and
 especially D2 is to the point where syntax of custom structs is really easy
 to use.
 The problem is that many times I don't append to an array or slice, why
 should I have to accept the cost of an extra 4-8 byte field for every slice
 and array that isn't going to change size (which I would argue is used more
 often)?
 The only thing missing right now is opImplicitCast (to go back and forth
 from a normal array).
 -Steve
You do have a point there. Builtin arrays work well for the typical cases, but no one array type could work well for every conceivable specialized case, especially if you count the case of needing minimal overhead as one possible specialized case. I've wanted to write my own specialized array classes to deal with these unusual cases for a while, but the lack of opImplicitCast is the only thing that prevents me from doing so, since I would want them to be able to play nice with builtin arrays.

A little off topic, but somewhat relevant: I wrote a scope array class for when you need to allocate a large temporary array for something like a mathematical computation or a merge sort, without constantly triggering GC or running into false pointer issues. It bypasses the GC, calls C's malloc, realloc and free, and, since it's a scope class, the memory is freed by the dtor.

Does anyone know if DMD's std.c.stdlib.malloc/free/realloc are threadsafe?
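Such a class might look roughly like this (a hypothetical sketch, not the actual class mentioned above):

import std.c.stdlib: malloc, realloc, free;

// Declared "scope", so instances must be stack-allocated and the destructor
// runs deterministically when the enclosing scope ends.
scope class TempArray(T)
{
    T[] data;

    this(size_t n)
    {
        // C heap, not the GC: no collection pauses, no false-pointer pinning
        data = (cast(T*) malloc(n * T.sizeof))[0 .. n];
    }

    void resize(size_t n)
    {
        data = (cast(T*) realloc(data.ptr, n * T.sizeof))[0 .. n];
    }

    ~this()
    {
        free(data.ptr);
    }
}

// Usage: scratch space for one pass of a computation.
// void compute(int[] input)
// {
//     scope tmp = new TempArray!(int)(input.length);
//     // ... use tmp.data ...
// }   // memory returned to the C heap here by the dtor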
Aug 23 2008
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Sat, Aug 23, 2008 at 05:50:08PM +0000, dsimcha wrote:
 Does anyone know if DMD's std.c.stdlib.malloc/free/realloc are threadsafe?
I wouldn't rely on it. Those are just interfaces to the platform's native C library, which may or may not be thread safe - the C spec doesn't say anything about threads at all (if I remember correctly).

--
Adam D. Ruppe
http://arsdnet.net
Aug 23 2008
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Adam D. Ruppe" wrote
 On Sat, Aug 23, 2008 at 05:50:08PM +0000, dsimcha wrote:
 Does anyone know if DMD's std.c.stdlib.malloc/free/realloc are 
 threadsafe?
I wouldn't rely on it. Those are just interfaces to the platform's native C library, which may or may not be thread safe - the C spec doesn't say anything about threads at all (if I remember correctly).
Although the spec may not say it, any modern OS has a thread safe malloc. You should have no problems. -Steve
Aug 23 2008
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:
 I see nothing wrong with having a library solution to this. [...]
 The problem is that many times I don't append to an array or slice, why 
 should I have to accept the cost of an extra 4-8 byte field for every slice 
 and array that isn't going to change size (which I would argue is used more 
 often)?
Probably a library solution makes you use even more memory (when you want to use it), and makes dynamic arrays less handy to use in D (but it may be better than nothing).

Array append is an operation rather common in higher-level programs; for example, in Python it's one of the most common operations, while the up-front creation of empty arrays is quite uncommon. I don't know how common push_back is in C++ programs. (In successive benchmarks I have seen that array append of integers with Psyco is about as fast as appending integers in D. A difference is that in Python integers are objects, around 20 bytes long.)

My idea was to do this change in a D branch, and see how programs like yours behave in the modified version: if your code at runtime has to use 100-200 bytes more than before, then I think this price may be accepted on modern PCs.

On the other hand, a large number of such size_t fields may be wasted when you manage tons of short strings, or when you manage many smallish nD arrays. That problem with nD arrays can be solved with a library that manages rectangular nD arrays, which avoids wasting space for the lengths of each row/etc. This means that another solution is to do the opposite thing: instead of adding library support for efficiently extensible arrays, such support may be for dynamically allocated fixed-size arrays... :-)

Bye,
bearophile
Aug 23 2008
prev sibling next sibling parent reply "Bill Baxter" <wbaxter gmail.com> writes:
On Sun, Aug 24, 2008 at 2:16 AM, Steven Schveighoffer
<schveiguy yahoo.com> wrote:
 "dsimcha" wrote
 I see nothing wrong with having a library solution to this.  D and
 especially D2 is to the point where syntax of custom structs is really easy
 to use.

 The problem is that many times I don't append to an array or slice, why
 should I have to accept the cost of an extra 4-8 byte field for every slice
 and array that isn't going to change size (which I would argue is used more
 often)?
Personally I'd argue that appending efficiently is important to more programs than saving 4-8 bytes here and there. So that would suggest that lean-and-mean fixed-sized arrays should be the library solution, and arrays with a capacity field should be the built-in type.

--bb
Aug 23 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Bill Baxter" wrote
 On Sun, Aug 24, 2008 at 2:16 AM, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:
 "dsimcha" wrote
 I see nothing wrong with having a library solution to this.  D and
 especially D2 is to the point where syntax of custom structs is really 
 easy
 to use.

 The problem is that many times I don't append to an array or slice, why
 should I have to accept the cost of an extra 4-8 byte field for every 
 slice
 and array that isn't going to change size (which I would argue is used 
 more
 often)?
Personally I'd argue that appending efficiently is important to more programs than than saving 4-8 bytes here and there. So that would suggest that lean-and-mean fixed-sized arrays should be the library solution, and arrays with a capacity field should be the built-in type.
Making all arrays (even ones that you create temporarily and never append to) have extra data to copy will add up performance-wise. I think there are only a few spots in any of my code where I use the append operator. So 'important' it is, but common, I would argue it is not. Of course, I only present anecdotal evidence :)

There's also the advantage of backwards compatibility, and not having to change the D spec.

-Steve
Aug 23 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:
 Making all arrays (even ones that you create temporarily and never append 
 to) have extra data to copy will add up performance-wise.
Benchmarks of new features help estimate such performance tradeoffs in real programs.
 I think there are 
 only a few spots in any of my code that I use the append operator.  So 
 'important' it is, but common, I would argue it is not.  Of course, I only 
 present anectodal evidence :)
In the Python world they solve this problem by taking a look at real code. So some projects in the dsource repository can be mined to estimate how often appending is used in real programs.

Note that a standard library is meant to be very efficient, because people use it as a source of building blocks to write normal programs, so I presume Tango/Phobos will have fewer array appends than average user-level code, so they may be excluded from such a database.
 There's also the advantage of backwards compatibility, and not having to 
 change the D spec.
It's probably meant as a change for D 2 only. Bye, bearophile
Aug 23 2008
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Bill Baxter (wbaxter gmail.com)'s article
 Personally I'd argue that appending efficiently is important to more
 programs than than saving 4-8 bytes here and there.  So that would
 suggest that lean-and-mean fixed-sized arrays should be the library
 solution, and arrays with a capacity field should be the built-in
 type.
 --bb
Definitely a reasonable argument, but just to play devil's advocate, let's assume that opImplicitCast is coming eventually. Most of the library implementations of arrays will presumably be implicitly castable to builtins, and library functions, especially standard lib functions, will probably use the builtin type as arguments. Implicit casting to builtin, in my view, is essential to using library implementations of arrays, because otherwise you have a mess of a zillion different incompatible array library implementations a la C++. If you have to create a dummy capacity field every time you implicitly cast your array to a builtin anyhow, that might defeat the purpose of a lightweight library implementation at times.
Aug 23 2008
prev sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sat, 23 Aug 2008 13:16:26 -0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:
 The problem is that many times I don't append to an array or slice, why
 should I have to accept the cost of an extra 4-8 byte field for every  
 slice
 and array that isn't going to change size (which I would argue is used  
 more
 often)?
I've written some micro benchmarks to test out the difference between an 8-byte struct (i.e. the current arrays) and a 16-byte struct (ptr, length, capacity, stride) with respect to function call performance. I wrote 2 benchmarks: 1) an actual function call (which calculated the exponential) and 2) an optimized struct copy (i.e. simulating pass by value).

In release -o or debug mode, (1) had performance decreases of -5.1%, 9.8% and 1.8% for passing 1, 2 and 3 structs respectively. When -inline is enabled this changes to 0.5%, 2.5% and 5% respectively.

For the (2) benchmark, I looked at using mmx and sse2 instructions to accelerate the struct copy. Using sse2 instructions provided a 15% performance gain using unaligned structs and an 84% gain using aligned structs. 16-byte SSE2 copies beat single 8-byte MMX copies by 9% (both aligned), though dual MMX (i.e. a 16-byte copy) still beat the built-in copy by 61% (again aligned). (All in -release -o -inline, on a Core2 CPU)

So the performance cost, even for very small workloads, looks to be pretty minor, and with some work performance could actually increase. (I'm assuming the use of sse/mmx/AltiVec for memcopy acceleration would be enabled by a compiler flag on 32-bit)
Aug 24 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Robert Jacques:
 I've written some micro benchmarks to test out the difference between an  
 8-byte struct (i.e. the current arrays) and a 16-byte struct (ptr, length,  
 capacity, stride)
You may want to address some of the things superdan has said.

And, why the stride? I presume it's useful if you want a 2D matrix. It may be useful, but maybe it's overkill, so you may want to benchmark just (ptr, length, capacity) too.

Another useful benchmark is to test a program that manages tons of short strings too, like this one:
http://en.wikipedia.org/wiki/D_language#Example_2

Bye,
bearophile
Aug 24 2008
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 24 Aug 2008 22:14:36 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:
 And, why the stride? I presume it's useful if you want a 2D matrix.
Yes, stride is a key enabler for integrating 2D/3D/nD matrices with built-in arrays. Other pros (a small sketch of such a strided view follows at the end of this message):
- Fast O(1) reversing of an array. Looping through an array in reverse was a common enough operation to create an additional keyword and language feature (foreach_reverse) specifically for it.
- Slicing out a struct element, i.e. slice the red channel out of an image or the first name from an Employee array.
- Accessing every Nth element. This is a common MATLAB operation and often used in visualization (i.e. a quick and simple resolution adjustment for a line plot). Also, some parallel algorithms benefit from interleaving data. This is a fairly standard multi-threading technique, particularly when processing data[x] may take longer than processing data[y] (though it's not the first thing I'd try).

And lastly, SSE works on 16-byte structs, so why not use the extra 4 bytes?
 It may be useful, but maybe it's overkill, so you may want to benchmark  
 just (ptr, length, capacity) too.
For a 12-byte struct, either {ptr, length, capacity} or {ptr, length, stride}, the performance decrease for -release -o -inline was -0.7%, 0.3% and 0.2% for 1, 2 and 3 arrays respectively and -0%, 8% and 0.8% respectively for debug.
 Another useful benchmark is to test a program that manages tons of short  
 strings too, like this one:  
 http://en.wikipedia.org/wiki/D_language#Example_2
Okay, looking just at the main loop (writefln is really slow) the performance cost of the extra 4-bytes for a capacity field is about 1.7% (naive cost, i.e no SSE, etc). I didn't actually implement a capacity field, so this is a worst case.
 You may want to address some of the things superdan has said.
Superdan raised the good point that if the GC is aware of the array length at creation, it might compact the capacity to allocate something else. To the best of my knowledge (which isn't saying much) both the Phobos and Tango GC are block based, and so are unaware of the array length. Also, Sean Kelly wrote the Tango GC and isn't raising any red flags, which is probably a good sign it's a valid optimization.
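The strided view mentioned earlier in this message could be sketched like this (hypothetical, for illustration only):

struct Strided(T)
{
    byte*     ptr;     // address of the first element
    size_t    length;  // number of elements visible through the view
    ptrdiff_t stride;  // distance between consecutive elements, in bytes (may be negative)

    T opIndex(size_t i)
    {
        return *cast(T*)(ptr + cast(ptrdiff_t) i * stride);
    }
}

// Every 3rd element of an int[] a:
//   s.ptr = cast(byte*) a.ptr;  s.length = (a.length + 2) / 3;  s.stride = 3 * int.sizeof;
// The same array reversed, in O(1) and without copying:
//   r.ptr = cast(byte*) (a.ptr + a.length - 1);  r.length = a.length;
//   r.stride = -cast(ptrdiff_t) int.sizeof;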
Aug 24 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
Robert Jacques wrote:
 On Sun, 24 Aug 2008 22:14:36 -0400, bearophile 
 <bearophileHUGS lycos.com> wrote:
 And, why the stride? I presume it's useful if you want a 2D matrix.
Yes, stride is a key enabler for integrating 2D/3D/nD matrices with built-in arrays. Other pros: - Fast O(1) reversing of an array. Looping through an array in reverse was a common enough operation to create an additional keyword and language feature (foreach_reverse) specifically for it. - Slicing out a struct element. i.e. slice the red channel out of an image or the first name from an Employee array. - Accessing every Nth element. This is a common MATLAB operation and often used in visualization (i.e. a quick and simple resolution adjustment for a line plot). Also, some parallel algorithms benefit from interleaving data. This is a fairly standard multi-threading technique, particularly when processing data[x] may take longer than processing data[y] (though it's not the first thing I'd try).
That's very convincing. Seems like stride ought to be in there.

But it also got me thinking about what other 4-byte value could be stuffed into that extra space (satisfying the 16-byte alignment requirement), and I can't help but yearn for hashcode memoization.

Strings are, arguably, the most common types of arrays in most programs. And strings are, arguably, the most common key type in associative arrays. And every time you perform an associative array lookup, you have to re-compute the hashcode (which is silly for immutable strings, since the hashcode can't ever change).

In languages where Strings are objects (please!!! please, Walter!!!) you can do things like memoize the hashcode to avoid redundant future calculations. If Strings are mutable, you can mark a "dirty" bit when the string is changed, lazily recalculating a memoized hashcode based on that dirty bit. But only if you have a field for memoizing the hashcode in the first place.

So, while a stride field seems very useful for some very interesting special cases, to me it seems like the more compelling general case is for a memoized hashcode.

--benji
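The memoization described above could be sketched like this (hypothetical; written as a class purely for illustration, since D strings are plain arrays today):

class MutableString
{
    private char[] data;
    private hash_t cachedHash;
    private bool   dirty = true;   // no hash cached yet

    this(char[] s) { data = s.dup; }

    void opIndexAssign(char c, size_t i)
    {
        data[i] = c;
        dirty = true;              // any write invalidates the cached hash
    }

    override hash_t toHash()
    {
        if (dirty)
        {
            cachedHash = typeid(char[]).getHash(&data);
            dirty = false;
        }
        return cachedHash;
    }
}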
Aug 25 2008
next sibling parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 Robert Jacques wrote:
 On Sun, 24 Aug 2008 22:14:36 -0400, bearophile 
 <bearophileHUGS lycos.com> wrote:
 And, why the stride? I presume it's useful if you want a 2D matrix.
Yes, stride is a key enabler for integrating 2D/3D/nD matrices with built-in arrays. Other pros: - Fast O(1) reversing of an array. Looping through an array in reverse was a common enough operation to create an additional keyword and language feature (foreach_reverse) specifically for it. - Slicing out a struct element. i.e. slice the red channel out of an image or the first name from an Employee array. - Accessing every Nth element. This is a common MATLAB operation and often used in visualization (i.e. a quick and simple resolution adjustment for a line plot). Also, some parallel algorithms benefit from interleaving data. This is a fairly standard multi-threading technique, particularly when processing data[x] may take longer than processing data[y] (though it's not the first thing I'd try).
That's very convincing. Seems like stride ought to be in there. But it also got me thinking about what other 4-byte value could be stuffed into that extra space (satisfying the 16-bit alignment requirement), and I can't help but yearn for hashcode memoization. Strings are, arguably, the most common types of arrays in most programs. And strings are, arguably the most common key-type in associative arrays. And every time you perform an associative array lookup, you have to re-compute the hashcode (which is silly for immutable strings, since the hashcode can't ever change). In languages where Strings are objects (please!!! please, Walter!!!)
ok, i'll bite. what exactly are you so desperately asking for?
Aug 25 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
superdan wrote:
 In languages where Strings are objects (please!!! please, Walter!!!)
ok, i'll bite. what exactly are you so desperately asking for?
For strings to be implemented as objects, rather than as arrays of characters. --benji
Aug 25 2008
parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 superdan wrote:
 In languages where Strings are objects (please!!! please, Walter!!!)
ok, i'll bite. what exactly are you so desperately asking for?
For strings to be implemented as objects, rather than as arrays of characters. --benji
ok i assume u mean class objects. because current arrays are objects. only not of class type. and what advantages would that have. let me tell u of one disadvantage fer starters. with builtin arrays of characters you can efficiently implement classes. with classes you can't efficiently implement arrays of characters.
Aug 25 2008
parent Benji Smith <dlanguage benjismith.net> writes:
superdan wrote:
 Benji Smith Wrote:
 
 superdan wrote:
 In languages where Strings are objects (please!!! please, Walter!!!)
ok, i'll bite. what exactly are you so desperately asking for?
For strings to be implemented as objects, rather than as arrays of characters. --benji
ok i assume u mean class objects. because current arrays are objects. only not of class type. and what advantages would that have. let me tell u of one disadvantage fer starters. with builtin arrays of characters you can efficiently implement classes. with classes you can't efficiently implement arrays of characters.
I'll start a new thread, so as not to hijack this one further. --benji
Aug 25 2008
prev sibling next sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 25 Aug 2008 11:41:51 -0400, Benji Smith <dlanguage benjismith.net>  
wrote:

 I can't help but yearn for hashcode memoization.

 Strings are, arguably, the most common types of arrays in most programs.  
   And strings are, arguably the most common key-type in associative  
 arrays. And every time you perform an associative array lookup, you have  
 to re-compute the hashcode (which is silly for immutable strings, since  
 the hashcode can't ever change).

 In languages where Strings are objects (please!!! please, Walter!!!) you  
 can do things like memoize the hashcode to avoid redundant future  
 calculations. If Strings are mutable, you can mark a "dirty" bit when  
 the string is changed, lazily recalculating a memoized hashcode based on  
 that dirty bit. But only if you have a field for memoizing the hashcode  
 in the first place.
This only works if strings are objects, as there's an aliasing issue, so it won't work for D arrays. Consider:

string a = "foo";
string b = a;
int[string] counts;
counts[b] = 1;
a[0] = 'b';   // b's dirty bit isn't set
counts[b]++;  // calls counts["foo"]++ and not counts["boo"]++
 So, while a stride field seems very useful for some very interesting  
 special cases,
Well, I would argue that games, scientific computing, throughput computing and (lightweight) databases aren't special cases.
Aug 25 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Robert Jacques:
 This only works if strings are objects as there's an aliasing issue, so it  
 won't work for D arrays.
Well, D 2.x strings (and their slices) are immutable, so there is no problem. I agree that for mutable data structures (that can be modified through references) storing the hash value is dangerous.

A compromise may be to have mutable arrays but immutable slices; that may solve the problem you show. Slices are small, so having them immutable is probably not too bad.

Bye,
bearophile
Aug 25 2008
parent reply "Denis Koroskin" <2korden gmail.com> writes:
On Mon, 25 Aug 2008 21:51:33 +0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 Robert Jacques:
 This only works if strings are objects as there's an aliasing issue, so  
 it
 won't work for D arrays.
Well, D 2.x strings (and their slices) are immutable, so there is no problem. I agree that for mutable data structures (that can be modified from references) storing the hash value is dangerous. A compromise may be to have mutable arrays but immutable slices, that may solve the problem you show. Slices are small, so having them immutable is probably not too much bad. Bye, bearophile
I don't follow you. Immutability of the slice implies that the whole array is immutable, too. Besides, how often do you take a hash of an array other than string?
Aug 25 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Denis Koroskin:
 I don't follow you.
After posting my message I have realized that my expressed thoughts are wrong, I am sorry. The topic looks simple, but it is not for me. This newsgroup is useful to learn intellectual humility better.
Immutability of the slice implies that the whole array is immutable, too.<
Well, only if immutability is transitive :-)

Storing the hash value for a mutable data structure that can have references is dangerous; better to avoid doing it. But if a data structure is transitively immutable then storing its hash value may be positive if you use a lot of hashing (the hash value can be computed lazily only when required; a value of 0 may mean not computed yet. Python uses -1 for that purpose). Finding some compromise between those two extremes seems tricky to me.

D 2.0 strings are immutable, so storing their hash value may be positive. The presence of such a stored hash value may even be activated or not by a global compilation constant given to the compiler :o)
 Besides, how often do you take a hash of an array other than string?
In D I have never done it, so I presume not often. (In Python I use the hash value of immutable arrays often, because they are more handy than D structs.) Simple related note: capacity of the array must not be used to compute its hash value. Bye, bearophile
Aug 25 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Benji Smith:
 So, while a stride field seems very useful for some very interesting 
 special cases, to me it seems like the more compelling general case is 
 for a memoized hashcode.
Rectangular 2D arrays enjoy a stride as 4th field, while (usually 1D) strings enjoy having a hash value there. The two cases are very distinct, with rare mixed cases. So an ugly idea: if the 4th array is 1D the field can be the hash value, if it's 2D it can contain the stride ;-) (But if superdan is right, all this discussion may need to be re-built on different basis). On the other hand, the capacity field isn't much useful for 2D arrays (because you usually append to 1D arrays), so you end with just 3 fields with 2D arrays and 4 fields for 1D arrays :-) So, if you want to keep both structures of len 4 (assuming pointers and size_t having the same size) you can add a second stride, useful for 3D arrays too ;-) Bye, bearophile
Aug 25 2008
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 25 Aug 2008 12:51:30 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:
 Rectangular 2D arrays enjoy a stride as 4th field, while (usually 1D)  
 strings enjoy having a hash value there. The two cases are very  
 distinct, with rare mixed cases.
 So an ugly idea: if the 4th array is 1D the field can be the hash value,  
 if it's 2D it can contain the stride ;-) (But if superdan is right, all  
 this discussion may need to be re-built on different basis).
 On the other hand, the capacity field isn't much useful for 2D arrays  
 (because you usually append to 1D arrays), so you end with just 3 fields  
 with 2D arrays and 4 fields for 1D arrays :-) So, if you want to keep  
 both structures of len 4 (assuming pointers and size_t having the same  
 size) you can add a second stride, useful for 3D arrays too ;-)
I was referring to 1D arrays with strides. For 2D/3D/ND arrays you need a length and a stride per dimension in order to support array slices, interfacing to both C and/or Fortran code bases, etc. So an ND array looks like:

    struct { T* ptr; size_t[N] lengths; ptrdiff_t[N] byte_strides; }

However, regarding your idea, as caching the hash makes sense for invariant arrays and capacity would mostly be used during string building, you could overload the field based on type. However, this means neither optimization works for const arrays, which might not be a good thing as it encourages code bloat.
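For the 2-D case, element access in such a layout is just a dot product of indices and byte strides; a hypothetical sketch:

struct Matrix(T)
{
    T*           ptr;
    size_t[2]    lengths;        // [rows, columns]
    ptrdiff_t[2] byte_strides;   // byte step to the next row, to the next column

    T opIndex(size_t i, size_t j)
    {
        // address = base + i * row_stride + j * column_stride (all in bytes)
        return *cast(T*)(cast(byte*) ptr
                         + cast(ptrdiff_t) i * byte_strides[0]
                         + cast(ptrdiff_t) j * byte_strides[1]);
    }
}
// A transpose is then just swapping lengths[0]<->lengths[1] and
// byte_strides[0]<->byte_strides[1]; no elements move.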
Aug 25 2008
parent "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 25 Aug 2008 18:45:30 -0400, Robert Jacques <sandford jhu.edu>  
wrote:

 On Mon, 25 Aug 2008 12:51:30 -0400, bearophile  
 <bearophileHUGS lycos.com> wrote:
 Rectangular 2D arrays enjoy a stride as 4th field, while (usually 1D)  
 strings enjoy having a hash value there. The two cases are very  
 distinct, with rare mixed cases.
 So an ugly idea: if the 4th array is 1D the field can be the hash  
 value, if it's 2D it can contain the stride ;-) (But if superdan is  
 right, all this discussion may need to be re-built on different basis).
 On the other hand, the capacity field isn't much useful for 2D arrays  
 (because you usually append to 1D arrays), so you end with just 3  
 fields with 2D arrays and 4 fields for 1D arrays :-) So, if you want to  
 keep both structures of len 4 (assuming pointers and size_t having the  
 same size) you can add a second stride, useful for 3D arrays too ;-)
I was refering to 1D arrays with strides. For 2D/3D/ND arrays you need a length and a stride per dimension in order to support array slices, interfacing to both C and/or Fortran code bases, etc. So a ND array looks like { T* ptr; size_t[N] lengths; ptrdiff_t[N] byte_strides; } However, regarding your idea, as caching the has makes sense for invarient arrays and capacity would mostly be used during string bulding, you could overload the field based on type. However, this means neither optimization works for const arrays, which might not be a good thing as it encorages code bloat.
Using these optimizations with const could be done with either a bit-flag (in length) or by exploiting the block nature of the GC, which leads to capacities with (at least 4, I think) zero low-order bits. P.S. byte_strides should probably go above lengths, as that gives a better memory layout.
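
A rough sketch of that tagging idea, assuming the GC really does hand out capacities whose low bits are always zero; the type and method names here are made up for illustration:

struct TaggedWord
{
    size_t raw;

    enum size_t hashFlag = 1;

    bool isHash() const { return (raw & hashFlag) != 0; }

    size_t capacity() const
    {
        assert(!isHash());
        return raw;                      // stored as-is; low bit known to be clear
    }

    size_t hash() const
    {
        assert(isHash());
        return raw >> 1;                 // shift the flag bit back out
    }

    static TaggedWord fromCapacity(size_t cap)
    {
        assert((cap & hashFlag) == 0);   // relies on GC block alignment
        return TaggedWord(cap);
    }

    static TaggedWord fromHash(size_t h)
    {
        return TaggedWord((h << 1) | hashFlag);
    }
}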
Aug 25 2008
prev sibling parent JAnderson <ask me.com> writes:
Benji Smith wrote:
 Robert Jacques wrote:
 On Sun, 24 Aug 2008 22:14:36 -0400, bearophile 
 <bearophileHUGS lycos.com> wrote:
 And, why the stride? I presume it's useful if you want a 2D matrix.
Yes, stride is a key enabler for integrating 2D/3D/nD matrices with built-in arrays. Other pros:
- Fast O(1) reversing of an array. Looping through an array in reverse was a common enough operation to create an additional keyword and language feature (foreach_reverse) specifically for it.
- Slicing out a struct element. i.e. slice the red channel out of an image or the first name from an Employee array.
- Accessing every Nth element. This is a common MATLAB operation and often used in visualization (i.e. a quick and simple resolution adjustment for a line plot). Also, some parallel algorithms benefit from interleaving data. This is a fairly standard multi-threading technique, particularly when processing data[x] may take longer than processing data[y] (though it's not the first thing I'd try).
That's very convincing. Seems like stride ought to be in there. But it also got me thinking about what other 4-byte value could be stuffed into that extra space (satisfying the 16-bit alignment requirement), and I can't help but yearn for hashcode memoization. Strings are, arguably, the most common types of arrays in most programs. And strings are, arguably the most common key-type in associative arrays. And every time you perform an associative array lookup, you have to re-compute the hashcode (which is silly for immutable strings, since the hashcode can't ever change). In languages where Strings are objects (please!!! please, Walter!!!) you can do things like memoize the hashcode to avoid redundant future calculations. If Strings are mutable, you can mark a "dirty" bit when the string is changed, lazily recalculating a memoized hashcode based on that dirty bit. But only if you have a field for memoizing the hashcode in the first place.
I've had to use a stringbuilder to get around the very slow string pooling. I think lazily evaluating it would be a good idea. -Joel
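
For illustration, a minimal user-level sketch of the memoized-hash idea from the quoted post; the hash function is just a placeholder and this is not how D's strings are actually implemented:

struct MemoString
{
    string data;                  // invariant contents, so the hash never goes stale
    private size_t cachedHash;
    private bool hashValid;

    size_t toHash()
    {
        if (!hashValid)
        {
            size_t h = 5381;      // djb2-style hash, purely illustrative
            foreach (c; data)
                h = (h * 33) ^ c;
            cachedHash = h;
            hashValid = true;     // computed once, reused for every lookup
        }
        return cachedHash;
    }
}

With immutable contents the cached value never needs invalidation; a mutable variant would add the dirty bit described in the quoted post.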
Aug 25 2008
prev sibling parent JAnderson <ask me.com> writes:
dsimcha wrote:
 I just ran into this yesterday porting C++ code that used push_back() heavily
 to D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable.  For
 a whopping overhead of 4 extra bytes per array (on 32-bit), why not include the
 capacity field in the array itself?  I think the problem here is the need to
 call gc.capacity(), which can require a lock, as Sean said.  If you have the
 capacity field embedded in the array itself, you never need a lock for ~= unless
 the array actually needs to be realloced or is appended to by more than one
 thread.  I've written classes before that need fast appending and therefore
 cache the capacity value every time the array is reallocated, and it makes it a
 ton faster even in single threaded code.  On the other hand, this breaks the ABI
 for D arrays, but D2 breaks a lot of backwards compatibility anyhow, so maybe
 this is still a good idea.  Does anyone actually write C functions that rely on
 this ABI and then call them from D?  Furthermore, having a capacity field would
 also allow reserve for builtin arrays.
 
 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
Another solution would be to cache the capacity on the stack for the array so it's only called once. That wouldn't be quite as fast but wouldn't break the ABI. -Joel
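
As a sketch of the same idea at user level (not a change to the runtime or the ABI): an appender that tracks its own capacity, so the hot path never has to ask the GC. The names and growth policy are invented for illustration:

struct IntAppender
{
    private int[] buf;     // allocated storage; buf.length plays the role of the cached capacity
    private size_t used;   // number of elements actually appended

    void put(int x)
    {
        if (used == buf.length)                       // rare slow path: grow and re-cache
            buf.length = buf.length ? buf.length * 2 : 16;
        buf[used++] = x;                              // hot path: a plain store, no GC query
    }

    int[] data() { return buf[0 .. used]; }
}

// usage: the loop body never calls into the GC to ask for the capacity
void example(int n)
{
    IntAppender app;
    foreach (i; 0 .. n)
        app.put(i);
    assert(app.data.length == n);
}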
Aug 23 2008
prev sibling next sibling parent reply "Lionello Lunesu" <lionello lunesu.remove.com> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:g8ni6o$12k2$1 digitalmars.com...
 Note2: In the D code the final deallocation of the array is fast enough, 
 and its allocation is fast too (you can see this timing just the for loop 
 in the middle of the program). The really slow part seems to be the loop 
 itself.
If the others are right, D's probably using std.gc.capacity to constantly check the capacity of the array.

Interestingly, the Phobos GC caches the pointer/size in a call to "capacity", so that requesting the capacity of the same pointer immediately returns its capacity without any look-up. See function sizeOfNoSync, in internal/gc/gcx.d. Additionally, if there are no other threads, the GC lock is not taken, so capacity() should be O(1). It would be interesting to check whether the cached pointer/size in the GC is actually being used.

If it is not, it would result in a call to findSize() which in turn calls findPool(). findPool() is O(n), n = number of pools. The pooltable is already sorted, so this can be made O(log n). findSize() then does a linear search to check the size of the alloced block, which is O(m), m = size of block. Result: getting the capacity of a pointer is O(m+n). There's a lot of room for improvement right there.

L.
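
A sketch of the O(log n) pool lookup suggested here, assuming the pool table is kept sorted by base address; the Pool layout and names are simplified stand-ins, not the actual gcx.d types:

struct Pool
{
    ubyte* base;     // start of the pool's memory
    size_t size;     // pool size in bytes
}

// Binary search over a pool table sorted by base address.
Pool* findPool(Pool[] pooltable, void* q)
{
    auto p = cast(ubyte*) q;
    size_t lo = 0, hi = pooltable.length;
    while (lo < hi)
    {
        size_t mid = (lo + hi) / 2;
        Pool* pool = &pooltable[mid];
        if (p < pool.base)
            hi = mid;
        else if (p >= pool.base + pool.size)
            lo = mid + 1;
        else
            return pool;        // q lies inside this pool
    }
    return null;                // not a pointer into the GC heap
}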
Aug 24 2008
parent reply "Lionello Lunesu" <lionello lunesu.remove.com> writes:
Nevermind. None of that explained why bearophile's program is so slow. This 
does:

int[] ~= int is using _d_arrayappendcT (from internal/gc/gc.d), which 
appends an int by appending its four bytes! Blasphemy! Now we have great 
super-duper-efficient array operations using SSEx but a simple append of an 
int to an int[] is using a "rep movs" to append 4 bytes. X-(

Also, the append code first checks whether the array should be resized, and 
if not, jumps to the actual append code. The common case should not be 
branching: { if (too small) goto resize; append: ... return; resize: ... 
goto append; }
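
For illustration, a user-level sketch of that layout: the common case falls straight through to the store, and only the rare grow case branches away. This is ordinary D, not the runtime's _d_arrayappendcT, and the helper name and growth policy are invented here:

void appendInt(ref int[] a, ref size_t cap, int x)
{
    auto n = a.length;
    if (n >= cap)
        goto Lgrow;                    // rare path, kept out of the hot sequence
Lappend:
    a = a.ptr[0 .. n + 1];             // extend the slice in place
    a[n] = x;                          // one 4-byte store, no "rep movs"
    return;

Lgrow:
    auto bigger = new int[cap ? cap * 2 : 16];
    bigger[0 .. n] = a[];              // copy the existing elements
    a = bigger[0 .. n];
    cap = bigger.length;
    goto Lappend;
}

The caller keeps cap next to the slice, which is exactly the per-array capacity field being debated in this thread.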

The use of _d_arrayappendcT is hardcoded in the D compiler (e2ir.c) so I 
don't know how to create different code paths for different primitive 
(sizes) other than to add many ifs to _d_arrayappendcT. IIRC, TypeInfo 
contained a pointer to a method for fast array comparisons. Shouldn't a 
similar thing be added for array appends?

BTW, the capacity look-up was indeed using the cached values. Meaning, it's 
not "capacity" we should be focussing on, but the general append code. 
However, if you were to append to two different arrays in a similar loop, 
the performance would be terrible, since each array would overwrite the 
cache with its own ptr/size, resulting in the slow O(m+n) for each capacity 
look-up.

L. 
Aug 24 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Lionello Lunesu:
However, if you were to append to two different arrays in a similar loop, the
performance would be terrible, since each array would overwrite the cache with
its own ptr/size, resulting in the slow O(m+n) for each capacity look-up.<
Yes, there's something that deserves some improvements, probably at various levels.

I have modified the benchmarks in the way you have explained (see below). With n = 4 million the running time of the C++ version is barely measurable (0.04 s), while the D version takes 32.2 s (I didn't have enough patience to see it finish with n = 10 million).

Bye,
bearophile

------------------
C++ code:

#include "stdio.h"
#include <vector>

using namespace std;

int main(int argc, char **argv) {
    int n = argc >= 2 ? atoi(argv[1]) : 1000;

    vector<int> a1;
    a1.reserve(n);
    vector<int> a2;
    a2.reserve(n);

    for (int i = 0; i < n; ++i) {
        a1.push_back(i);
        a2.push_back(i);
    }

    printf("%d %d\n", a1[n - 1], a2[n - 1]);

    return 0;
}

------------------
D code:

import std.conv: toInt;
import std.gc: malloc, hasNoPointers;

int main(string[] args) {
    int n = args.length >= 2 ? toInt(args[1]) : 1000;

    int* aux_ptr1 = cast(int*)malloc(n * int.sizeof);
    hasNoPointers(aux_ptr1);
    int[] a1 = aux_ptr1[0 .. 0];

    int* aux_ptr2 = cast(int*)malloc(n * int.sizeof);
    hasNoPointers(aux_ptr2);
    int[] a2 = aux_ptr2[0 .. 0];

    for (int i; i < n; i++) {
        a1 ~= i;
        a2 ~= i;
    }

    printf("%d %d\n", a1[n - 1], a2[n - 1]);
    return 0;
}
Aug 25 2008
prev sibling next sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 25 Aug 2008 01:41:51 -0400, Lionello Lunesu  
<lionello lunesu.remove.com> wrote:

 BTW, the capacity look-up was indeed using the cached values. Meaning,  
 it's not "capacity" we should be focussing on, but the general append  
 code. However, if you were to append to two different arrays in a  
 similar loop, the performance would be terrible, since each array would  
 overwrite the cache with its own ptr/size, resulting in the slow O(m+n)  
 for each capacity look-up.
Such as in this case: http://en.wikipedia.org/wiki/D_language#Example_2
Aug 25 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Robert Jacques:
 Such as in this case: http://en.wikipedia.org/wiki/D_language#Example_2
For the people interested, here are few slides written by the good Hettinger that show how dict/set/lists work in Python (it shows how the array append too works): http://www.pycon.it/static/pycon2/slides/containers.ppt Bye, bearophile
Aug 25 2008
prev sibling parent JAnderson <ask me.com> writes:
Lionello Lunesu wrote:
 Nevermind. None of that explained why bearophile's program is so slow. 
 This does:
 
 int[] ~= int is using _d_arrayappendcT (from internal/gc/gc.d), which 
 appends an int by appending its four bytes! Blasphemy! Now we have great 
 super-duper-efficient array operations using SSEx but a simple append of 
 an int to an int[] is using a "rep movs" to append 4 bytes. X-(
 
 Also, the append code first checks whether the array should be resized, 
 and if not, jumps to the actual append code. The common case should not 
 be branching: { if (too small) goto resize; append: ... return; resize: 
 ... goto append; }
 
 The use of _d_arrayappendcT is hardcoded in the D compiler (e2ir.c) so I 
 don't know how to create different code paths for different primitive 
 (sizes) other than to add many ifs to _d_arrayappendcT. IIRC, TypeInfo 
 contained a pointer to a method for fast array comparisons. Shouldn't a 
 similar thing be added for array appends?
 
 BTW, the capacity look-up was indeed using the cached values. Meaning, 
 it's not "capacity" we should be focussing on, but the general append 
 code. However, if you were to append to two different arrays in a 
 similar loop, the performance would be terrible, since each array would 
 overwrite the cache with its own ptr/size, resulting in the slow O(m+n) 
 for each capacity look-up.
 
 L.
Great stuff! -Joel
Aug 25 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
The bugzilla report on this suggests replacing []=[] with memcpy. But 
this benchmark shows this isn't a consistent win:

import std.c.string;

long timer()
{
     asm
     {   naked                   ;
         rdtsc                   ;
         ret                     ;
     }
}

void test1(byte[] a, byte[] b)
{
     a[] = b[];
}

void test2(byte[] a, byte[] b)
{
     memcpy(a.ptr, b.ptr, a.length);
}

void main()
{
     for (int i = 4; i < 100_000_000; i *= 2)
     {
         auto a = new byte[i];
         auto b = new byte[i];

         auto start = timer();
         test1(a, b);
         auto end = timer();
         auto r1 = end - start;

         start = timer();
         test2(a, b);
         end = timer();
         auto r2 = end - start;

         printf("i: %8d,\t[]=[]: %8lld,\tmemcpy: %8lld\n", i, r1, r2);
     }
}

Running this program produces:

i:        4,    []=[]:      144,        memcpy:      568
i:        8,    []=[]:      144,        memcpy:      300
i:       16,    []=[]:      172,        memcpy:      324
i:       32,    []=[]:      204,        memcpy:      344
i:       64,    []=[]:      288,        memcpy:      276
i:      128,    []=[]:      288,        memcpy:      272
i:      256,    []=[]:      352,        memcpy:      364
i:      512,    []=[]:      372,        memcpy:      424
i:     1024,    []=[]:      552,        memcpy:      564
i:     2048,    []=[]:      684,        memcpy:     1384
i:     4096,    []=[]:     1344,        memcpy:     1772
i:     8192,    []=[]:     2900,        memcpy:     3216
i:    16384,    []=[]:     5292,        memcpy:     5472
i:    32768,    []=[]:    11496,        memcpy:    10388
i:    65536,    []=[]:    29484,        memcpy:    27480
i:   131072,    []=[]:   110464,        memcpy:    67784
i:   262144,    []=[]:   655580,        memcpy:   562400
i:   524288,    []=[]:  1204124,        memcpy:  1107256
i:  1048576,    []=[]:  2364588,        memcpy:  2272552
i:  2097152,    []=[]:  4516440,        memcpy:  4417764
i:  4194304,    []=[]:  8996992,        memcpy:  8817176
i:  8388608,    []=[]: 20223908,        memcpy: 17717748
i: 16777216,    []=[]: 35774952,        memcpy: 36094652
i: 33554432,    []=[]: 71008068,        memcpy: 71246896
i: 67108864,    []=[]: 142982284,       memcpy: 145473300

There's not much of a consistent conclusion to be drawn from that.
Aug 26 2008
parent reply Lionello Lunesu <lio lunesu.remove.com> writes:
The problem is that the timing for the small arrays cannot be trusted as 
there's more overhead than actual code being tested. I've changed your 
code by doing each test 100_000_000/i times. I've also added 'cpuid' for 
synchronization, as Don suggested:

import std.c.string;

long timer()
{
     asm
     {   naked                   ;
         push ECX;
         push EBX;
         mov EAX, 0              ;
         cpuid                   ;
         pop EBX;
         pop ECX;
         rdtsc                   ;

         ret                     ;
     }
}

void test1(byte[] a, byte[] b)
{
     a[] = b[];
}

void test2(byte[] a, byte[] b)
{
     memcpy(a.ptr, b.ptr, a.length);
}

void main()
{
     for (int i = 4; i < 100_000_000; i *= 2)
     {
         auto b = new byte[i];
         auto a = new byte[i];

         test1(a,b);//pre-cache

         auto count = 100_000_000 / i;
         auto start = timer();
         while (count--)
             test1(a, b);
         auto end = timer();
         auto r1 = end - start;

         count = 100_000_000 / i;
         start = timer();
         while (count--)
             test2(a, b);
         end = timer();
         auto r2 = end - start;

         printf("i: %8d,\t[]=[]: %8lld,\tmemcpy: %8lld\n", i, r1, r2);
     }
}

Results on my Core2 Duo:

i:        4,    []=[]: 2752539660,      memcpy: 429537312
i:        8,    []=[]: 1004458383,      memcpy: 229984623
i:       16,    []=[]: 492054948,       memcpy: 128590560
i:       32,    []=[]: 309960657,       memcpy: 135892809
i:       64,    []=[]: 204971832,       memcpy: 79507359
i:      128,    []=[]: 131925789,       memcpy: 52222626
i:      256,    []=[]: 73768896,        memcpy: 40184559
i:      512,    []=[]: 43827948,        memcpy: 28040859
i:     1024,    []=[]: 29030985,        memcpy: 21078432
i:     2048,    []=[]: 21645027,        memcpy: 18953199
i:     4096,    []=[]: 18099945,        memcpy: 16022259
i:     8192,    []=[]: 15905007,        memcpy: 15399036
i:    16384,    []=[]: 15567399,        memcpy: 15795747
i:    32768,    []=[]: 32244507,        memcpy: 30654063
i:    65536,    []=[]: 30824595,        memcpy: 31286475
i:   131072,    []=[]: 32248179,        memcpy: 30950757
i:   262144,    []=[]: 34880868,        memcpy: 31166667
i:   524288,    []=[]: 31220802,        memcpy: 35813277
i:  1048576,    []=[]: 42463341,        memcpy: 42121386
i:  2097152,    []=[]: 122218776,       memcpy: 118839150
i:  4194304,    []=[]: 111864294,       memcpy: 110864988
i:  8388608,    []=[]: 110600226,       memcpy: 107638083
i: 16777216,    []=[]: 97014645,        memcpy: 95789574
i: 33554432,    []=[]: 79344468,        memcpy: 78277374
i: 67108864,    []=[]: 77437629,        memcpy: 78478767

Conclusion: memcpy wins big time for i <= 8192.
Aug 27 2008
next sibling parent Lionello Lunesu <lio lunesu.remove.com> writes:
Lionello Lunesu wrote:
 Results on my Core2 Duo:
D'oh! I should have let Don guess it! L.
Aug 27 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Lionello Lunesu:
 Conclusion: memcpy wins big time for i <= 8192.
I have modified your code a little:

import std.c.string: memcpy;

long timer() {
    asm {
        naked;
        push ECX;
        push EBX;
        mov EAX, 0;
        cpuid;
        pop EBX;
        pop ECX;
        rdtsc;
        ret;
    }
}

void main() {
    const int N = 500_000_000;
    alias ubyte DataType;

    for (int i = 4; i < N; i *= 2) {
        auto b = new DataType[i];
        auto a = new DataType[i];
        a[] = b[]; // pre-cache

        auto count = N / i;
        auto start = timer();
        while (count--)
            a[] = b[];
        auto end = timer();
        auto t1 = end - start;

        count = N / i;
        start = timer();
        while (count--)
            memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
        end = timer();
        auto t2 = end - start;

        count = N / i;
        start = timer();
        while (count--)
            for (int j; j < i; j++)
                a[j] = b[j];
        end = timer();
        auto t3 = end - start;

        count = N / i;
        start = timer();
        while (count--) {
            auto pa = a.ptr;
            auto pb = b.ptr;
            int nloop = i;
            while (nloop--)
                *pa++ = *pb++;
        }
        end = timer();
        auto t4 = end - start;

        printf("i: %8d,\t[]=[]: %8lld, \tt1/t2: %.2f \tt1/t3: %.2f \tt1/t4: %.2f\n",
               i, t1, cast(float)t1/t2, cast(float)t1/t3, cast(float)t1/t4);
    }
}

The timings:

N = 500_000_000, DataType = ubyte:
i: 4, []=[]: 9919922030, t1/t2: 4.64 t1/t3: 5.64 t1/t4: 5.27
i: 8, []=[]: 4957640530, t1/t2: 4.63 t1/t3: 3.59 t1/t4: 2.93
i: 16, []=[]: 2478916670, t1/t2: 4.10 t1/t3: 2.08 t1/t4: 1.55
i: 32, []=[]: 1537003970, t1/t2: 2.28 t1/t3: 1.40 t1/t4: 0.99
i: 64, []=[]: 1019418780, t1/t2: 2.59 t1/t3: 0.97 t1/t4: 0.67
i: 128, []=[]: 659281070, t1/t2: 2.54 t1/t3: 0.60 t1/t4: 0.58
i: 256, []=[]: 364781180, t1/t2: 1.90 t1/t3: 0.35 t1/t4: 0.34
i: 512, []=[]: 217207030, t1/t2: 1.56 t1/t3: 0.21 t1/t4: 0.21
i: 1024, []=[]: 143970290, t1/t2: 1.37 t1/t3: 0.14 t1/t4: 0.14
i: 2048, []=[]: 107283300, t1/t2: 1.21 t1/t3: 0.11 t1/t4: 0.11
i: 4096, []=[]: 88801730, t1/t2: 1.11 t1/t3: 0.08 t1/t4: 0.08
i: 8192, []=[]: 79605200, t1/t2: 1.06 t1/t3: 0.07 t1/t4: 0.07
i: 16384, []=[]: 77892400, t1/t2: 0.99 t1/t3: 0.07 t1/t4: 0.07
i: 32768, []=[]: 156926780, t1/t2: 1.03 t1/t3: 0.16 t1/t4: 0.16
i: 65536, []=[]: 154375350, t1/t2: 1.01 t1/t3: 0.15 t1/t4: 0.15
i: 131072, []=[]: 154758850, t1/t2: 1.01 t1/t3: 0.15 t1/t4: 0.15
i: 262144, []=[]: 155175500, t1/t2: 1.01 t1/t3: 0.15 t1/t4: 0.15
i: 524288, []=[]: 159647160, t1/t2: 1.01 t1/t3: 0.16 t1/t4: 0.16
i: 1048576, []=[]: 832590470, t1/t2: 1.00 t1/t3: 0.79 t1/t4: 0.79
i: 2097152, []=[]: 783795470, t1/t2: 1.00 t1/t3: 0.76 t1/t4: 0.75
i: 4194304, []=[]: 828801230, t1/t2: 1.00 t1/t3: 0.79 t1/t4: 0.78
i: 8388608, []=[]: 813813550, t1/t2: 1.00 t1/t3: 0.78 t1/t4: 0.78
i: 16777216, []=[]: 772143790, t1/t2: 1.00 t1/t3: 0.77 t1/t4: 0.76
i: 33554432, []=[]: 772579250, t1/t2: 1.00 t1/t3: 0.78 t1/t4: 0.78
i: 67108864, []=[]: 840918700, t1/t2: 1.00 t1/t3: 0.84 t1/t4: 0.83
i: 134217728, []=[]: 644072850, t1/t2: 1.00 t1/t3: 0.77 t1/t4: 0.77
i: 268435456, []=[]: 446295660, t1/t2: 0.99 t1/t3: 0.79 t1/t4: 0.78

So for DataType = ubyte:
  i = 4: T3 better
  i in [8, 32768]: T2 better
  i > 32768: T1 and T2 equally good

N = 500_000_000, DataType = uint:
i: 4, []=[]: 9789704410, t1/t2: 3.72 t1/t3: 6.00 t1/t4: 4.87
i: 8, []=[]: 6088621920, t1/t2: 2.25 t1/t3: 4.62 t1/t4: 3.46
i: 16, []=[]: 4049677930, t1/t2: 2.58 t1/t3: 3.31 t1/t4: 2.48
i: 32, []=[]: 2620962840, t1/t2: 2.53 t1/t3: 2.31 t1/t4: 1.67
i: 64, []=[]: 1451602120, t1/t2: 1.89 t1/t3: 1.39 t1/t4: 0.94
i: 128, []=[]: 867937320, t1/t2: 1.56 t1/t3: 0.79 t1/t4: 0.74
i: 256, []=[]: 575548140, t1/t2: 1.37 t1/t3: 0.55 t1/t4: 0.53
i: 512, []=[]: 428628560, t1/t2: 1.21 t1/t3: 0.42 t1/t4: 0.41
i: 1024, []=[]: 355175800, t1/t2: 1.11 t1/t3: 0.35 t1/t4: 0.35
i: 2048, []=[]: 318880440, t1/t2: 1.06 t1/t3: 0.32 t1/t4: 0.31
i: 4096, []=[]: 307818740, t1/t2: 1.02 t1/t3: 0.30 t1/t4: 0.30
i: 8192, []=[]: 628989600, t1/t2: 1.03 t1/t3: 0.62 t1/t4: 0.62
i: 16384, []=[]: 618088340, t1/t2: 1.01 t1/t3: 0.61 t1/t4: 0.61
i: 32768, []=[]: 620081160, t1/t2: 1.01 t1/t3: 0.62 t1/t4: 0.61
i: 65536, []=[]: 622162940, t1/t2: 1.01 t1/t3: 0.62 t1/t4: 0.61
i: 131072, []=[]: 633820200, t1/t2: 1.00 t1/t3: 0.61 t1/t4: 0.61
i: 262144, []=[]: 3239626180, t1/t2: 1.00 t1/t3: 1.05 t1/t4: 1.04
i: 524288, []=[]: 3145529870, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04
i: 1048576, []=[]: 3319243820, t1/t2: 1.00 t1/t3: 1.05 t1/t4: 1.05
i: 2097152, []=[]: 3292952240, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.03
i: 4194304, []=[]: 3159184090, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04
i: 8388608, []=[]: 3261583900, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04
i: 16777216, []=[]: 3462604700, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.03
i: 33554432, []=[]: 2971217250, t1/t2: 0.99 t1/t3: 1.05 t1/t4: 1.05
i: 67108864, []=[]: 3136278270, t1/t2: 0.99 t1/t3: 1.03 t1/t4: 1.03

So for DataType = uint:
  i in [4, 16]: T3 better
  i in [32, 8192]: T2 better
  i > 16384: T1 and T2 equally good

N = 500_000_000, DataType = ulong:
i: 4, []=[]: 12176650300, t1/t2: 2.26 t1/t3: 5.67 t1/t4: 5.10
i: 8, []=[]: 8095535110, t1/t2: 2.58 t1/t3: 4.39 t1/t4: 3.69
i: 16, []=[]: 5241679510, t1/t2: 2.53 t1/t3: 3.15 t1/t4: 2.49
i: 32, []=[]: 2904270580, t1/t2: 1.89 t1/t3: 1.83 t1/t4: 1.41
i: 64, []=[]: 1733200310, t1/t2: 1.56 t1/t3: 1.12 t1/t4: 0.85
i: 128, []=[]: 1149144250, t1/t2: 1.37 t1/t3: 1.02 t1/t4: 0.70
i: 256, []=[]: 856914580, t1/t2: 1.21 t1/t3: 0.80 t1/t4: 0.54
i: 512, []=[]: 711247630, t1/t2: 1.11 t1/t3: 0.69 t1/t4: 0.46
i: 1024, []=[]: 638308500, t1/t2: 1.06 t1/t3: 0.63 t1/t4: 0.42
i: 2048, []=[]: 614686480, t1/t2: 0.99 t1/t3: 0.57 t1/t4: 0.40
i: 4096, []=[]: 1254471740, t1/t2: 1.02 t1/t3: 0.70 t1/t4: 0.69
i: 8192, []=[]: 1238218710, t1/t2: 1.01 t1/t3: 0.69 t1/t4: 0.68
i: 16384, []=[]: 1240384550, t1/t2: 1.01 t1/t3: 0.69 t1/t4: 0.68
i: 32768, []=[]: 1244624520, t1/t2: 1.01 t1/t3: 0.69 t1/t4: 0.69
i: 65536, []=[]: 1272774050, t1/t2: 1.00 t1/t3: 0.69 t1/t4: 0.69
i: 131072, []=[]: 6532720530, t1/t2: 1.00 t1/t3: 1.05 t1/t4: 1.05
i: 262144, []=[]: 6315104980, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04
i: 524288, []=[]: 6663898750, t1/t2: 1.00 t1/t3: 1.05 t1/t4: 1.05
i: 1048576, []=[]: 6562860030, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04
i: 2097152, []=[]: 6322608930, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04

So for DataType = ulong:
  i in [4, 16]: T3 better
  i in [32, 4096]: T2 better
  i > 8192: T1 and T2 equally good

CPU used: Pentium Dual-Core E2180, with caches 64 KB L1 (32 KB data, 32 KB program) and 1 MB L2.

But there are ways to go much faster:
http://cdrom.amd.com/devconn/events/gdc_2002_amd.pdf
http://files.rsdn.ru/23380/AMD_block_prefetch_paper.pdf

Bye,
bearophile
Aug 27 2008
parent reply Lionello Lunesu <lio lunesu.remove.com> writes:
--snip--

I think it safe to conclude that memcpy (T2) is the fastest all-round 
solution. Only the inlined custom code can beat it.

What's more, to use memcpy only one line in gc.d needs to be changed. 
Both T3 and T4 would need a change in the compiler.

L.
Aug 27 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Lionello Lunesu:

 I think it safe to conclude that memcpy (T2) is the fastest all-round 
 solution. Only the inlined custom code can beat it.
 What's more, to use memcpy only one line in gc.d needs to be changed. 
 Both T3 and T4 would need a change in the compiler.
From more benchmarks I have seen that if you don't use ASM (with prefetching, etc.) and/or more clever code, the following is the fastest for DataTypes from ubyte to ulong:

template Range(int stop) {
    static if (stop <= 0)
        alias Tuple!() Range;
    else
        alias Tuple!(Range!(stop-1), stop-1) Range;
}

switch (i) {
    case 0: break;
    case 1: foreach (j; Range!(1)) a[j] = b[j]; break;
    case 2: foreach (j; Range!(2)) a[j] = b[j]; break;
    case 3: foreach (j; Range!(3)) a[j] = b[j]; break;
    case 4: foreach (j; Range!(4)) a[j] = b[j]; break;
    case 5: foreach (j; Range!(5)) a[j] = b[j]; break;
    case 6: foreach (j; Range!(6)) a[j] = b[j]; break;
    case 7: foreach (j; Range!(7)) a[j] = b[j]; break;
    case 8: foreach (j; Range!(8)) a[j] = b[j]; break;
    default: memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
}

A possible disadvantage may come from the fact that such code requires several instructions, so it increases the traffic in the L1 code cache.

Bye,
bearophile
Aug 27 2008
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Lionello Lunesu:
 I think it safe to conclude that memcpy (T2) is the fastest all-round
 solution. Only the inlined custom code can beat it.
 What's more, to use memcpy only one line in gc.d needs to be changed.
 Both T3 and T4 would need a change in the compiler.
From more benchmarks I have seen that if you don't use ASM (with prefetching, etc) and/or more clever code, the following
is the faster from DataType from ubyte to ulong:
 template Range(int stop) {
     static if (stop <= 0)
         alias Tuple!() Range;
     else
         alias Tuple!(Range!(stop-1), stop-1) Range;
 }
 switch (i) {
     case 0: break;
     case 1: foreach (j; Range!(1)) a[j] = b[j]; break;
     case 2: foreach (j; Range!(2)) a[j] = b[j]; break;
     case 3: foreach (j; Range!(3)) a[j] = b[j]; break;
     case 4: foreach (j; Range!(4)) a[j] = b[j]; break;
     case 5: foreach (j; Range!(5)) a[j] = b[j]; break;
     case 6: foreach (j; Range!(6)) a[j] = b[j]; break;
     case 7: foreach (j; Range!(7)) a[j] = b[j]; break;
     case 8: foreach (j; Range!(8)) a[j] = b[j]; break;
     default: memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
 }
Why not just change that to this instead:

switch(i) {
    case 8: a[7] = b[7];
    case 7: a[6] = b[6];
    case 6: a[5] = b[5];
    case 5: a[4] = b[4];
    case 4: a[3] = b[3];
    case 3: a[2] = b[2];
    case 1: a[0] = b[0];
    case 0: break;
    default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
}

The Range stuff is cool and all, but not really necessary :-)

Sean
 A possible disadvantage may come from the fact that such code requires several
instructions, so it increases the traffic in the
L1 code cache.
 Bye,
 bearophile
Aug 27 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Sean Kelly:
 Why not just change that to this instead:
 
 switch(i) {
     case 8: a[7] = b[7];
     case 7: a[6] = b[6];
     case 6: a[5] = b[5];
     case 5: a[4] = b[4];
     case 4: a[3] = b[3];
     case 3: a[2] = b[2];
     case 1: a[0] = b[0];
     case 0: break;
     default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
 }
 
 The Range stuff is cool and all, but not really necessary :-)
I have benchmarked many alternative versions, yours too, but mine is significantly faster than yours on the DMD I am using. Don't ask me why.

You can find Range!([start,] stop[, step]) in my libs, of course ;-)

Bye,
bearophile
Aug 27 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-27 17:20:55 +0200, bearophile <bearophileHUGS lycos.com> said:

 Sean Kelly:
 Why not just change that to this instead:
 
 switch(i) {
 case 8: a[7] = b[7];
 case 7: a[6] = b[6];
 case 6: a[5] = b[5];
 case 5: a[4] = b[4];
 case 4: a[3] = b[3];
 case 3: a[2] = b[2];
 case 1: a[0] = b[0];
 case 0: break;
 default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
 }
 
 The Range stuff is cool and all, but not really necessary :-)
I have benchmarked many alternative versions, your one too, but mine is significantly faster than your one on the DMD I am using. Don't ask me why.
memory ordering probably, looping from 0 to 7 is faster than 7 to 0...
 
 You can find the Range!([start,] stop[, step]) in my libs, of course ;-)
 
 Bye,
 bearophile
Aug 27 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-28 00:22:21 +0200, Fawzi Mohamed <fmohamed mac.com> said:

 On 2008-08-27 17:20:55 +0200, bearophile <bearophileHUGS lycos.com> said:
 
 Sean Kelly:
 Why not just change that to this instead:
 
 switch(i) {
 case 8: a[7] = b[7];
 case 7: a[6] = b[6];
 case 6: a[5] = b[5];
 case 5: a[4] = b[4];
 case 4: a[3] = b[3];
 case 3: a[2] = b[2];
 case 1: a[0] = b[0];
 case 0: break;
 default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
 }
 
 The Range stuff is cool and all, but not really necessary :-)
I have benchmarked many alternative versions, your one too, but mine is significantly faster than your one on the DMD I am using. Don't ask me why.
memory ordering probably, looping from 0 to 7 is faster than 7 to 0...
so

int j=j;
switch(i) {
    case 8: a[j] = b[j]; ++j;
    case 7: a[j] = b[j]; ++j;
    case 6: a[j] = b[j]; ++j;
    case 5: a[j] = b[j]; ++j;
    case 4: a[j] = b[j]; ++j;
    case 3: a[j] = b[j]; ++j;
    case 1: a[j] = b[j];
    case 0: break;
    default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
}

should be better...

Fawzi
 
 
 You can find the Range!([start,] stop[, step]) in my libs, of course ;-)
 
 Bye,
 bearophile
Aug 27 2008
parent bearophile <bearophileHUGS lycoc.com> writes:
Fawzi Mohamed:
 memory ordering probably, looping from 0 to 7 is faster than 7 to 0...
I don't think that's the problem.
 so
...
 should be better...
It is not, it seems. For reference, here is some code you can run:

import std.c.string: memcpy;

template Tuple(T...) {
    alias T Tuple;
}

long timer() {
    asm {
        naked;
        push ECX;
        push EBX;
        mov EAX, 0;
        cpuid;
        pop EBX;
        pop ECX;
        rdtsc;
        ret;
    }
}

template Range(int stop) {
    static if (stop <= 0)
        alias Tuple!() Range;
    else
        alias Tuple!(Range!(stop-1), stop-1) Range;
}

void main() {
    const int N = 500_000_000;
    alias uint DataType;

    for (int i = 4; i < N; i *= 2) {
        auto b = new DataType[i];
        auto a = new DataType[i];
        a[] = b[]; // pre-cache

        auto count = N / i;
        auto start = timer();
        while (count--)
            a[] = b[];
        auto end = timer();
        auto t1 = end - start;

        count = N / i;
        start = timer();
        while (count--)
            memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
        end = timer();
        auto t2 = end - start;

        count = N / i;
        start = timer();
        while (count--)
            for (int j; j < i; j++)
                a[j] = b[j];
        end = timer();
        auto t3 = end - start;

        count = N / i;
        start = timer();
        while (count--) {
            auto pa = a.ptr;
            auto pb = b.ptr;
            int nloop = i;
            while (nloop--)
                *pa++ = *pb++;
        }
        end = timer();
        auto t4 = end - start;

        count = N / i;
        start = timer();
        while (count--) {
            // apparently the faster version
            switch (i) {
                case 0: break;
                case 1: foreach (j; Range!(1)) a[j] = b[j]; break;
                case 2: foreach (j; Range!(2)) a[j] = b[j]; break;
                case 3: foreach (j; Range!(3)) a[j] = b[j]; break;
                case 4: foreach (j; Range!(4)) a[j] = b[j]; break;
                case 5: foreach (j; Range!(5)) a[j] = b[j]; break;
                case 6: foreach (j; Range!(6)) a[j] = b[j]; break;
                case 7: foreach (j; Range!(7)) a[j] = b[j]; break;
                case 8: foreach (j; Range!(8)) a[j] = b[j]; break;
                default: memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
            }
        }
        end = timer();
        auto t5 = end - start;

        count = N / i;
        start = timer();
        while (count--) {
            int j = i;
            switch(i) {
                case 8: a[j] = b[j]; ++j;
                case 7: a[j] = b[j]; ++j;
                case 6: a[j] = b[j]; ++j;
                case 5: a[j] = b[j]; ++j;
                case 4: a[j] = b[j]; ++j;
                case 3: a[j] = b[j]; ++j;
                case 1: a[j] = b[j];
                case 0: break;
                default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
            }
        }
        end = timer();
        auto t6 = end - start;

        printf("i: %8d,\t[]=[]: %8lld, \tt1/t2: %.2f \tt1/t3: %.2f \tt1/t4: %.2f \tt1/t5: %.2f \tt1/t6: %.2f\n",
               i, t1, cast(float)t1/t2, cast(float)t1/t3, cast(float)t1/t4, cast(float)t1/t5, cast(float)t1/t6);
    }
}

Bye,
bearophile
Aug 27 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Lionello Lunesu wrote:
 The problem is that the timing for the small arrays cannot be trusted as 
 there's more overhead than actual code being tested. I've changed your 
 code by doing each test 100_000_000/i times. I've also added 'cpuid' for 
 synchronization, as Don suggested:
I suspect that copying the same data over and over again is not representative of actual usage, and so is not giving useful results. I also do not understand the usage of cpuid here. I've used rdtsc for years for profiling purposes, and it has given me results that match the clock-on-the-wall results.
Aug 28 2008
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Essentially, I'm having a real hard time believing that for the 4 byte 
case, memcpy is faster. Take a look at the actual implementation of 
memcpy, vs the code generated by the compiler.
Aug 28 2008
parent reply "Lionello Lunesu" <lionello lunesu.remove.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:g95mh3$2e5i$2 digitalmars.com...
 Essentially, I'm having a real hard time believing that for the 4 byte 
 case, memcpy is faster. Take a look at the actual implementation of 
 memcpy, vs the code generated by the compiler.
I agree with you, but I've noticed the byte[] = byte[] emits "rep movsb" and that must be the slowest way there is to copy 4 bytes. In fact, from my demo scene experience I've learned to not use those CISC instructions like rep and movs* and loop. mov/dec/jnz always proved faster. However, I do think that it's a good idea to move the []=[] code into the run-time library. That way specialized versions can be written, like the other array operations. (Maybe it's already in the lib, but I haven't found it. I've only seen mention of OPmemcpy in e2ir.c which I assume represents the asm code for a memcpy.) L.
Aug 28 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Lionello Lunesu wrote:
 
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g95mh3$2e5i$2 digitalmars.com...
 Essentially, I'm having a real hard time believing that for the 4 byte 
 case, memcpy is faster. Take a look at the actual implementation of 
 memcpy, vs the code generated by the compiler.
I agree with you, but I've noticed the byte[] = byte[] emits "rep movsb" and that must be the slowest way there is to copy 4 bytes. In fact, from my demo scene experience I've learned to not use those CISC instructions like rep and movs* and loop. mov/dec/jnz always proved faster.
But for 4 bytes, how much slower could it be than memcpy, which executes a lot of setup code?
 However, I do think that it's a good idea to move the []=[] code into 
 the run-time library. That way specialized versions can be written, like 
 the other array operations. (Maybe it's already in the lib, but I 
 haven't found it. I've only seen mention of OPmemcpy in e2ir.c which I 
 assume represents the asm code for a memcpy.)
OPmemcpy is an operator supported by the compiler back end; it generates the movsb's.
Aug 28 2008
parent "Lionello Lunesu" <lionello lunesu.remove.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:g96sji$2s1a$1 digitalmars.com...
 But for 4 bytes, how much slower could it be than memcpy, which executes a 
 lot of setup code?
I don't know. I do know that I noticed a speed difference and that you asked me personally to do some more timings. And I've provided you with detailed timings, which prove, against popular belief, that memcpy is faster than rep movsb. Now apparently the conclusion is that A) "copying the same data over and over again is not representative of actual usage", and B) "[you're] having a real hard time believing that for the 4 byte case, memcpy is faster." I guess this subject is closed, but somehow I'm left with a somewhat unsatisfied feeling.. L.
Aug 29 2008
prev sibling parent reply Don <nospam nospam.com.au> writes:
Walter Bright wrote:
 Lionello Lunesu wrote:
 The problem is that the timing for the small arrays cannot be trusted 
 as there's more overhead than actual code being tested. I've changed 
 your code by doing each test 100_000_000/i times. I've also added 
 'cpuid' for synchronization, as Don suggested:
I suspect that copying the same data over and over again is not representative of actual usage, and so is not giving useful results. I also do not understand the usage of cpuid here. I've used rdtsc for years for profiling purposes, and it has given me results that match the clock-on-the-wall results.
Using cpuid only once doesn't work. The rationale ultimately comes from here, I think: http://cs.smu.ca/~jamuir/rdtscpm1.pdf

But you've got me thinking. So, serialisation only happens with cpuid. But, WHEN is lack of serialisation actually a problem? On my Pentium M, rdtsc takes 13 uops, all using execution port p0. This means that anything on the other ports could execute after it. The only instructions with a latency longer than 13 are div, aam, fdiv, the transcendental floating point instructions, and the bizarro instructions aam, fbld, fbstp, and cpuid. Pretty similar for Core2 and AMD64. So you may be right -- as long as you don't have one of those super-long latency instructions just before your rdtsc call, cpuid is probably doing more harm than good. The conventional wisdom could well be wrong.
Aug 28 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 Using cpuid only once doesn't work. The rationale ultimately comes from 
 here, I think:
 
 http://cs.smu.ca/~jamuir/rdtscpm1.pdf
Ah, that makes sense now. Thanks!
Aug 28 2008