
digitalmars.D - Array append performance

reply bearophile <bearophileHUGS lycos.com> writes:
I have performed another small benchmark of dynamic array append, similar to one
I have run in the past: a comparison between C++'s push_back and D's ~=, both
performed after a "reserve".

To perform the "reserve" in D I have used std.gc.malloc instead of the simpler:
auto a = new int[n];
a.length = 0;
to actually speed up the D code a little, avoiding a (fast but useless) memory
cleaning.

I have used MinGW 4.2.1 and DMD v.1.034.

Compiling the C++ with:
-O3 -s
and the D with:
-O -release -inline

The following results are with n = 100_000_000 (about 400 MB RAM used):
  D:   7.94 s
  C++: 0.60 s  (13.2X)

Note that if n is 10 or even 100 times smaller the situation doesn't change
much:
n = 10_000_000:
  D:   0.83 s
  C++: 0.09 s

Note2: In the D code the final deallocation of the array is fast enough, and
its allocation is fast too (you can see this by timing just the for loop in the
middle of the program). The really slow part seems to be the loop itself.

Note3: If a reserve isn't used (that is, in both programs the arrays aren't
pre-allocated), the difference in performance becomes smaller, about 8.6X with
n = 100_000_000.

This is the asm generated by g++:
http://codepad.org/hXNDH0se

Bye,
bearophile

---------------------------
C++ code:

#include "stdio.h"
#include <vector>

using namespace std;

int main(int argc, char **argv) {
    int n = argc >= 2 ? atoi(argv[1]) : 1000;

    vector<int> a;
    a.reserve(n);
    for (int i = 0; i < n; ++i)
        a.push_back(i);

    printf("%d\n", a[n-1]);

    return 0;
}

---------------------------
D code:

import std.conv: toInt;
import std.gc: malloc, hasNoPointers;
import std.c.stdio: printf;

int main(string[] args) {
    int n = args.length >= 2 ? toInt(args[1]) : 1000;

    // "reserve": allocate the whole block up front, then start from length 0
    int* aux_ptr = cast(int*)malloc(n * int.sizeof);
    hasNoPointers(aux_ptr);  // the block holds no pointers, so the GC needn't scan it
    int[] a = aux_ptr[0 .. 0];

    for (int i; i < n; i++)
        a ~= i;

    printf("%d\n", a[n - 1]);
    return 0;
}
Aug 22 2008
next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
bearophile wrote:
 I have performed another small benchmark of the dynamic array append, similar
to one I have performed in the past, a comparison between the C++ push_back of
C++ and the ~= of D, both performed after a "reserve".
 
 To perform the "reserve" in D I have used std.gc.malloc instead of the simpler:
 auto a = new int[n];
 a.length = 0;
 to actually speed up the D code a little, avoiding a (fast but useless) memory
cleaning.
 
 I have used MinGW 4.2.1 and DMD v.1.034.
 
 Compiling the C++ with:
 -O3 -s
 and the D with:
 -O -release -inline
 
 The following results are with n = 100_000_000 (about 400 MB RAM used):
   D:   7.94 s
   C++: 0.60 s  (13.2X)
 
 Note that if n is 10 or even 100 times smaller the situation doesn't change
much:
 n = 10_000_000:
   D:   0.83 s
   C++: 0.09 s
Unfortunately, this is expected. On every append the runtime checks to see if there is enough room in the buffer for the new data, which requires a call to gc.capacity(). So best case you're paying for a bunch of math, and worst case you have to acquire the GC mutex as well.

That said, this isn't really an entirely fair comparison, since you're comparing the performance of a structured, local user-level object vs. a minimal, language-level feature. I know that D arrays are supposed to be roughly comparable to std::vector or std::string, and they are from a semantics perspective, but when the only local data they contain is a pointer and a length there's little that can be done to optimize certain behavior.

And you've just happened to pinpoint the greatest performance problem with D arrays :-)

Sean
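For illustration, the per-append check described here amounts to something like the following sketch. It is a rough approximation, not the actual runtime source; it uses the D1 std.gc interface and a simple doubling policy.

import std.gc: malloc, capacity;

// What "a ~= x" roughly has to do today: ask the GC how big the block is,
// then either grow the slice in place or move everything to a bigger block.
int[] append(int[] a, int x)
{
    size_t needed = (a.length + 1) * int.sizeof;
    if (a.ptr && capacity(a.ptr) >= needed)
    {
        a = a.ptr[0 .. a.length + 1];            // room left: grow in place
    }
    else
    {
        int* p = cast(int*) malloc(needed * 2);  // no room: new block, copy over
        p[0 .. a.length] = a[0 .. a.length];
        a = p[0 .. a.length + 1];
    }
    a[a.length - 1] = x;
    return a;
}

The point is that the capacity query sits on the hot path of every single ~=, which is where the time in the benchmark goes.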
Aug 22 2008
parent reply dsimcha <dsimcha yahoo.com> writes:
I just ran into this yesterday porting C++ code that used push_back() heavily to D. Ended up rewriting the inner loops to pre-allocate. The D code is now actually faster than the C++ code, but the inner loops are less readable.

For a whopping overhead of 4 extra bytes per array (on 32-bit), why not include the capacity field in the array itself? I think the problem here is the need to call gc.capacity(), which can require a lock, as Sean said. If you have the capacity field embedded in the array itself, you never need a lock for ~= unless the array actually needs to be realloced or is appended to by more than one thread. I've written classes before that need fast appending and therefore cache the capacity value every time the array is reallocated, and it makes it a ton faster even in single threaded code.

On the other hand, this breaks the ABI for D arrays, but D2 breaks a lot of backwards compatibility anyhow, so maybe this is still a good idea. Does anyone actually write C functions that rely on this ABI and then call them from D? Furthermore, having a capacity field would also allow reserve for builtin arrays.

Other than the 4 bytes of extra overhead, does anyone see any downside to including a capacity field? Also, is there some situation I'm not thinking of where the 4 bytes actually matter more than the speed gains in question?
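As a layout sketch of what is being proposed here (hypothetical, not the actual D array ABI):

// Today a dynamic array reference is just { size_t length; T* ptr; }.
// The proposal adds one more word:
struct FatArray(T)
{
    size_t length;
    T*     ptr;
    size_t capacity;   // elements the block can hold before a reallocation is needed
}
// With this, "a ~= x" only needs to talk to the GC (and take its lock) when
// length >= capacity; every other append is a store and a length increment.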
Aug 22 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
First of all thank you very much Sean Kelly for your clear explanation of the
situation.

dsimcha:
 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
In 2D/nD arrays you repeat such a size_t many times, so it's not just 4/8 bytes. (This particular problem can be solved with rectangular/equal-length dynamic arrays, but it may be too much for a built-in type...)

Are dynamic array appends common? I think they may be used often enough to justify such a memory increase.

I think an experimental approach here may be the best way to answer such a question. I mean creating a branch version of DMD where dynamic arrays have a 3-item long defining struct (12/24 bytes) and using it to benchmark some real programs or benchmarks, to see how the performance in memory/speed of the compiler changes. How much time/work does it require to create a DMD with such a change?

Note that a similar branched DMD can be useful to test the performance of another change Walter was talking about regarding array structs: changing the len-ptr struct of the arrays to a startptr-stopptr struct.

Bye,
bearophile
Aug 22 2008
parent "Robert Jacques" <sandford jhu.edu> writes:
On Fri, 22 Aug 2008 22:23:04 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:
 Note that a similar branched DMD can be useful to test the performance  
 of another change Walter was talking regarding array structs: changing  
 the len-ptr struct of the arrays to a startptr-stopptr struct.
On a similar note, there's also fast strided arrays. (See my blog:
http://dobbscodetalk.com/index.php?option=com_myblog&show=Making-strided-arrays-as-fast-as-native-arrays.html&Itemid=29 )
Aug 22 2008
prev sibling next sibling parent reply Jerry Quinn <jlquinn optonline.net> writes:
dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used push_back() heavily
to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable.  For
a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not include the
 
 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
For starters, you need 8 bytes on 64-bit systems. I suspect a more important reason might be making slice operations more difficult to implement correctly.

That said, I think I still favor it, since I find I need to do iterative appends to arrays fairly often. Otherwise, it noticeably restricts the places I can use built-in arrays.

An alternative might be a convenient simple library wrapper that provides these kinds of dynamic operations, while still allowing easy access to the underlying array. I'm thinking perhaps std.array might include an appender object that you create, append to, and then grab the underlying array from once you're done using it. It's not hard to write, but I'm sure it would be a welcome addition to the library, even if core arrays don't give us this.

BTW, std.array seems to be missing documentation.
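A minimal sketch of the kind of appender described above (hypothetical code, not an existing std.array type): it caches the capacity of its buffer so ~= rarely has to query the GC, and hands the built-in array back when you're done.

import std.gc: malloc, capacity, hasNoPointers;

struct Appender(T)
{
    private T[] buf;       // the slice currently in use
    private size_t cap;    // cached capacity of buf's block, in elements

    void opCatAssign(T x)  // app ~= x
    {
        if (buf.length == cap)
            grow();
        buf = buf.ptr[0 .. buf.length + 1];
        buf[buf.length - 1] = x;
    }

    T[] get()              // grab the underlying array when done
    {
        return buf;
    }

    private void grow()
    {
        size_t newCap = cap ? cap * 2 : 16;
        T* p = cast(T*) malloc(newCap * T.sizeof);
        hasNoPointers(p);  // fine for plain value types like int; skip for types holding references
        p[0 .. buf.length] = buf[0 .. buf.length];
        buf = p[0 .. buf.length];
        cap = capacity(p) / T.sizeof;  // the GC may hand back a bigger block than requested
    }
}

// Usage:
//   Appender!(int) app;
//   for (int i; i < n; i++)
//       app ~= i;
//   int[] a = app.get();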
Aug 22 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Jerry Quinn:
I suspect a more important reason might be making slice operations more
difficult to implement correctly.<
Can you explain why?

Bye,
bearophile
Aug 22 2008
prev sibling parent reply superdan <super dan.org> writes:
Jerry Quinn Wrote:

 dsimcha Wrote:
 
 I just ran into this yesterday porting C++ code that used push_back() heavily
to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable.  For
a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not include the
 
 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be making slice operations more difficult to implement correctly. That said, I think I still favor it, since I find I need to do iterative appends to arrays fairly often. Otherwise, it noticably restricts the places I can use built-in arrays.
sorry to always be the one who puts the moose on the table. y'all who want to make capacity a field in the array are off base. today each slice has length and data. why. because all a slice is *is* length and data. it's a view of a portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. can't be a field just before the actual data in the slice either. for obvious reasons.
 An alternative might be a convenient simple library wrapper that provides
these kinds of dynamic operations, while still allowing easy access to the
underlying array.  I'm thinking prehaps std.array might include an appender
object that you create, append to, and then grab the underlying array from,
once you're done using it.  It's not hard to write, but I'm sure it would be a
welcome addition to the library, even if core arrays don't give us this.
that makes more sense. a sort of in-memory stream thing.
Aug 22 2008
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from superdan (super dan.org)'s article
 sorry to always be the one who puts the moose on the table. y'all who want to
make capacity a field in the array are off base.
 today each slice has length and data. why. because all a slice is *is* length
and data. it's a view of a portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice.

A capacity field would work fine with slicing if, whenever you take a slice into an array, the capacity field of the slice is set to zero implicitly, even though the length is longer than zero. This is in effect how it works now. Try calling std.gc.capacity() on a pointer to a slice. It will always return 0 b/c a slice doesn't point to the start of a block. This means that if you try to increase the size of a slice it will always be reallocated. Simply setting the capacity field of slices to 0 when they are created would replicate this in arrays w/ a capacity field.
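In code, the rule described above might look like this (hypothetical, reusing the FatArray layout sketched earlier in the thread):

// How taking a slice would interact with a capacity field.
FatArray!(int) slice(FatArray!(int) a, size_t lo, size_t hi)
{
    FatArray!(int) s;
    s.ptr      = a.ptr + lo;
    s.length   = hi - lo;
    s.capacity = 0;   // no spare room: appending to the slice always reallocates,
                      // mirroring std.gc.capacity() == 0 for interior pointers today
    return s;
}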
Aug 22 2008
parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 == Quote from superdan (super dan.org)'s article
 sorry to always be the one who puts the moose on the table. y'all who want to
make capacity a field in the array are off base.
 today each slice has length and data. why. because all a slice is *is* length
and data. it's a view of a portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. A capacity field would work fine with slicing if, whenever you take a slice into an array, the capacity field of the slice is set to zero implicitly, even though the length is longer than zero. This is in effect how it works now. Try calling std.gc.capacity() on a pointer to a slice. It will always return 0 b/c a slice doesn't point to the start of a block. This means that if you try to increase the size of a slice it will always be reallocated. Simply setting the capacity field of slices to 0 when they are created would replicate this in arrays w/ a capacity field.
The Tango runtime is a bit different in that it checks to see if the array is a slice explicitly--it seems safer than relying on the GC to always return a zero capacity for an interior pointer. Sean
Aug 23 2008
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from superdan (super dan.org)'s article
 Jerry Quinn Wrote:
 dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used push_back() heavily
to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable.  For
a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not include the

 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be
making slice operations more difficult to implement correctly.
 That said, I think I still favor it, since I find I need to do iterative
appends to arrays fairly often.
Otherwise, it noticably restricts the places I can use built-in arrays.
 sorry to always be the one who puts the moose on the table. y'all who want to
make capacity a field in
the array are off base.
 today each slice has length and data. why. because all a slice is *is* length
and data. it's a view of a
portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice.

I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.

Sean
Aug 23 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Sean Kelly" wrote
 == Quote from superdan (super dan.org)'s article
 Jerry Quinn Wrote:
 dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used push_back() 
 heavily to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code 
 is now
 actually faster than the C++ code, but the inner loops are less 
 readable.  For a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not 
 include the

 Other than the 4 bytes of extra overhead, does anyone see any 
 downside to
 including a capacity field?  Also, is there some situation I'm not 
 thinking of
 where the 4 bytes actually matter more than the speed gains in 
 question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be
making slice operations more difficult to implement correctly.
 That said, I think I still favor it, since I find I need to do 
 iterative appends to arrays fairly often.
Otherwise, it noticably restricts the places I can use built-in arrays.
 sorry to always be the one who puts the moose on the table. y'all who 
 want to make capacity a field in
the array are off base.
 today each slice has length and data. why. because all a slice is *is* 
 length and data. it's a view of a
portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.
Unless the slice is at the beginning of an allocated block... -Steve
Aug 23 2008
parent Sean Kelly <sean invisibleduck.org> writes:
Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 == Quote from superdan (super dan.org)'s article
 Jerry Quinn Wrote:
 dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used push_back() 
 heavily to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code 
 is now
 actually faster than the C++ code, but the inner loops are less 
 readable.  For a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not 
 include the

 Other than the 4 bytes of extra overhead, does anyone see any 
 downside to
 including a capacity field?  Also, is there some situation I'm not 
 thinking of
 where the 4 bytes actually matter more than the speed gains in 
 question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be
making slice operations more difficult to implement correctly.
 That said, I think I still favor it, since I find I need to do 
 iterative appends to arrays fairly often.
Otherwise, it noticably restricts the places I can use built-in arrays.
 sorry to always be the one who puts the moose on the table. y'all who 
 want to make capacity a field in
the array are off base.
 today each slice has length and data. why. because all a slice is *is* 
 length and data. it's a view of a
portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.
Unless the slice is at the beginning of an allocated block...
True enough :-) I suppose this is one argument in favor of a capacity field. Sean
Aug 25 2008
prev sibling parent reply superdan <super dan.org> writes:
Sean Kelly Wrote:

 == Quote from superdan (super dan.org)'s article
 Jerry Quinn Wrote:
 dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used push_back() heavily
to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable.  For
a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not include the

 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be
making slice operations more difficult to implement correctly.
 That said, I think I still favor it, since I find I need to do iterative
appends to arrays fairly often.
Otherwise, it noticably restricts the places I can use built-in arrays.
 sorry to always be the one who puts the moose on the table. y'all who want to
make capacity a field in
the array are off base.
 today each slice has length and data. why. because all a slice is *is* length
and data. it's a view of a
portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.
then we have a field hanging in there that's only valid sometimes. that won't quite work. will it be copyable? like when you copy a slice to another. what happens with the capacity field?

at any rate. i find it amused that aside from u nobody even missed a beat on the discussion on adding the capacity. this is pulp fiction alright.

a: "how about them capacity fields. what if we add them."
b: "would be cool."
c: "would suck goat balls."
d: "capacity don't make sense because this and that and the other."
a, b, c: "yeah, but we're still contemplating the ifs."

wake up people. there is first the problem capacity can't be implemented. u gotta figure out some approximation first. then discuss the gorram ifs.
Aug 24 2008
next sibling parent Christopher Wright <dhasenan gmail.com> writes:
superdan wrote:
 wake up people. there is first the problem capacity can't be implemented. u
gotta figure out some approximation first. then discuss the gorram ifs.
Capacity could be implemented by preallocating extra space at the end of an array, but aside from that, you're right; the garbage collector could use memory between the end of the array and the end of its capacity after the array was created, and there's no way you could update all the references to that array and change the stored capacity. As for slices, the capacity of a slice would be equal to its length, so that isn't an issue.
Aug 24 2008
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
superdan wrote:
 Sean Kelly Wrote:
 
 == Quote from superdan (super dan.org)'s article
 Jerry Quinn Wrote:
 dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used push_back() heavily
to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable.  For
a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not include the

 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be
making slice operations more difficult to implement correctly.
 That said, I think I still favor it, since I find I need to do iterative
appends to arrays fairly often.
Otherwise, it noticably restricts the places I can use built-in arrays.
 sorry to always be the one who puts the moose on the table. y'all who want to
make capacity a field in
the array are off base.
 today each slice has length and data. why. because all a slice is *is* length
and data. it's a view of a
portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.
then we have a field hanging in there that's only valid sometimes. that won't quite work. will it be copyable? like when you copy a slice to another. what happens with the capacity field?
That's the weird bit. The capacity of any copy of a slice or an array should be the length of the copy, so copying couldn't be done via memcpy on the array ref. More simply, you could probably just default the capacity to zero, but I don't like the idea of violating the logical invariant for array refs just to make copying / initialization easier.

And the weirdness of this suggests to me (and to you as well I presume) that perhaps adding a capacity field isn't proper for built-in arrays.

Sean
Aug 25 2008
parent "Denis Koroskin" <2korden gmail.com> writes:
On Mon, 25 Aug 2008 22:25:42 +0400, Sean Kelly <sean invisibleduck.org>  
wrote:

 superdan wrote:
 Sean Kelly Wrote:

 == Quote from superdan (super dan.org)'s article
 Jerry Quinn Wrote:
 dsimcha Wrote:

 I just ran into this yesterday porting C++ code that used  
 push_back() heavily to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code  
 is now
 actually faster than the C++ code, but the inner loops are less  
 readable.  For a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not  
 include the

 Other than the 4 bytes of extra overhead, does anyone see any  
 downside to
 including a capacity field?  Also, is there some situation I'm not  
 thinking of
 where the 4 bytes actually matter more than the speed gains in  
 question?
For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might be
making slice operations more difficult to implement correctly.
 That said, I think I still favor it, since I find I need to do  
 iterative appends to arrays fairly often.
Otherwise, it noticably restricts the places I can use built-in arrays.
 sorry to always be the one who puts the moose on the table. y'all who  
 want to make capacity a field in
the array are off base.
 today each slice has length and data. why. because all a slice is  
 *is* length and data. it's a view of a
portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.
then we have a field hanging in there that's only valid sometimes. that won't quite work. will it be copyable? like when you copy a slice to another. what happens with the capacity field?
That's the weird bit. The capacity of any copy of a slice or an array should be the length of the copy, so copying couldn't be done via memcpy on the array ref. More simply, you could probably just default the capacity to zero, but I don't like the idea of violating the logical invariant for array refs just to make copying / initialization easier. And the weirdness of this suggests to me (and to you as well I presume) that perhaps adding a capacity field isn't proper for built-in arrays. Sean
why they both have StringBuilders (which has a capacity field). The same can be applied to other array types, a library ArrayBuilder class is a good solution, I think.
Aug 25 2008
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"dsimcha" wrote
I just ran into this yesterday porting C++ code that used push_back() 
heavily to
 D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable. 
 For a
 whopping overhead of 4 extra bytes per array (on 32-bit), why not include 
 the
 capacity field in the array itself?  I think the problem here is the need 
 to call
 gc.capacity(), which can require a lock, as Sean said.  If you have the 
 capacity
 field embedded in the array itself, you never need a lock for ~= unless 
 the array
 actually needs to be realloced or is appended to by more than one thread. 
 I've
 written classes before that need fast appending and therefore cache the 
 capacity
 value every time the array is reallocated, and it makes it a ton faster 
 even in
 single threaded code.  On the other hand, this breaks the ABI for D 
 arrays, but D2
 breaks a lot of backwards compatibility anyhow, so maybe this is still a 
 good
 idea.  Does anyone actually write C functions that rely on this ABI and 
 then call
 them from D?  Furthermore, having a capacity field would also allow 
 reserve for
 builtin arrays.

 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not 
 thinking of
 where the 4 bytes actually matter more than the speed gains in question?
I see nothing wrong with having a library solution to this. D, and especially D2, is at the point where the syntax of custom structs is really easy to use.

The problem is that many times I don't append to an array or slice; why should I have to accept the cost of an extra 4-8 byte field for every slice and array that isn't going to change size (which I would argue is the more common case)?

The only thing missing right now is opImplicitCast (to go back and forth from a normal array).

-Steve
Aug 23 2008
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
 I see nothing wrong with having a library solution to this.  D and
 especially D2 is to the point where syntax of custom structs is really easy
 to use.
 The problem is that many times I don't append to an array or slice, why
 should I have to accept the cost of an extra 4-8 byte field for every slice
 and array that isn't going to change size (which I would argue is used more
 often)?
 The only thing missing right now is opImplicitCast (to go back and forth
 from a normal array).
 -Steve
You do have a point there. Builtin arrays work well for the typical cases, but no one array type could work well for every conceivable specialized case, especially if you count the case of needing minimal overhead as one possible specialized case. I've wanted to write my own specialized array classes to deal with these unusual cases for a while, but the lack of opImplicitCast is the only thing that prevents me from doing so, since I would want them to be able to play nice with builtin arrays.

A little off topic, but somewhat relevant: I wrote a scope array class for when you need to allocate a large temporary array for something like a mathematical computation or a merge sort, without constantly triggering GC or running into false pointer issues. It bypasses the GC, calls C's malloc, realloc and free, and, since it's a scope class, the memory is freed by the dtor.

Does anyone know if DMD's std.c.stdlib.malloc/free/realloc are threadsafe?
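Such a class might look roughly like this (a hypothetical sketch, not the actual class mentioned above):

import std.c.stdlib: malloc, realloc, free;

// Declared "scope", so instances must be stack-allocated and the destructor
// runs deterministically when the enclosing scope ends.
scope class TempArray(T)
{
    T[] data;

    this(size_t n)
    {
        // C heap, not the GC: no collection pauses, no false-pointer pinning
        data = (cast(T*) malloc(n * T.sizeof))[0 .. n];
    }

    void resize(size_t n)
    {
        data = (cast(T*) realloc(data.ptr, n * T.sizeof))[0 .. n];
    }

    ~this()
    {
        free(data.ptr);
    }
}

// Usage: scratch space for one pass of a computation.
// void compute(int[] input)
// {
//     scope tmp = new TempArray!(int)(input.length);
//     // ... use tmp.data ...
// }   // memory returned to the C heap here by the dtor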
Aug 23 2008
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Sat, Aug 23, 2008 at 05:50:08PM +0000, dsimcha wrote:
 Does anyone know if DMD's std.c.stdlib.malloc/free/realloc are threadsafe?
I wouldn't rely on it. Those are just interfaces to the platform's native C library, which may or may not be thread safe - the C spec doesn't say anything about threads at all (if I remember correctly).

--
Adam D. Ruppe
http://arsdnet.net
Aug 23 2008
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Adam D. Ruppe" wrote
 On Sat, Aug 23, 2008 at 05:50:08PM +0000, dsimcha wrote:
 Does anyone know if DMD's std.c.stdlib.malloc/free/realloc are 
 threadsafe?
I wouldn't rely on it. Those are just interfaces to the platform's native C library, which may or may not be thread safe - the C spec doesn't say anything about threads at all (if I remember correctly).
Although the spec may not say it, any modern OS has a thread safe malloc. You should have no problems. -Steve
Aug 23 2008
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:
 I see nothing wrong with having a library solution to this. [...]
 The problem is that many times I don't append to an array or slice, why 
 should I have to accept the cost of an extra 4-8 byte field for every slice 
 and array that isn't going to change size (which I would argue is used more 
 often)?
Probably a library solution makes you use even more memory (when you want to use it), and makes dynamic arrays less handy to use in D (but it may be better than nothing).

Array append is an operation rather common in higher-level programs; for example, in Python it's one of the most common operations, while the up-front creation of empty arrays is quite uncommon. I don't know how common push_back is in C++ programs. (In successive benchmarks I have seen that array append of integers with Psyco is about as fast as appending integers in D. A difference is that in Python integers are objects, around 20 bytes long.)

My idea was to do this change in a D branch, and see how programs like yours behave in the modified version: if your code at runtime has to use 100-200 bytes more than before, then I think this price may be accepted on modern PCs.

On the other hand, a large number of such size_t fields may be wasted when you manage tons of short strings, or when you manage many smallish nD arrays. That problem with nD arrays can be solved with a library that manages rectangular nD arrays, which avoids wasting space for the lengths of each row/etc. This means that another solution is to do the opposite thing: instead of adding library support for efficiently extensible arrays, such support may be for dynamically allocated fixed-size arrays... :-)

Bye,
bearophile
Aug 23 2008
prev sibling next sibling parent reply "Bill Baxter" <wbaxter gmail.com> writes:
On Sun, Aug 24, 2008 at 2:16 AM, Steven Schveighoffer
<schveiguy yahoo.com> wrote:
 "dsimcha" wrote
 I see nothing wrong with having a library solution to this.  D and
 especially D2 is to the point where syntax of custom structs is really easy
 to use.

 The problem is that many times I don't append to an array or slice, why
 should I have to accept the cost of an extra 4-8 byte field for every slice
 and array that isn't going to change size (which I would argue is used more
 often)?
Personally I'd argue that appending efficiently is important to more programs than saving 4-8 bytes here and there. So that would suggest that lean-and-mean fixed-sized arrays should be the library solution, and arrays with a capacity field should be the built-in type.

--bb
Aug 23 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Bill Baxter" wrote
 On Sun, Aug 24, 2008 at 2:16 AM, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:
 "dsimcha" wrote
 I see nothing wrong with having a library solution to this.  D and
 especially D2 is to the point where syntax of custom structs is really 
 easy
 to use.

 The problem is that many times I don't append to an array or slice, why
 should I have to accept the cost of an extra 4-8 byte field for every 
 slice
 and array that isn't going to change size (which I would argue is used 
 more
 often)?
Personally I'd argue that appending efficiently is important to more programs than than saving 4-8 bytes here and there. So that would suggest that lean-and-mean fixed-sized arrays should be the library solution, and arrays with a capacity field should be the built-in type.
Making all arrays (even ones that you create temporarily and never append to) have extra data to copy will add up performance-wise. I think there are only a few spots in any of my code where I use the append operator. So 'important' it is, but common, I would argue it is not. Of course, I only present anecdotal evidence :)

There's also the advantage of backwards compatibility, and not having to change the D spec.

-Steve
Aug 23 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:
 Making all arrays (even ones that you create temporarily and never append 
 to) have extra data to copy will add up performance-wise.
Benchmarks of new features help estimate such performance tradeoffs in real programs.
 I think there are 
 only a few spots in any of my code that I use the append operator.  So 
 'important' it is, but common, I would argue it is not.  Of course, I only 
 present anectodal evidence :)
In the Python world they solve this problem by taking a look at real code. So some projects in the dsource repository can be mined to estimate how often appending is used in real programs.

Note that a standard library is meant to be very efficient, because people use it as a source of building blocks to write normal programs, so I presume Tango/Phobos will have fewer array appends than average user-level code, so they may be excluded from such a database.
 There's also the advantage of backwards compatibility, and not having to 
 change the D spec.
It's probably meant as a change for D 2 only. Bye, bearophile
Aug 23 2008
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Bill Baxter (wbaxter gmail.com)'s article
 Personally I'd argue that appending efficiently is important to more
 programs than than saving 4-8 bytes here and there.  So that would
 suggest that lean-and-mean fixed-sized arrays should be the library
 solution, and arrays with a capacity field should be the built-in
 type.
 --bb
Definitely a reasonable argument, but just to play devil's advocate, let's assume that opImplicitCast is coming eventually. Most of the library implementations of arrays will presumably be implicitly castable to builtins, and library functions, especially standard lib functions, will probably use the builtin type as arguments. Implicit casting to builtin, in my view, is essential to using library implementations of arrays, because otherwise you have a mess of a zillion different incompatible array library implementations a la C++. If you have to create a dummy capacity field every time you implicitly cast your array to a builtin anyhow, that might defeat the purpose of a lightweight library implementation at times.
Aug 23 2008
prev sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sat, 23 Aug 2008 13:16:26 -0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:
 The problem is that many times I don't append to an array or slice, why
 should I have to accept the cost of an extra 4-8 byte field for every  
 slice
 and array that isn't going to change size (which I would argue is used  
 more
 often)?
I've written some micro benchmarks to test out the difference between an 8-byte struct (i.e. the current arrays) and a 16-byte struct (ptr, length, capacity, stride) with respect to function call performance. I wrote 2 benchmarks: 1) an actual function call (which calculated the exponential) and 2) an optimized struct copy (i.e. simulating pass by value).

In release -o or debug mode, (1) had performance decreases of -5.1%, 9.8% and 1.8% for passing 1, 2 and 3 structs respectively. When -inline is enabled this changes to 0.5%, 2.5% and 5% respectively.

For the (2) benchmark, I looked at using mmx and sse2 instructions to accelerate the struct copy. Using sse2 instructions provided a 15% performance gain using unaligned structs and an 84% gain using aligned structs. 16-byte SSE2 copies beat single 8-byte MMX copies by 9% (both aligned), though dual MMX (i.e. a 16-byte copy) still beat the built-in copy by 61% (again aligned). (All in -release -o -inline, on a Core2 CPU)

So the performance cost, even for very small workloads, looks to be pretty minor, and with some work performance could actually increase. (I'm assuming the use of sse/mmx/AltiVec for memcopy acceleration would be enabled by a compiler flag on 32-bit)
Aug 24 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Robert Jacques:
 I've written some micro benchmarks to test out the difference between an  
 8-byte struct (i.e. the current arrays) and a 16-byte struct (ptr, length,  
 capacity, stride)
You may want to address some of the things superdan has said.

And, why the stride? I presume it's useful if you want a 2D matrix. It may be useful, but maybe it's overkill, so you may want to benchmark just (ptr, length, capacity) too.

Another useful benchmark is to test a program that manages tons of short strings too, like this one:
http://en.wikipedia.org/wiki/D_language#Example_2

Bye,
bearophile
Aug 24 2008
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 24 Aug 2008 22:14:36 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:
 And, why the stride? I presume it's useful if you want a 2D matrix.
Yes, stride is a key enabler for integrating 2D/3D/nD matrices with built-in arrays. Other pros (a small sketch of such a strided view follows at the end of this message):
- Fast O(1) reversing of an array. Looping through an array in reverse was a common enough operation to create an additional keyword and language feature (foreach_reverse) specifically for it.
- Slicing out a struct element, i.e. slice the red channel out of an image or the first name from an Employee array.
- Accessing every Nth element. This is a common MATLAB operation and often used in visualization (i.e. a quick and simple resolution adjustment for a line plot). Also, some parallel algorithms benefit from interleaving data. This is a fairly standard multi-threading technique, particularly when processing data[x] may take longer than processing data[y] (though it's not the first thing I'd try).

And lastly, SSE works on 16-byte structs, so why not use the extra 4 bytes?
 It may be useful, but maybe it's overkill, so you may want to benchmark  
 just (ptr, length, capacity) too.
For a 12-byte struct, either {ptr, length, capacity} or {ptr, length, stride}, the performance decrease for -release -o -inline was -0.7%, 0.3% and 0.2% for 1, 2 and 3 arrays respectively and -0%, 8% and 0.8% respectively for debug.
 Another useful benchmark is to test a program that manages tons of short  
 strings too, like this one:  
 http://en.wikipedia.org/wiki/D_language#Example_2
Okay, looking just at the main loop (writefln is really slow) the performance cost of the extra 4-bytes for a capacity field is about 1.7% (naive cost, i.e no SSE, etc). I didn't actually implement a capacity field, so this is a worst case.
 You may want to address some of the things superdan has said.
Superdan raised the good point that if the GC is aware of the array length at creation, it might compact the capacity to allocate something else. To the best of my knowledge (which isn't saying much) both the Phobos and Tango GC are block based, and so are unaware of the array length. Also, Sean Kelly wrote the Tango GC and isn't raising any red flags, which is probably a good sign it's a valid optimization.
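The strided view mentioned earlier in this message could be sketched like this (hypothetical, for illustration only):

struct Strided(T)
{
    byte*     ptr;     // address of the first element
    size_t    length;  // number of elements visible through the view
    ptrdiff_t stride;  // distance between consecutive elements, in bytes (may be negative)

    T opIndex(size_t i)
    {
        return *cast(T*)(ptr + cast(ptrdiff_t) i * stride);
    }
}

// Every 3rd element of an int[] a:
//   s.ptr = cast(byte*) a.ptr;  s.length = (a.length + 2) / 3;  s.stride = 3 * int.sizeof;
// The same array reversed, in O(1) and without copying:
//   r.ptr = cast(byte*) (a.ptr + a.length - 1);  r.length = a.length;
//   r.stride = -cast(ptrdiff_t) int.sizeof;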
Aug 24 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
Robert Jacques wrote:
 On Sun, 24 Aug 2008 22:14:36 -0400, bearophile 
 <bearophileHUGS lycos.com> wrote:
 And, why the stride? I presume it's useful if you want a 2D matrix.
Yes, stride is a key enabler for integrating 2D/3D/nD matrices with built-in arrays. Other pros: - Fast O(1) reversing of an array. Looping through an array in reverse was a common enough operation to create an additional keyword and language feature (foreach_reverse) specifically for it. - Slicing out a struct element. i.e. slice the red channel out of an image or the first name from an Employee array. - Accessing every Nth element. This is a common MATLAB operation and often used in visualization (i.e. a quick and simple resolution adjustment for a line plot). Also, some parallel algorithms benefit from interleaving data. This is a fairly standard multi-threading technique, particularly when processing data[x] may take longer than processing data[y] (though it's not the first thing I'd try).
That's very convincing. Seems like stride ought to be in there.

But it also got me thinking about what other 4-byte value could be stuffed into that extra space (satisfying the 16-byte alignment requirement), and I can't help but yearn for hashcode memoization.

Strings are, arguably, the most common types of arrays in most programs. And strings are, arguably, the most common key type in associative arrays. And every time you perform an associative array lookup, you have to re-compute the hashcode (which is silly for immutable strings, since the hashcode can't ever change).

In languages where Strings are objects (please!!! please, Walter!!!) you can do things like memoize the hashcode to avoid redundant future calculations. If Strings are mutable, you can mark a "dirty" bit when the string is changed, lazily recalculating a memoized hashcode based on that dirty bit. But only if you have a field for memoizing the hashcode in the first place.

So, while a stride field seems very useful for some very interesting special cases, to me it seems like the more compelling general case is for a memoized hashcode.

--benji
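The memoization described above could be sketched like this (hypothetical; written as a class purely for illustration, since D strings are plain arrays today):

class MutableString
{
    private char[] data;
    private hash_t cachedHash;
    private bool   dirty = true;   // no hash cached yet

    this(char[] s) { data = s.dup; }

    void opIndexAssign(char c, size_t i)
    {
        data[i] = c;
        dirty = true;              // any write invalidates the cached hash
    }

    override hash_t toHash()
    {
        if (dirty)
        {
            cachedHash = typeid(char[]).getHash(&data);
            dirty = false;
        }
        return cachedHash;
    }
}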
Aug 25 2008
next sibling parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 Robert Jacques wrote:
 On Sun, 24 Aug 2008 22:14:36 -0400, bearophile 
 <bearophileHUGS lycos.com> wrote:
 And, why the stride? I presume it's useful if you want a 2D matrix.
Yes, stride is a key enabler for integrating 2D/3D/nD matrices with built-in arrays. Other pros: - Fast O(1) reversing of an array. Looping through an array in reverse was a common enough operation to create an additional keyword and language feature (foreach_reverse) specifically for it. - Slicing out a struct element. i.e. slice the red channel out of an image or the first name from an Employee array. - Accessing every Nth element. This is a common MATLAB operation and often used in visualization (i.e. a quick and simple resolution adjustment for a line plot). Also, some parallel algorithms benefit from interleaving data. This is a fairly standard multi-threading technique, particularly when processing data[x] may take longer than processing data[y] (though it's not the first thing I'd try).
That's very convincing. Seems like stride ought to be in there. But it also got me thinking about what other 4-byte value could be stuffed into that extra space (satisfying the 16-bit alignment requirement), and I can't help but yearn for hashcode memoization. Strings are, arguably, the most common types of arrays in most programs. And strings are, arguably the most common key-type in associative arrays. And every time you perform an associative array lookup, you have to re-compute the hashcode (which is silly for immutable strings, since the hashcode can't ever change). In languages where Strings are objects (please!!! please, Walter!!!)
ok, i'll bite. what exactly are you so desperately asking for?
Aug 25 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
superdan wrote:
 In languages where Strings are objects (please!!! please, Walter!!!)
ok, i'll bite. what exactly are you so desperately asking for?
For strings to be implemented as objects, rather than as arrays of characters. --benji
Aug 25 2008
parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 superdan wrote:
 In languages where Strings are objects (please!!! please, Walter!!!)
ok, i'll bite. what exactly are you so desperately asking for?
For strings to be implemented as objects, rather than as arrays of characters. --benji
ok i assume u mean class objects. because current arrays are objects. only not of class type. and what advantages would that have. let me tell u of one disadvantage fer starters. with builtin arrays of characters you can efficiently implement classes. with classes you can't efficiently implement arrays of characters.
Aug 25 2008
parent Benji Smith <dlanguage benjismith.net> writes:
superdan wrote:
 Benji Smith Wrote:
 
 superdan wrote:
 In languages where Strings are objects (please!!! please, Walter!!!)
ok, i'll bite. what exactly are you so desperately asking for?
For strings to be implemented as objects, rather than as arrays of characters. --benji
ok i assume u mean class objects. because current arrays are objects. only not of class type. and what advantages would that have. let me tell u of one disadvantage fer starters. with builtin arrays of characters you can efficiently implement classes. with classes you can't efficiently implement arrays of characters.
I'll start a new thread, so as not to hijack this one further. --benji
Aug 25 2008
prev sibling next sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 25 Aug 2008 11:41:51 -0400, Benji Smith <dlanguage benjismith.net>  
wrote:

 I can't help but yearn for hashcode memoization.

 Strings are, arguably, the most common types of arrays in most programs.  
   And strings are, arguably the most common key-type in associative  
 arrays. And every time you perform an associative array lookup, you have  
 to re-compute the hashcode (which is silly for immutable strings, since  
 the hashcode can't ever change).

 In languages where Strings are objects (please!!! please, Walter!!!) you  
 can do things like memoize the hashcode to avoid redundant future  
 calculations. If Strings are mutable, you can mark a "dirty" bit when  
 the string is changed, lazily recalculating a memoized hashcode based on  
 that dirty bit. But only if you have a field for memoizing the hashcode  
 in the first place.
This only works if strings are objects, as there's an aliasing issue, so it won't work for D arrays. Consider:

string a = "foo";
string b = a;
int[string] counts;
counts[b] = 1;
a[0] = 'b';   // b's dirty bit isn't set
counts[b]++;  // calls counts["foo"]++ and not counts["boo"]++
 So, while a stride field seems very useful for some very interesting  
 special cases,
Well, I would argue that games, scientific computing, throughput computing and (lightweight) databases aren't special cases.
Aug 25 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Robert Jacques:
 This only works if strings are objects as there's an aliasing issue, so it  
 won't work for D arrays.
Well, D 2.x strings (and their slices) are immutable, so there is no problem. I agree that for mutable data structures (that can be modified through references) storing the hash value is dangerous.

A compromise may be to have mutable arrays but immutable slices; that may solve the problem you show. Slices are small, so having them immutable is probably not too bad.

Bye,
bearophile
Aug 25 2008
parent reply "Denis Koroskin" <2korden gmail.com> writes:
On Mon, 25 Aug 2008 21:51:33 +0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 Robert Jacques:
 This only works if strings are objects as there's an aliasing issue, so  
 it
 won't work for D arrays.
Well, D 2.x strings (and their slices) are immutable, so there is no problem. I agree that for mutable data structures (that can be modified from references) storing the hash value is dangerous. A compromise may be to have mutable arrays but immutable slices, that may solve the problem you show. Slices are small, so having them immutable is probably not too much bad. Bye, bearophile
I don't follow you. Immutability of the slice implies that the whole array is immutable, too. Besides, how often do you take a hash of an array other than string?
Aug 25 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Denis Koroskin:
 I don't follow you.
After posting my message I have realized that my expressed thoughts are wrong, I am sorry. The topic looks simple, but it is not for me. This newsgroup is useful to learn intellectual humility better.
Immutability of the slice implies that the whole array is immutable, too.<
Well, only if immutability is transitive :-)

Storing the hash value for a mutable data structure that can have references is dangerous; better to avoid doing it. But if a data structure is transitively immutable then storing its hash value may be positive if you use a lot of hashing (the hash value can be computed lazily only when required; a value of 0 may mean not computed yet. Python uses -1 for that purpose). Finding some compromise between those two extremes seems tricky to me.

D 2.0 strings are immutable, so storing their hash value may be positive. The presence of such a stored hash value may even be activated or not by a global compilation constant given to the compiler :o)
 Besides, how often do you take a hash of an array other than string?
In D I have never done it, so I presume not often. (In Python I use the hash value of immutable arrays often, because they are more handy than D structs.) Simple related note: capacity of the array must not be used to compute its hash value. Bye, bearophile
Aug 25 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Benji Smith:
 So, while a stride field seems very useful for some very interesting 
 special cases, to me it seems like the more compelling general case is 
 for a memoized hashcode.
Rectangular 2D arrays enjoy a stride as 4th field, while (usually 1D) strings enjoy having a hash value there. The two cases are very distinct, with rare mixed cases. So an ugly idea: if the 4th array is 1D the field can be the hash value, if it's 2D it can contain the stride ;-) (But if superdan is right, all this discussion may need to be re-built on different basis). On the other hand, the capacity field isn't much useful for 2D arrays (because you usually append to 1D arrays), so you end with just 3 fields with 2D arrays and 4 fields for 1D arrays :-) So, if you want to keep both structures of len 4 (assuming pointers and size_t having the same size) you can add a second stride, useful for 3D arrays too ;-) Bye, bearophile
Aug 25 2008
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 25 Aug 2008 12:51:30 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:
 Rectangular 2D arrays enjoy a stride as 4th field, while (usually 1D)  
 strings enjoy having a hash value there. The two cases are very  
 distinct, with rare mixed cases.
 So an ugly idea: if the 4th array is 1D the field can be the hash value,  
 if it's 2D it can contain the stride ;-) (But if superdan is right, all  
 this discussion may need to be re-built on different basis).
 On the other hand, the capacity field isn't much useful for 2D arrays  
 (because you usually append to 1D arrays), so you end with just 3 fields  
 with 2D arrays and 4 fields for 1D arrays :-) So, if you want to keep  
 both structures of len 4 (assuming pointers and size_t having the same  
 size) you can add a second stride, useful for 3D arrays too ;-)
I was referring to 1D arrays with strides. For 2D/3D/ND arrays you need a length and a stride per dimension in order to support array slices, interfacing to both C and/or Fortran code bases, etc. So an ND array looks like:

    struct { T* ptr; size_t[N] lengths; ptrdiff_t[N] byte_strides; }

However, regarding your idea, as caching the hash makes sense for invariant arrays and capacity would mostly be used during string building, you could overload the field based on type. However, this means neither optimization works for const arrays, which might not be a good thing as it encourages code bloat.
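For the 2-D case, element access in such a layout is just a dot product of indices and byte strides; a hypothetical sketch:

struct Matrix(T)
{
    T*           ptr;
    size_t[2]    lengths;        // [rows, columns]
    ptrdiff_t[2] byte_strides;   // byte step to the next row, to the next column

    T opIndex(size_t i, size_t j)
    {
        // address = base + i * row_stride + j * column_stride (all in bytes)
        return *cast(T*)(cast(byte*) ptr
                         + cast(ptrdiff_t) i * byte_strides[0]
                         + cast(ptrdiff_t) j * byte_strides[1]);
    }
}
// A transpose is then just swapping lengths[0]<->lengths[1] and
// byte_strides[0]<->byte_strides[1]; no elements move.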
Aug 25 2008
parent "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 25 Aug 2008 18:45:30 -0400, Robert Jacques <sandford jhu.edu>  
wrote:

 On Mon, 25 Aug 2008 12:51:30 -0400, bearophile  
 <bearophileHUGS lycos.com> wrote:
 Rectangular 2D arrays enjoy a stride as 4th field, while (usually 1D)  
 strings enjoy having a hash value there. The two cases are very  
 distinct, with rare mixed cases.
 So an ugly idea: if the 4th array is 1D the field can be the hash  
 value, if it's 2D it can contain the stride ;-) (But if superdan is  
 right, all this discussion may need to be re-built on different basis).
 On the other hand, the capacity field isn't much useful for 2D arrays  
 (because you usually append to 1D arrays), so you end with just 3  
 fields with 2D arrays and 4 fields for 1D arrays :-) So, if you want to  
 keep both structures of len 4 (assuming pointers and size_t having the  
 same size) you can add a second stride, useful for 3D arrays too ;-)
I was refering to 1D arrays with strides. For 2D/3D/ND arrays you need a length and a stride per dimension in order to support array slices, interfacing to both C and/or Fortran code bases, etc. So a ND array looks like { T* ptr; size_t[N] lengths; ptrdiff_t[N] byte_strides; } However, regarding your idea, as caching the has makes sense for invarient arrays and capacity would mostly be used during string bulding, you could overload the field based on type. However, this means neither optimization works for const arrays, which might not be a good thing as it encorages code bloat.
Using these optimizations with const could be done with either a bit-flag (in length) or by exploiting the block nature of the GC, which leads to capacities with (at least 4, I think) zero low-order bits. P.S. byte_strides should probably go above lengths, as that gives a better memory layout.
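
A rough sketch of that tagging idea, assuming the GC really does hand out capacities whose low bits are always zero; the type and method names here are made up for illustration:

struct TaggedWord
{
    size_t raw;

    enum size_t hashFlag = 1;

    bool isHash() const { return (raw & hashFlag) != 0; }

    size_t capacity() const
    {
        assert(!isHash());
        return raw;                      // stored as-is; low bit known to be clear
    }

    size_t hash() const
    {
        assert(isHash());
        return raw >> 1;                 // shift the flag bit back out
    }

    static TaggedWord fromCapacity(size_t cap)
    {
        assert((cap & hashFlag) == 0);   // relies on GC block alignment
        return TaggedWord(cap);
    }

    static TaggedWord fromHash(size_t h)
    {
        return TaggedWord((h << 1) | hashFlag);
    }
}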
Aug 25 2008
prev sibling parent JAnderson <ask me.com> writes:
Benji Smith wrote:
 Robert Jacques wrote:
 On Sun, 24 Aug 2008 22:14:36 -0400, bearophile 
 <bearophileHUGS lycos.com> wrote:
 And, why the stride? I presume it's useful if you want a 2D matrix.
Yes, stride is a key enabler for integrating 2D/3D/nD matrices with built-in arrays. Other pros:
- Fast O(1) reversing of an array. Looping through an array in reverse was a common enough operation to create an additional keyword and language feature (foreach_reverse) specifically for it.
- Slicing out a struct element. i.e. slice the red channel out of an image or the first name from an Employee array.
- Accessing every Nth element. This is a common MATLAB operation and often used in visualization (i.e. a quick and simple resolution adjustment for a line plot). Also, some parallel algorithms benefit from interleaving data. This is a fairly standard multi-threading technique, particularly when processing data[x] may take longer than processing data[y] (though it's not the first thing I'd try).
That's very convincing. Seems like stride ought to be in there. But it also got me thinking about what other 4-byte value could be stuffed into that extra space (satisfying the 16-bit alignment requirement), and I can't help but yearn for hashcode memoization. Strings are, arguably, the most common types of arrays in most programs. And strings are, arguably the most common key-type in associative arrays. And every time you perform an associative array lookup, you have to re-compute the hashcode (which is silly for immutable strings, since the hashcode can't ever change). In languages where Strings are objects (please!!! please, Walter!!!) you can do things like memoize the hashcode to avoid redundant future calculations. If Strings are mutable, you can mark a "dirty" bit when the string is changed, lazily recalculating a memoized hashcode based on that dirty bit. But only if you have a field for memoizing the hashcode in the first place.
I've had to use a stringbuilder to get around the very slow string pooling. I think lazily evaluating it would be a good idea. -Joel
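
For illustration, a minimal user-level sketch of the memoized-hash idea from the quoted post; the hash function is just a placeholder and this is not how D's strings are actually implemented:

struct MemoString
{
    string data;                  // invariant contents, so the hash never goes stale
    private size_t cachedHash;
    private bool hashValid;

    size_t toHash()
    {
        if (!hashValid)
        {
            size_t h = 5381;      // djb2-style hash, purely illustrative
            foreach (c; data)
                h = (h * 33) ^ c;
            cachedHash = h;
            hashValid = true;     // computed once, reused for every lookup
        }
        return cachedHash;
    }
}

With immutable contents the cached value never needs invalidation; a mutable variant would add the dirty bit described in the quoted post.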
Aug 25 2008
prev sibling parent JAnderson <ask me.com> writes:
dsimcha wrote:
 I just ran into this yesterday porting C++ code that used push_back() heavily
 to D.  Ended up rewriting the inner loops to pre-allocate.  The D code is now
 actually faster than the C++ code, but the inner loops are less readable.  For
 a whopping overhead of 4 extra bytes per array (on 32-bit), why not include the
 capacity field in the array itself?  I think the problem here is the need to
 call gc.capacity(), which can require a lock, as Sean said.  If you have the
 capacity field embedded in the array itself, you never need a lock for ~= unless
 the array actually needs to be realloced or is appended to by more than one
 thread.  I've written classes before that need fast appending and therefore
 cache the capacity value every time the array is reallocated, and it makes it a
 ton faster even in single threaded code.  On the other hand, this breaks the ABI
 for D arrays, but D2 breaks a lot of backwards compatibility anyhow, so maybe
 this is still a good idea.  Does anyone actually write C functions that rely on
 this ABI and then call them from D?  Furthermore, having a capacity field would
 also allow reserve for builtin arrays.
 
 Other than the 4 bytes of extra overhead, does anyone see any downside to
 including a capacity field?  Also, is there some situation I'm not thinking of
 where the 4 bytes actually matter more than the speed gains in question?
Another solution would be to cache the capacity on the stack for the array so it's only called once. That wouldn't be quite as fast but wouldn't break the ABI. -Joel
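
As a sketch of the same idea at user level (not a change to the runtime or the ABI): an appender that tracks its own capacity, so the hot path never has to ask the GC. The names and growth policy are invented for illustration:

struct IntAppender
{
    private int[] buf;     // allocated storage; buf.length plays the role of the cached capacity
    private size_t used;   // number of elements actually appended

    void put(int x)
    {
        if (used == buf.length)                       // rare slow path: grow and re-cache
            buf.length = buf.length ? buf.length * 2 : 16;
        buf[used++] = x;                              // hot path: a plain store, no GC query
    }

    int[] data() { return buf[0 .. used]; }
}

// usage: the loop body never calls into the GC to ask for the capacity
void example(int n)
{
    IntAppender app;
    foreach (i; 0 .. n)
        app.put(i);
    assert(app.data.length == n);
}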
Aug 23 2008
prev sibling next sibling parent reply "Lionello Lunesu" <lionello lunesu.remove.com> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:g8ni6o$12k2$1 digitalmars.com...
 Note2: In the D code the final deallocation of the array is fast enough, 
 and its allocation is fast too (you can see this timing just the for loop 
 in the middle of the program). The really slow part seems to be the loop 
 itself.
If the others are right, D's probably using std.gc.capacity to constantly check the capacity of the array.

Interestingly, the Phobos GC caches the pointer/size in a call to "capacity", so that requesting the capacity of the same pointer immediately returns its capacity without any look-up. See function sizeOfNoSync, in internal/gc/gcx.d. Additionally, if there are no other threads, the GC lock is not taken, so capacity() should be O(1). It would be interesting to check whether the cached pointer/size in the GC is actually being used.

If it is not, it would result in a call to findSize() which in turn calls findPool(). findPool() is O(n), n = number of pools. The pooltable is already sorted, so this can be made O(log n). findSize() then does a linear search to check the size of the alloced block, which is O(m), m = size of block. Result: getting the capacity of a pointer is O(m+n). There's a lot of room for improvement right there.

L.
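
A sketch of the O(log n) pool lookup suggested here, assuming the pool table is kept sorted by base address; the Pool layout and names are simplified stand-ins, not the actual gcx.d types:

struct Pool
{
    ubyte* base;     // start of the pool's memory
    size_t size;     // pool size in bytes
}

// Binary search over a pool table sorted by base address.
Pool* findPool(Pool[] pooltable, void* q)
{
    auto p = cast(ubyte*) q;
    size_t lo = 0, hi = pooltable.length;
    while (lo < hi)
    {
        size_t mid = (lo + hi) / 2;
        Pool* pool = &pooltable[mid];
        if (p < pool.base)
            hi = mid;
        else if (p >= pool.base + pool.size)
            lo = mid + 1;
        else
            return pool;        // q lies inside this pool
    }
    return null;                // not a pointer into the GC heap
}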
Aug 24 2008
parent reply "Lionello Lunesu" <lionello lunesu.remove.com> writes:
Nevermind. None of that explained why bearophile's program is so slow. This 
does:

int[] ~= int is using _d_arrayappendcT (from internal/gc/gc.d), which 
appends an int by appending its four bytes! Blasphemy! Now we have great 
super-duper-efficient array operations using SSEx but a simple append of an 
int to an int[] is using a "rep movs" to append 4 bytes. X-(

Also, the append code first checks whether the array should be resized, and 
if not, jumps to the actual append code. The common case should not be 
branching: { if (too small) goto resize; append: ... return; resize: ... 
goto append; }
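
For illustration, a user-level sketch of that layout: the common case falls straight through to the store, and only the rare grow case branches away. This is ordinary D, not the runtime's _d_arrayappendcT, and the helper name and growth policy are invented here:

void appendInt(ref int[] a, ref size_t cap, int x)
{
    auto n = a.length;
    if (n >= cap)
        goto Lgrow;                    // rare path, kept out of the hot sequence
Lappend:
    a = a.ptr[0 .. n + 1];             // extend the slice in place
    a[n] = x;                          // one 4-byte store, no "rep movs"
    return;

Lgrow:
    auto bigger = new int[cap ? cap * 2 : 16];
    bigger[0 .. n] = a[];              // copy the existing elements
    a = bigger[0 .. n];
    cap = bigger.length;
    goto Lappend;
}

The caller keeps cap next to the slice, which is exactly the per-array capacity field being debated in this thread.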

The use of _d_arrayappendcT is hardcoded in the D compiler (e2ir.c) so I 
don't know how to create different code paths for different primitive 
(sizes) other than to add many ifs to _d_arrayappendcT. IIRC, TypeInfo 
contained a pointer to a method for fast array comparisons. Shouldn't a 
similar thing be added for array appends?

BTW, the capacity look-up was indeed using the cached values. Meaning, it's 
not "capacity" we should be focussing on, but the general append code. 
However, if you were to append to two different arrays in a similar loop, 
the performance would be terrible, since each array would overwrite the 
cache with its own ptr/size, resulting in the slow O(m+n) for each capacity 
look-up.

L. 
Aug 24 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Lionello Lunesu:
However, if you were to append to two different arrays in a similar loop, the
performance would be terrible, since each array would overwrite the cache with
its own ptr/size, resulting in the slow O(m+n) for each capacity look-up.<
Yes, there's something that deserves some improvements, probably at various levels.

I have modified the benchmarks in the way you have explained (see below). With n = 4 million the running time of the C++ version is barely measurable (0.04 s), while the D version takes 32.2 s (I didn't have enough patience to see it finish with n = 10 million).

Bye,
bearophile

------------------
C++ code:

#include "stdio.h"
#include <vector>

using namespace std;

int main(int argc, char **argv) {
    int n = argc >= 2 ? atoi(argv[1]) : 1000;

    vector<int> a1;
    a1.reserve(n);
    vector<int> a2;
    a2.reserve(n);

    for (int i = 0; i < n; ++i) {
        a1.push_back(i);
        a2.push_back(i);
    }

    printf("%d %d\n", a1[n - 1], a2[n - 1]);

    return 0;
}

------------------
D code:

import std.conv: toInt;
import std.gc: malloc, hasNoPointers;

int main(string[] args) {
    int n = args.length >= 2 ? toInt(args[1]) : 1000;

    int* aux_ptr1 = cast(int*)malloc(n * int.sizeof);
    hasNoPointers(aux_ptr1);
    int[] a1 = aux_ptr1[0 .. 0];

    int* aux_ptr2 = cast(int*)malloc(n * int.sizeof);
    hasNoPointers(aux_ptr2);
    int[] a2 = aux_ptr2[0 .. 0];

    for (int i; i < n; i++) {
        a1 ~= i;
        a2 ~= i;
    }

    printf("%d %d\n", a1[n - 1], a2[n - 1]);
    return 0;
}
Aug 25 2008
prev sibling next sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 25 Aug 2008 01:41:51 -0400, Lionello Lunesu  
<lionello lunesu.remove.com> wrote:

 BTW, the capacity look-up was indeed using the cached values. Meaning,  
 it's not "capacity" we should be focussing on, but the general append  
 code. However, if you were to append to two different arrays in a  
 similar loop, the performance would be terrible, since each array would  
 overwrite the cache with its own ptr/size, resulting in the slow O(m+n)  
 for each capacity look-up.
Such as in this case: http://en.wikipedia.org/wiki/D_language#Example_2
Aug 25 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Robert Jacques:
 Such as in this case: http://en.wikipedia.org/wiki/D_language#Example_2
For the people interested, here are few slides written by the good Hettinger that show how dict/set/lists work in Python (it shows how the array append too works): http://www.pycon.it/static/pycon2/slides/containers.ppt Bye, bearophile
Aug 25 2008
prev sibling parent JAnderson <ask me.com> writes:
Lionello Lunesu wrote:
 Nevermind. None of that explained why bearophile's program is so slow. 
 This does:
 
 int[] ~= int is using _d_arrayappendcT (from internal/gc/gc.d), which 
 appends an int by appending its four bytes! Blasphemy! Now we have great 
 super-duper-efficient array operations using SSEx but a simple append of 
 an int to an int[] is using a "rep movs" to append 4 bytes. X-(
 
 Also, the append code first checks whether the array should be resized, 
 and if not, jumps to the actual append code. The common case should not 
 be branching: { if (too small) goto resize; append: ... return; resize: 
 ... goto append; }
 
 The use of _d_arrayappendcT is hardcoded in the D compiler (e2ir.c) so I 
 don't know how to create different code paths for different primitive 
 (sizes) other than to add many ifs to _d_arrayappendcT. IIRC, TypeInfo 
 contained a pointer to a method for fast array comparisons. Shouldn't a 
 similar thing be added for array appends?
 
 BTW, the capacity look-up was indeed using the cached values. Meaning, 
 it's not "capacity" we should be focussing on, but the general append 
 code. However, if you were to append to two different arrays in a 
 similar loop, the performance would be terrible, since each array would 
 overwrite the cache with its own ptr/size, resulting in the slow O(m+n) 
 for each capacity look-up.
 
 L.
Great stuff! -Joel
Aug 25 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
The bugzilla report on this suggests replacing []=[] with memcpy. But 
this benchmark shows this isn't a consistent win:

import std.c.string;

long timer()
{
     asm
     {   naked                   ;
         rdtsc                   ;
         ret                     ;
     }
}

void test1(byte[] a, byte[] b)
{
     a[] = b[];
}

void test2(byte[] a, byte[] b)
{
     memcpy(a.ptr, b.ptr, a.length);
}

void main()
{
     for (int i = 4; i < 100_000_000; i *= 2)
     {
         auto a = new byte[i];
         auto b = new byte[i];

         auto start = timer();
         test1(a, b);
         auto end = timer();
         auto r1 = end - start;

         start = timer();
         test2(a, b);
         end = timer();
         auto r2 = end - start;

         printf("i: %8d,\t[]=[]: %8lld,\tmemcpy: %8lld\n", i, r1, r2);
     }
}

Running this program produces:

i:        4,    []=[]:      144,        memcpy:      568
i:        8,    []=[]:      144,        memcpy:      300
i:       16,    []=[]:      172,        memcpy:      324
i:       32,    []=[]:      204,        memcpy:      344
i:       64,    []=[]:      288,        memcpy:      276
i:      128,    []=[]:      288,        memcpy:      272
i:      256,    []=[]:      352,        memcpy:      364
i:      512,    []=[]:      372,        memcpy:      424
i:     1024,    []=[]:      552,        memcpy:      564
i:     2048,    []=[]:      684,        memcpy:     1384
i:     4096,    []=[]:     1344,        memcpy:     1772
i:     8192,    []=[]:     2900,        memcpy:     3216
i:    16384,    []=[]:     5292,        memcpy:     5472
i:    32768,    []=[]:    11496,        memcpy:    10388
i:    65536,    []=[]:    29484,        memcpy:    27480
i:   131072,    []=[]:   110464,        memcpy:    67784
i:   262144,    []=[]:   655580,        memcpy:   562400
i:   524288,    []=[]:  1204124,        memcpy:  1107256
i:  1048576,    []=[]:  2364588,        memcpy:  2272552
i:  2097152,    []=[]:  4516440,        memcpy:  4417764
i:  4194304,    []=[]:  8996992,        memcpy:  8817176
i:  8388608,    []=[]: 20223908,        memcpy: 17717748
i: 16777216,    []=[]: 35774952,        memcpy: 36094652
i: 33554432,    []=[]: 71008068,        memcpy: 71246896
i: 67108864,    []=[]: 142982284,       memcpy: 145473300

There's not much of a consistent conclusion to be drawn from that.
Aug 26 2008
parent reply Lionello Lunesu <lio lunesu.remove.com> writes:
The problem is that the timing for the small arrays cannot be trusted as 
there's more overhead than actual code being tested. I've changed your 
code by doing each test 100_000_000/i times. I've also added 'cpuid' for 
synchronization, as Don suggested:

import std.c.string;

long timer()
{
     asm
     {   naked                   ;
         push ECX;
         push EBX;
         mov EAX, 0              ;
         cpuid                   ;
         pop EBX;
         pop ECX;
         rdtsc                   ;

         ret                     ;
     }
}

void test1(byte[] a, byte[] b)
{
     a[] = b[];
}

void test2(byte[] a, byte[] b)
{
     memcpy(a.ptr, b.ptr, a.length);
}

void main()
{
     for (int i = 4; i < 100_000_000; i *= 2)
     {
         auto b = new byte[i];
         auto a = new byte[i];

         test1(a,b);//pre-cache

         auto count = 100_000_000 / i;
         auto start = timer();
         while (count--)
             test1(a, b);
         auto end = timer();
         auto r1 = end - start;

         count = 100_000_000 / i;
         start = timer();
         while (count--)
             test2(a, b);
         end = timer();
         auto r2 = end - start;

         printf("i: %8d,\t[]=[]: %8lld,\tmemcpy: %8lld\n", i, r1, r2);
     }
}

Results on my Core2 Duo:

i:        4,    []=[]: 2752539660,      memcpy: 429537312
i:        8,    []=[]: 1004458383,      memcpy: 229984623
i:       16,    []=[]: 492054948,       memcpy: 128590560
i:       32,    []=[]: 309960657,       memcpy: 135892809
i:       64,    []=[]: 204971832,       memcpy: 79507359
i:      128,    []=[]: 131925789,       memcpy: 52222626
i:      256,    []=[]: 73768896,        memcpy: 40184559
i:      512,    []=[]: 43827948,        memcpy: 28040859
i:     1024,    []=[]: 29030985,        memcpy: 21078432
i:     2048,    []=[]: 21645027,        memcpy: 18953199
i:     4096,    []=[]: 18099945,        memcpy: 16022259
i:     8192,    []=[]: 15905007,        memcpy: 15399036
i:    16384,    []=[]: 15567399,        memcpy: 15795747
i:    32768,    []=[]: 32244507,        memcpy: 30654063
i:    65536,    []=[]: 30824595,        memcpy: 31286475
i:   131072,    []=[]: 32248179,        memcpy: 30950757
i:   262144,    []=[]: 34880868,        memcpy: 31166667
i:   524288,    []=[]: 31220802,        memcpy: 35813277
i:  1048576,    []=[]: 42463341,        memcpy: 42121386
i:  2097152,    []=[]: 122218776,       memcpy: 118839150
i:  4194304,    []=[]: 111864294,       memcpy: 110864988
i:  8388608,    []=[]: 110600226,       memcpy: 107638083
i: 16777216,    []=[]: 97014645,        memcpy: 95789574
i: 33554432,    []=[]: 79344468,        memcpy: 78277374
i: 67108864,    []=[]: 77437629,        memcpy: 78478767

Conclusion: memcpy wins big time for i <= 8192.
Aug 27 2008
next sibling parent Lionello Lunesu <lio lunesu.remove.com> writes:
Lionello Lunesu wrote:
 Results on my Core2 Duo:
D'oh! I should have let Don guess it! L.
Aug 27 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Lionello Lunesu:
 Conclusion: memcpy wins big time for i <= 8192.
I have modified your code a little:

import std.c.string: memcpy;

long timer() {
    asm {
        naked;
        push ECX;
        push EBX;
        mov EAX, 0;
        cpuid;
        pop EBX;
        pop ECX;
        rdtsc;
        ret;
    }
}

void main() {
    const int N = 500_000_000;
    alias ubyte DataType;

    for (int i = 4; i < N; i *= 2) {
        auto b = new DataType[i];
        auto a = new DataType[i];
        a[] = b[]; // pre-cache

        auto count = N / i;
        auto start = timer();
        while (count--)
            a[] = b[];
        auto end = timer();
        auto t1 = end - start;

        count = N / i;
        start = timer();
        while (count--)
            memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
        end = timer();
        auto t2 = end - start;

        count = N / i;
        start = timer();
        while (count--)
            for (int j; j < i; j++)
                a[j] = b[j];
        end = timer();
        auto t3 = end - start;

        count = N / i;
        start = timer();
        while (count--) {
            auto pa = a.ptr;
            auto pb = b.ptr;
            int nloop = i;
            while (nloop--)
                *pa++ = *pb++;
        }
        end = timer();
        auto t4 = end - start;

        printf("i: %8d,\t[]=[]: %8lld, \tt1/t2: %.2f \tt1/t3: %.2f \tt1/t4: %.2f\n",
               i, t1, cast(float)t1/t2, cast(float)t1/t3, cast(float)t1/t4);
    }
}

The timings:

N = 500_000_000, DataType = ubyte:
i: 4, []=[]: 9919922030, t1/t2: 4.64 t1/t3: 5.64 t1/t4: 5.27
i: 8, []=[]: 4957640530, t1/t2: 4.63 t1/t3: 3.59 t1/t4: 2.93
i: 16, []=[]: 2478916670, t1/t2: 4.10 t1/t3: 2.08 t1/t4: 1.55
i: 32, []=[]: 1537003970, t1/t2: 2.28 t1/t3: 1.40 t1/t4: 0.99
i: 64, []=[]: 1019418780, t1/t2: 2.59 t1/t3: 0.97 t1/t4: 0.67
i: 128, []=[]: 659281070, t1/t2: 2.54 t1/t3: 0.60 t1/t4: 0.58
i: 256, []=[]: 364781180, t1/t2: 1.90 t1/t3: 0.35 t1/t4: 0.34
i: 512, []=[]: 217207030, t1/t2: 1.56 t1/t3: 0.21 t1/t4: 0.21
i: 1024, []=[]: 143970290, t1/t2: 1.37 t1/t3: 0.14 t1/t4: 0.14
i: 2048, []=[]: 107283300, t1/t2: 1.21 t1/t3: 0.11 t1/t4: 0.11
i: 4096, []=[]: 88801730, t1/t2: 1.11 t1/t3: 0.08 t1/t4: 0.08
i: 8192, []=[]: 79605200, t1/t2: 1.06 t1/t3: 0.07 t1/t4: 0.07
i: 16384, []=[]: 77892400, t1/t2: 0.99 t1/t3: 0.07 t1/t4: 0.07
i: 32768, []=[]: 156926780, t1/t2: 1.03 t1/t3: 0.16 t1/t4: 0.16
i: 65536, []=[]: 154375350, t1/t2: 1.01 t1/t3: 0.15 t1/t4: 0.15
i: 131072, []=[]: 154758850, t1/t2: 1.01 t1/t3: 0.15 t1/t4: 0.15
i: 262144, []=[]: 155175500, t1/t2: 1.01 t1/t3: 0.15 t1/t4: 0.15
i: 524288, []=[]: 159647160, t1/t2: 1.01 t1/t3: 0.16 t1/t4: 0.16
i: 1048576, []=[]: 832590470, t1/t2: 1.00 t1/t3: 0.79 t1/t4: 0.79
i: 2097152, []=[]: 783795470, t1/t2: 1.00 t1/t3: 0.76 t1/t4: 0.75
i: 4194304, []=[]: 828801230, t1/t2: 1.00 t1/t3: 0.79 t1/t4: 0.78
i: 8388608, []=[]: 813813550, t1/t2: 1.00 t1/t3: 0.78 t1/t4: 0.78
i: 16777216, []=[]: 772143790, t1/t2: 1.00 t1/t3: 0.77 t1/t4: 0.76
i: 33554432, []=[]: 772579250, t1/t2: 1.00 t1/t3: 0.78 t1/t4: 0.78
i: 67108864, []=[]: 840918700, t1/t2: 1.00 t1/t3: 0.84 t1/t4: 0.83
i: 134217728, []=[]: 644072850, t1/t2: 1.00 t1/t3: 0.77 t1/t4: 0.77
i: 268435456, []=[]: 446295660, t1/t2: 0.99 t1/t3: 0.79 t1/t4: 0.78

So for DataType = ubyte:
  i = 4: T3 better
  i in [8, 32768]: T2 better
  i > 32768: T1 and T2 equally good

N = 500_000_000, DataType = uint:
i: 4, []=[]: 9789704410, t1/t2: 3.72 t1/t3: 6.00 t1/t4: 4.87
i: 8, []=[]: 6088621920, t1/t2: 2.25 t1/t3: 4.62 t1/t4: 3.46
i: 16, []=[]: 4049677930, t1/t2: 2.58 t1/t3: 3.31 t1/t4: 2.48
i: 32, []=[]: 2620962840, t1/t2: 2.53 t1/t3: 2.31 t1/t4: 1.67
i: 64, []=[]: 1451602120, t1/t2: 1.89 t1/t3: 1.39 t1/t4: 0.94
i: 128, []=[]: 867937320, t1/t2: 1.56 t1/t3: 0.79 t1/t4: 0.74
i: 256, []=[]: 575548140, t1/t2: 1.37 t1/t3: 0.55 t1/t4: 0.53
i: 512, []=[]: 428628560, t1/t2: 1.21 t1/t3: 0.42 t1/t4: 0.41
i: 1024, []=[]: 355175800, t1/t2: 1.11 t1/t3: 0.35 t1/t4: 0.35
i: 2048, []=[]: 318880440, t1/t2: 1.06 t1/t3: 0.32 t1/t4: 0.31
i: 4096, []=[]: 307818740, t1/t2: 1.02 t1/t3: 0.30 t1/t4: 0.30
i: 8192, []=[]: 628989600, t1/t2: 1.03 t1/t3: 0.62 t1/t4: 0.62
i: 16384, []=[]: 618088340, t1/t2: 1.01 t1/t3: 0.61 t1/t4: 0.61
i: 32768, []=[]: 620081160, t1/t2: 1.01 t1/t3: 0.62 t1/t4: 0.61
i: 65536, []=[]: 622162940, t1/t2: 1.01 t1/t3: 0.62 t1/t4: 0.61
i: 131072, []=[]: 633820200, t1/t2: 1.00 t1/t3: 0.61 t1/t4: 0.61
i: 262144, []=[]: 3239626180, t1/t2: 1.00 t1/t3: 1.05 t1/t4: 1.04
i: 524288, []=[]: 3145529870, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04
i: 1048576, []=[]: 3319243820, t1/t2: 1.00 t1/t3: 1.05 t1/t4: 1.05
i: 2097152, []=[]: 3292952240, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.03
i: 4194304, []=[]: 3159184090, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04
i: 8388608, []=[]: 3261583900, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04
i: 16777216, []=[]: 3462604700, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.03
i: 33554432, []=[]: 2971217250, t1/t2: 0.99 t1/t3: 1.05 t1/t4: 1.05
i: 67108864, []=[]: 3136278270, t1/t2: 0.99 t1/t3: 1.03 t1/t4: 1.03

So for DataType = uint:
  i in [4, 16]: T3 better
  i in [32, 8192]: T2 better
  i > 16384: T1 and T2 equally good

N = 500_000_000, DataType = ulong:
i: 4, []=[]: 12176650300, t1/t2: 2.26 t1/t3: 5.67 t1/t4: 5.10
i: 8, []=[]: 8095535110, t1/t2: 2.58 t1/t3: 4.39 t1/t4: 3.69
i: 16, []=[]: 5241679510, t1/t2: 2.53 t1/t3: 3.15 t1/t4: 2.49
i: 32, []=[]: 2904270580, t1/t2: 1.89 t1/t3: 1.83 t1/t4: 1.41
i: 64, []=[]: 1733200310, t1/t2: 1.56 t1/t3: 1.12 t1/t4: 0.85
i: 128, []=[]: 1149144250, t1/t2: 1.37 t1/t3: 1.02 t1/t4: 0.70
i: 256, []=[]: 856914580, t1/t2: 1.21 t1/t3: 0.80 t1/t4: 0.54
i: 512, []=[]: 711247630, t1/t2: 1.11 t1/t3: 0.69 t1/t4: 0.46
i: 1024, []=[]: 638308500, t1/t2: 1.06 t1/t3: 0.63 t1/t4: 0.42
i: 2048, []=[]: 614686480, t1/t2: 0.99 t1/t3: 0.57 t1/t4: 0.40
i: 4096, []=[]: 1254471740, t1/t2: 1.02 t1/t3: 0.70 t1/t4: 0.69
i: 8192, []=[]: 1238218710, t1/t2: 1.01 t1/t3: 0.69 t1/t4: 0.68
i: 16384, []=[]: 1240384550, t1/t2: 1.01 t1/t3: 0.69 t1/t4: 0.68
i: 32768, []=[]: 1244624520, t1/t2: 1.01 t1/t3: 0.69 t1/t4: 0.69
i: 65536, []=[]: 1272774050, t1/t2: 1.00 t1/t3: 0.69 t1/t4: 0.69
i: 131072, []=[]: 6532720530, t1/t2: 1.00 t1/t3: 1.05 t1/t4: 1.05
i: 262144, []=[]: 6315104980, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04
i: 524288, []=[]: 6663898750, t1/t2: 1.00 t1/t3: 1.05 t1/t4: 1.05
i: 1048576, []=[]: 6562860030, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04
i: 2097152, []=[]: 6322608930, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04

So for DataType = ulong:
  i in [4, 16]: T3 better
  i in [32, 4096]: T2 better
  i > 8192: T1 and T2 equally good

CPU used: Pentium Dual-Core E2180, with caches 64 KB L1 (32 KB data, 32 KB program) and 1 MB L2.

But there are ways to go much faster:
http://cdrom.amd.com/devconn/events/gdc_2002_amd.pdf
http://files.rsdn.ru/23380/AMD_block_prefetch_paper.pdf

Bye,
bearophile
Aug 27 2008
parent reply Lionello Lunesu <lio lunesu.remove.com> writes:
--snip--

I think it safe to conclude that memcpy (T2) is the fastest all-round 
solution. Only the inlined custom code can beat it.

What's more, to use memcpy only one line in gc.d needs to be changed. 
Both T3 and T4 would need a change in the compiler.

L.
Aug 27 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Lionello Lunesu:

 I think it safe to conclude that memcpy (T2) is the fastest all-round 
 solution. Only the inlined custom code can beat it.
 What's more, to use memcpy only one line in gc.d needs to be changed. 
 Both T3 and T4 would need a change in the compiler.
From more benchmarks I have seen that if you don't use ASM (with prefetching, etc.) and/or more clever code, the following is the fastest for DataTypes from ubyte to ulong:

template Range(int stop) {
    static if (stop <= 0)
        alias Tuple!() Range;
    else
        alias Tuple!(Range!(stop-1), stop-1) Range;
}

switch (i) {
    case 0: break;
    case 1: foreach (j; Range!(1)) a[j] = b[j]; break;
    case 2: foreach (j; Range!(2)) a[j] = b[j]; break;
    case 3: foreach (j; Range!(3)) a[j] = b[j]; break;
    case 4: foreach (j; Range!(4)) a[j] = b[j]; break;
    case 5: foreach (j; Range!(5)) a[j] = b[j]; break;
    case 6: foreach (j; Range!(6)) a[j] = b[j]; break;
    case 7: foreach (j; Range!(7)) a[j] = b[j]; break;
    case 8: foreach (j; Range!(8)) a[j] = b[j]; break;
    default: memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
}

A possible disadvantage may come from the fact that such code requires several instructions, so it increases the traffic in the L1 code cache.

Bye,
bearophile
Aug 27 2008
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Lionello Lunesu:
 I think it safe to conclude that memcpy (T2) is the fastest all-round
 solution. Only the inlined custom code can beat it.
 What's more, to use memcpy only one line in gc.d needs to be changed.
 Both T3 and T4 would need a change in the compiler.
From more benchmarks I have seen that if you don't use ASM (with prefetching, etc) and/or more clever code, the following
is the faster from DataType from ubyte to ulong:
 template Range(int stop) {
     static if (stop <= 0)
         alias Tuple!() Range;
     else
         alias Tuple!(Range!(stop-1), stop-1) Range;
 }
 switch (i) {
     case 0: break;
     case 1: foreach (j; Range!(1)) a[j] = b[j]; break;
     case 2: foreach (j; Range!(2)) a[j] = b[j]; break;
     case 3: foreach (j; Range!(3)) a[j] = b[j]; break;
     case 4: foreach (j; Range!(4)) a[j] = b[j]; break;
     case 5: foreach (j; Range!(5)) a[j] = b[j]; break;
     case 6: foreach (j; Range!(6)) a[j] = b[j]; break;
     case 7: foreach (j; Range!(7)) a[j] = b[j]; break;
     case 8: foreach (j; Range!(8)) a[j] = b[j]; break;
     default: memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
 }
Why not just change that to this instead:

switch(i) {
    case 8: a[7] = b[7];
    case 7: a[6] = b[6];
    case 6: a[5] = b[5];
    case 5: a[4] = b[4];
    case 4: a[3] = b[3];
    case 3: a[2] = b[2];
    case 1: a[0] = b[0];
    case 0: break;
    default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
}

The Range stuff is cool and all, but not really necessary :-)

Sean
 A possible disadvantage may come from the fact that such code requires several
instructions, so it increases the traffic in the
L1 code cache.
 Bye,
 bearophile
Aug 27 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Sean Kelly:
 Why not just change that to this instead:
 
 switch(i) {
     case 8: a[7] = b[7];
     case 7: a[6] = b[6];
     case 6: a[5] = b[5];
     case 5: a[4] = b[4];
     case 4: a[3] = b[3];
     case 3: a[2] = b[2];
     case 1: a[0] = b[0];
     case 0: break;
     default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
 }
 
 The Range stuff is cool and all, but not really necessary :-)
I have benchmarked many alternative versions, yours too, but mine is significantly faster than yours on the DMD I am using. Don't ask me why.

You can find Range!([start,] stop[, step]) in my libs, of course ;-)

Bye,
bearophile
Aug 27 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-27 17:20:55 +0200, bearophile <bearophileHUGS lycos.com> said:

 Sean Kelly:
 Why not just change that to this instead:
 
 switch(i) {
 case 8: a[7] = b[7];
 case 7: a[6] = b[6];
 case 6: a[5] = b[5];
 case 5: a[4] = b[4];
 case 4: a[3] = b[3];
 case 3: a[2] = b[2];
 case 1: a[0] = b[0];
 case 0: break;
 default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
 }
 
 The Range stuff is cool and all, but not really necessary :-)
I have benchmarked many alternative versions, your one too, but mine is significantly faster than your one on the DMD I am using. Don't ask me why.
memory ordering probably, looping from 0 to 7 is faster than 7 to 0...
 
 You can find the Range!([start,] stop[, step]) in my libs, of course ;-)
 
 Bye,
 bearophile
Aug 27 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-28 00:22:21 +0200, Fawzi Mohamed <fmohamed mac.com> said:

 On 2008-08-27 17:20:55 +0200, bearophile <bearophileHUGS lycos.com> said:
 
 Sean Kelly:
 Why not just change that to this instead:
 
 switch(i) {
 case 8: a[7] = b[7];
 case 7: a[6] = b[6];
 case 6: a[5] = b[5];
 case 5: a[4] = b[4];
 case 4: a[3] = b[3];
 case 3: a[2] = b[2];
 case 1: a[0] = b[0];
 case 0: break;
 default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
 }
 
 The Range stuff is cool and all, but not really necessary :-)
I have benchmarked many alternative versions, your one too, but mine is significantly faster than your one on the DMD I am using. Don't ask me why.
memory ordering probably, looping from 0 to 7 is faster than 7 to 0...
so

int j=j;
switch(i) {
    case 8: a[j] = b[j]; ++j;
    case 7: a[j] = b[j]; ++j;
    case 6: a[j] = b[j]; ++j;
    case 5: a[j] = b[j]; ++j;
    case 4: a[j] = b[j]; ++j;
    case 3: a[j] = b[j]; ++j;
    case 1: a[j] = b[j];
    case 0: break;
    default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
}

should be better...

Fawzi
 
 
 You can find the Range!([start,] stop[, step]) in my libs, of course ;-)
 
 Bye,
 bearophile
Aug 27 2008
parent bearophile <bearophileHUGS lycoc.com> writes:
Fawzi Mohamed:
 memory ordering probably, looping from 0 to 7 is faster than 7 to 0...
I don't think that's the problem.
 so
...
 should be better...
It is not, it seems. For reference, here is some code you can run:

import std.c.string: memcpy;

template Tuple(T...) {
    alias T Tuple;
}

long timer() {
    asm {
        naked;
        push ECX;
        push EBX;
        mov EAX, 0;
        cpuid;
        pop EBX;
        pop ECX;
        rdtsc;
        ret;
    }
}

template Range(int stop) {
    static if (stop <= 0)
        alias Tuple!() Range;
    else
        alias Tuple!(Range!(stop-1), stop-1) Range;
}

void main() {
    const int N = 500_000_000;
    alias uint DataType;

    for (int i = 4; i < N; i *= 2) {
        auto b = new DataType[i];
        auto a = new DataType[i];
        a[] = b[]; // pre-cache

        auto count = N / i;
        auto start = timer();
        while (count--)
            a[] = b[];
        auto end = timer();
        auto t1 = end - start;

        count = N / i;
        start = timer();
        while (count--)
            memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
        end = timer();
        auto t2 = end - start;

        count = N / i;
        start = timer();
        while (count--)
            for (int j; j < i; j++)
                a[j] = b[j];
        end = timer();
        auto t3 = end - start;

        count = N / i;
        start = timer();
        while (count--) {
            auto pa = a.ptr;
            auto pb = b.ptr;
            int nloop = i;
            while (nloop--)
                *pa++ = *pb++;
        }
        end = timer();
        auto t4 = end - start;

        count = N / i;
        start = timer();
        while (count--) {
            // apparently the faster version
            switch (i) {
                case 0: break;
                case 1: foreach (j; Range!(1)) a[j] = b[j]; break;
                case 2: foreach (j; Range!(2)) a[j] = b[j]; break;
                case 3: foreach (j; Range!(3)) a[j] = b[j]; break;
                case 4: foreach (j; Range!(4)) a[j] = b[j]; break;
                case 5: foreach (j; Range!(5)) a[j] = b[j]; break;
                case 6: foreach (j; Range!(6)) a[j] = b[j]; break;
                case 7: foreach (j; Range!(7)) a[j] = b[j]; break;
                case 8: foreach (j; Range!(8)) a[j] = b[j]; break;
                default: memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
            }
        }
        end = timer();
        auto t5 = end - start;

        count = N / i;
        start = timer();
        while (count--) {
            int j = i;
            switch(i) {
                case 8: a[j] = b[j]; ++j;
                case 7: a[j] = b[j]; ++j;
                case 6: a[j] = b[j]; ++j;
                case 5: a[j] = b[j]; ++j;
                case 4: a[j] = b[j]; ++j;
                case 3: a[j] = b[j]; ++j;
                case 1: a[j] = b[j];
                case 0: break;
                default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
            }
        }
        end = timer();
        auto t6 = end - start;

        printf("i: %8d,\t[]=[]: %8lld, \tt1/t2: %.2f \tt1/t3: %.2f \tt1/t4: %.2f \tt1/t5: %.2f \tt1/t6: %.2f\n",
               i, t1, cast(float)t1/t2, cast(float)t1/t3, cast(float)t1/t4, cast(float)t1/t5, cast(float)t1/t6);
    }
}

Bye,
bearophile
Aug 27 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Lionello Lunesu wrote:
 The problem is that the timing for the small arrays cannot be trusted as 
 there's more overhead than actual code being tested. I've changed your 
 code by doing each test 100_000_000/i times. I've also added 'cpuid' for 
 synchronization, as Don suggested:
I suspect that copying the same data over and over again is not representative of actual usage, and so is not giving useful results. I also do not understand the usage of cpuid here. I've used rdtsc for years for profiling purposes, and it has given me results that match the clock-on-the-wall results.
Aug 28 2008
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Essentially, I'm having a real hard time believing that for the 4 byte 
case, memcpy is faster. Take a look at the actual implementation of 
memcpy, vs the code generated by the compiler.
Aug 28 2008
parent reply "Lionello Lunesu" <lionello lunesu.remove.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:g95mh3$2e5i$2 digitalmars.com...
 Essentially, I'm having a real hard time believing that for the 4 byte 
 case, memcpy is faster. Take a look at the actual implementation of 
 memcpy, vs the code generated by the compiler.
I agree with you, but I've noticed the byte[] = byte[] emits "rep movsb" and that must be the slowest way there is to copy 4 bytes. In fact, from my demo scene experience I've learned to not use those CISC instructions like rep and movs* and loop. mov/dec/jnz always proved faster. However, I do think that it's a good idea to move the []=[] code into the run-time library. That way specialized versions can be written, like the other array operations. (Maybe it's already in the lib, but I haven't found it. I've only seen mention of OPmemcpy in e2ir.c which I assume represents the asm code for a memcpy.) L.
Aug 28 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Lionello Lunesu wrote:
 
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g95mh3$2e5i$2 digitalmars.com...
 Essentially, I'm having a real hard time believing that for the 4 byte 
 case, memcpy is faster. Take a look at the actual implementation of 
 memcpy, vs the code generated by the compiler.
I agree with you, but I've noticed the byte[] = byte[] emits "rep movsb" and that must be the slowest way there is to copy 4 bytes. In fact, from my demo scene experience I've learned to not use those CISC instructions like rep and movs* and loop. mov/dec/jnz always proved faster.
But for 4 bytes, how much slower could it be than memcpy, which executes a lot of setup code?
 However, I do think that it's a good idea to move the []=[] code into 
 the run-time library. That way specialized versions can be written, like 
 the other array operations. (Maybe it's already in the lib, but I 
 haven't found it. I've only seen mention of OPmemcpy in e2ir.c which I 
 assume represents the asm code for a memcpy.)
OPmemcpy is an operator supported by the compiler back end; it generates the movsb's.
Aug 28 2008
parent "Lionello Lunesu" <lionello lunesu.remove.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:g96sji$2s1a$1 digitalmars.com...
 But for 4 bytes, how much slower could it be than memcpy, which executes a 
 lot of setup code?
I don't know. I do know that I noticed a speed difference and that you asked me personally to do some more timings. And I've provided you with detailed timings, which prove, against popular belief, that memcpy is faster than rep movsb. Now apparently the conclusion is that A) "copying the same data over and over again is not representative of actual usage", and B) "[you're] having a real hard time believing that for the 4 byte case, memcpy is faster." I guess this subject is closed, but somehow I'm left with a somewhat unsatisfied feeling.. L.
Aug 29 2008
prev sibling parent reply Don <nospam nospam.com.au> writes:
Walter Bright wrote:
 Lionello Lunesu wrote:
 The problem is that the timing for the small arrays cannot be trusted 
 as there's more overhead than actual code being tested. I've changed 
 your code by doing each test 100_000_000/i times. I've also added 
 'cpuid' for synchronization, as Don suggested:
I suspect that copying the same data over and over again is not representative of actual usage, and so is not giving useful results. I also do not understand the usage of cpuid here. I've used rdtsc for years for profiling purposes, and it has given me results that match the clock-on-the-wall results.
Using cpuid only once doesn't work. The rationale ultimately comes from here, I think: http://cs.smu.ca/~jamuir/rdtscpm1.pdf

But you've got me thinking. So, serialisation only happens with cpuid. But, WHEN is lack of serialisation actually a problem? On my Pentium M, rdtsc takes 13 uops, all using execution port p0. This means that anything on the other ports could execute after it. The only instructions with a latency longer than 13 are div, aam, fdiv, the transcendental floating point instructions, and the bizarro instructions aam, fbld, fbstp, and cpuid. Pretty similar for Core2 and AMD64. So you may be right -- as long as you don't have one of those super-long latency instructions just before your rdtsc call, cpuid is probably doing more harm than good. The conventional wisdom could well be wrong.
Aug 28 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 Using cpuid only once doesn't work. The rationale ultimately comes from 
 here, I think:
 
 http://cs.smu.ca/~jamuir/rdtscpm1.pdf
Ah, that makes sense now. Thanks!
Aug 28 2008