www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 4412] New: Array capacity growth spikey and the ratio approaches 1.0

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4412

           Summary: Array capacity growth spikey and the ratio approaches
                    1.0
           Product: D
           Version: 2.032
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: druntime
        AssignedTo: sean invisibleduck.org
        ReportedBy: acehreli yahoo.com


--- Comment #0 from Ali Cehreli <acehreli yahoo.com> 2010-07-01 10:33:06 PDT ---
There are spikes in the array capacity growth. Additionally, the
capacity/length ratio approaches 1.0; as a result, the complexity of the append
operation would not be 'amortized constant' for large length.

import std.stdio;
import std.array;

void main()
{
    int[] a;
    size_t oldCapacity;

    foreach (i; 0 .. 100_000_000) {
        a ~= i;

        if (a.capacity != oldCapacity) {
            writeln("length: ", a.length,
                    " capacity: ", capacity(a),
                    " ratio: ", cast(double)capacity(a) / oldCapacity);
            oldCapacity = capacity(a);
        }
    }
}

Produces:

length: 1 capacity: 3 ratio: inf        please ignore this one ;)
length: 4 capacity: 7 ratio: 2.33333
length: 8 capacity: 15 ratio: 2.14286
length: 16 capacity: 31 ratio: 2.06667
length: 32 capacity: 63 ratio: 2.03226
length: 64 capacity: 127 ratio: 2.01587
length: 128 capacity: 255 ratio: 2.00787
length: 256 capacity: 509 ratio: 1.99608
length: 512 capacity: 1021 ratio: 2.00589
length: 1022 capacity: 2045 ratio: 2.00294
length: 2046 capacity: 3069 ratio: 1.50073
length: 3070 capacity: 4093 ratio: 1.33366
length: 4094 capacity: 5117 ratio: 1.25018
...
length: 1153022 capacity: 1154045 ratio: 1.00089
length: 1154046 capacity: 1155069 ratio: 1.00089
length: 1155070 capacity: 3153917 ratio: 2.7305    <-- spike
length: 3153918 capacity: 3154941 ratio: 1.00032
length: 3154942 capacity: 3155965 ratio: 1.00032
...
length: 4741118 capacity: 4742141 ratio: 1.00022
length: 4742142 capacity: 4743165 ratio: 1.00022
length: 4743166 capacity: 12333053 ratio: 2.60017  <-- spike
length: 12333054 capacity: 12334077 ratio: 1.00008
length: 12334078 capacity: 12335101 ratio: 1.00008
... (there are more spikes later on)

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 01 2010
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4412


Ali Cehreli <acehreli yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |acehreli yahoo.com


--- Comment #1 from Ali Cehreli <acehreli yahoo.com> 2010-07-01 10:34:39 PDT ---
Forgot to mention that the compiler is dmd 2.047

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 01 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4412


Steven Schveighoffer <schveiguy yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |sean invisibleduck.org
            Version|2.032                       |D2
         AssignedTo|sean invisibleduck.org      |schveiguy yahoo.com


--- Comment #2 from Steven Schveighoffer <schveiguy yahoo.com> 2010-07-01
11:25:43 PDT ---
I'll look into it.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 01 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4412


Ali Cehreli <acehreli yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |http://www.mail-archive.com
                   |                            |/digitalmars-d puremagic.co
                   |                            |m/msg33113.html


--- Comment #3 from Ali Cehreli <acehreli yahoo.com> 2010-07-02 09:16:11 PDT ---
The following is an excerpt from the newsgroup thread that's in the URL field.

 Steven Schveighoffer wrote:
  > newCapacity isn't called when setting the length explicitly, because
  > it's not considered to be an append.

 You are right.

 Looks like I was successful in locating the culprit. :) I took liberty to
write ++a.length and also outputted the number of elements that are in reserve
at each allocation:

 import std.stdio;
 import std.array;

 void main()
 {
      int[] a;
      size_t oldCapacity;

      foreach (i; 0 .. 100_000_000) {
          ++a.length;
          a[$-1] = i;

          if (capacity(a) != oldCapacity) {
              writeln("length: ", a.length,
                      " capacity: ", capacity(a),
                      " ratio: ", cast(double)capacity(a) / oldCapacity,
                      " reserve: ", capacity(a) - a.length);
              oldCapacity = capacity(a);
          }
      }
 }

 Now the spikes are gone, and the allocations asymptote at once-per-1024
elements for an int array:

 length: 1 capacity: 3 ratio: inf reserve: 2
 length: 4 capacity: 7 ratio: 2.33333 reserve: 3
 length: 8 capacity: 15 ratio: 2.14286 reserve: 7
 length: 16 capacity: 31 ratio: 2.06667 reserve: 15
 length: 32 capacity: 63 ratio: 2.03226 reserve: 31
 length: 64 capacity: 127 ratio: 2.01587 reserve: 63
 length: 128 capacity: 255 ratio: 2.00787 reserve: 127
 length: 256 capacity: 509 ratio: 1.99608 reserve: 253
 length: 512 capacity: 1021 ratio: 2.00589 reserve: 509
 length: 1022 capacity: 2045 ratio: 2.00294 reserve: 1023
 length: 2046 capacity: 3069 ratio: 1.50073 reserve: 1023
 length: 3070 capacity: 4093 ratio: 1.33366 reserve: 1023
 ...

 In the long run, there is one allocation per 1024 elements. That still feels
like amortized constant, but considering that relocation times would be O(N), I
think reserve should still grow with N, but maybe not too much. (I am not sure
about this last point at all.)
4x1024 = 4096 == page size. The newCapacity function is supposed to reduce the effect of this phenomenon. i.e., when appending, it's assumed you are going to continue appending, so to reduce the overhead, the amount added to the memory block is supposed to be related to the total number of bytes already allocated. It doesn't seem like the function is working very well... It could have been something I did, or it could have always been broken :) I honestly did not examine the code at all, I just only ever read the comments, assuming it works. Please add these comments to the bug report, it will help when I look at it. I don't have a lot of time to look at it right now, so I'm afraid I'll forget what we were doing later :) -Steve -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Jul 02 2010
prev sibling next sibling parent reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4412


bearophile_hugs eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bearophile_hugs eml.cc


--- Comment #4 from bearophile_hugs eml.cc 2010-07-02 09:31:02 PDT ---
This is a simplified version of that code:


import std.stdio: writeln;
void main() {
    int[] array;
    size_t oldCapacity;

    foreach (i; 0 .. 20_000) {
        array ~= i;
        size_t newCapacity = array.capacity();
        if (newCapacity != oldCapacity) {
            writeln(cast(double)newCapacity / oldCapacity);
            oldCapacity = newCapacity;
        }
    }
}


It prints:

inf
2.33333
2.14286
2.06667
2.03226
2.01587
2.00787
1.99608
2.00589
2.00294
1.50073
1.33366
1.25018
1.20012
1.16675
1.14292
1.12505
1.11115
1.10003
1.09093
1.08335
1.07694
1.07144
1.06668
1.06251
1.05883
1.05556
1.05264

This output shows that there is a bug: to allow an O(1) amortized append the
size of the array has to grow geometrically.

So I suggest to use a geometric growth factor of 2 until the memory block is
"large enough", something like 200MB or 300MB. And after such absolute memory
limit then use a smaller geometric growth factor, like 1.3, to avoid wasting
too much memory.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 02 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
d-bugmail puremagic.com wrote:
 http://d.puremagic.com/issues/show_bug.cgi?id=4412
 
 
 bearophile_hugs eml.cc changed:
 
            What    |Removed                     |Added
 ----------------------------------------------------------------------------
                  CC|                            |bearophile_hugs eml.cc
 
 
 --- Comment #4 from bearophile_hugs eml.cc 2010-07-02 09:31:02 PDT ---
 This is a simplified version of that code:
 
 
 import std.stdio: writeln;
 void main() {
     int[] array;
     size_t oldCapacity;
 
     foreach (i; 0 .. 20_000) {
         array ~= i;
         size_t newCapacity = array.capacity();
         if (newCapacity != oldCapacity) {
             writeln(cast(double)newCapacity / oldCapacity);
             oldCapacity = newCapacity;
         }
     }
 }
 
 
 It prints:
 
 inf
 2.33333
 2.14286
 2.06667
 2.03226
 2.01587
 2.00787
 1.99608
 2.00589
 2.00294
 1.50073
 1.33366
 1.25018
 1.20012
 1.16675
 1.14292
 1.12505
 1.11115
 1.10003
 1.09093
 1.08335
 1.07694
 1.07144
 1.06668
 1.06251
 1.05883
 1.05556
 1.05264
 
 This output shows that there is a bug: to allow an O(1) amortized append the
 size of the array has to grow geometrically.
...when it is actually being move. Per my previous message - you may want to evaluate and print the ratio only when the array is effectively relocated. Andrei
Jul 02 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4412



--- Comment #5 from bearophile_hugs eml.cc 2010-07-02 09:48:24 PDT ---
This is a modified program, after a suggestion by Andrei:


import std.stdio: writeln;
void main() {
    int[] array;
    int* old_ptr;

    foreach (i; 0 .. 100_000_000) {
        array ~= i;
        int* new_ptr = array.ptr;
        if (array.ptr != old_ptr) {
            old_ptr = array.ptr;
            writeln(cast(double)array.capacity() / array.length);
        }
    }
}


It outputs:
3
1.75
1.875
1.9375
1.96875
1.98438
1.99219
1.98828
1.99414
3.00001
2.7305
2.60017
2.48002
2.37

It shows that when relocation happens the capacity() grows significantly. But
that grow factor looks quite large.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 02 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4412



--- Comment #6 from Steven Schveighoffer <schveiguy yahoo.com> 2010-07-02
10:30:19 PDT ---
OK, here's the deal:

When the current block can be extended into the next page, it is.  This does
not take into account any scaling factors.  That is, if you ask for one more
byte, and a page can be lumped onto the end, then only 1 page is added.  This
explains the behavior most of the time.

If a page cannot be added, then it uses the newCapacity function.  However, I
think this function may return too high a value.

Essentially, the function looks like this:

100 + (1000L * size) / log2plus1(newcap);

Where size is the size of a single element in bytes, newcap is the size of the
*total* memory being requested (old size + appended size) in bytes, and
log2plus1 essentially gets 1 + highest bit set.  The result is a percentage of
the total size requested to use.

Where I think the problem is using size in this equation.  I don't really
understand why size has to be taken into account since newcap already takes
size into account.

What happens is, let's say you are doing an array of bytes, and you are
requesting 100 bytes.  size is 1, newcap is 100.  log2plus1(100) is 7.  So the
resulting formula looks like:

100 + (1000L * 1) / 7 => 242, meaning 242% of the original size requested.

Now, if we use the integer, which is 4 bytes, as our array type, the formula
now has 400 as newcap and size is 4.  Resulting in:

100 + (1000L * 4) / 9 => 544, meaning 544% of the original size requested.

I'm not sure why larger array element types should need more relative space, it
doesn't make a lot of sense, but not much is explained in the comment on why
the formula is the way it is.

I'm unsure how to change this, any ideas?

For one, I think we can use the newCapacity function always, even when
appending pages (I think it will just append as many pages as it can up to the
requested size).

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 02 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4412


Fawzi Mohamed <fawzi gmx.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fawzi gmx.ch


--- Comment #7 from Fawzi Mohamed <fawzi gmx.ch> 2010-07-14 05:29:31 PDT ---
When this issue was discovered in tango I invested some time thinking about it.
What I came out with is
  http://github.com/fawzi/blip/blob/master/blip/util/Grow.d
a version of which I had also used in tango.

Looking again at it I think that I decided to grow based on memory usage, and
give "extra" space to arrays with large elements just using rounding up.
Particularly with 32-bit one should be aware that some calculation in extreme
cases could overflow.

I release the growing code with whatever license needed.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 14 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4412


Steven Schveighoffer <schveiguy yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


--- Comment #8 from Steven Schveighoffer <schveiguy yahoo.com> 2010-07-14
13:36:01 PDT ---
Fixed in changeset http://www.dsource.org/projects/druntime/changeset/332

This should grow at a more normal rate (logarigthmic in relation to the size of
the array)  There will be short growths, that's when just appending free pages
can satisfy the append vs. having to commit more pages from the OS.

More tweaking may get this to be a better curve, but I'm pretty happy with it
now.

Your test program now looks like this:

length: 1 capacity: 3 ratio: inf
length: 4 capacity: 7 ratio: 2.33333
length: 8 capacity: 15 ratio: 2.14286
length: 16 capacity: 31 ratio: 2.06667
length: 32 capacity: 63 ratio: 2.03226
length: 64 capacity: 127 ratio: 2.01587
length: 128 capacity: 255 ratio: 2.00787
length: 256 capacity: 511 ratio: 2.00392
length: 512 capacity: 1019 ratio: 1.99413
length: 1020 capacity: 2043 ratio: 2.00491
length: 2044 capacity: 4091 ratio: 2.00245
length: 4092 capacity: 7163 ratio: 1.75092
length: 7164 capacity: 8187 ratio: 1.14296
length: 8188 capacity: 9211 ratio: 1.12508
length: 9212 capacity: 15355 ratio: 1.66703
length: 15356 capacity: 24571 ratio: 1.6002
length: 24572 capacity: 25595 ratio: 1.04168
length: 25596 capacity: 40955 ratio: 1.60012
length: 40956 capacity: 41979 ratio: 1.025
length: 41980 capacity: 65531 ratio: 1.56104
length: 65532 capacity: 73723 ratio: 1.12501
length: 73724 capacity: 74747 ratio: 1.01389
length: 74748 capacity: 113659 ratio: 1.52058
length: 113660 capacity: 122875 ratio: 1.08108
length: 122876 capacity: 123899 ratio: 1.00833
length: 123900 capacity: 188411 ratio: 1.52068
length: 188412 capacity: 282619 ratio: 1.50001
length: 282620 capacity: 294907 ratio: 1.04348
length: 294908 capacity: 295931 ratio: 1.00347
length: 295932 capacity: 435195 ratio: 1.4706
length: 435196 capacity: 442363 ratio: 1.01647
length: 442364 capacity: 651259 ratio: 1.47223
length: 651260 capacity: 655355 ratio: 1.00629
length: 655356 capacity: 950267 ratio: 1.45
length: 950268 capacity: 951291 ratio: 1.00108
length: 951292 capacity: 1380347 ratio: 1.45102
length: 1380348 capacity: 1392635 ratio: 1.0089
...

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 14 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4412



--- Comment #9 from Ali Cehreli <acehreli yahoo.com> 2010-07-14 14:00:49 PDT ---
Thanks! Looks pretty smooth now... :)

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 14 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4412



--- Comment #10 from bearophile_hugs eml.cc 2010-07-14 14:55:42 PDT ---
After this change how is the output of the program in Comment 5? Is it
unchanged?

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 14 2010
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4412



--- Comment #11 from Steven Schveighoffer <schveiguy yahoo.com> 2010-07-15
07:22:10 PDT ---
The output from comment 5 looks like this:
3
1.75
1.875
1.9375
1.96875
1.98438
1.99219
1.99609
1.99023
1.50001
1.47222
1.45
1.43015
1.41
1.40007
1.38002
1.37002

and then my system starts thrashing :)  I only have 1G of RAM

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 15 2010