
digitalmars.D.learn - Reserving capacity in associative arrays

reply Jon D <jond noreply.com> writes:
Is there a way to reserve capacity in associative arrays? In some 
programs I've been writing I've been getting reasonable 
performance up to about 10 million entries, but beyond that 
performance is impacted considerably (say, 30 million or 50 
million entries). GC stats (via the "--DRT-gcopt=profile:1" 
option) indicate dramatic increases in gc time, which I'm 
assuming comes from resizing the underlying hash table. I'm 
guessing that by preallocating a large size the performance 
degradation would not be quite so dramatic. The underlying 
implementation of associative arrays appears to take an initial 
number of buckets, and there's a private resize() method, but 
it's not clear if there's a public way to use these.

--Jon
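For contrast, dynamic arrays do expose this publicly via reserve and capacity; the sketch below shows the array calls (which are real) next to the missing AA equivalent (the commented-out line is hypothetical, exactly the gap being asked about):

```d
import std.stdio : writeln;

void main()
{
    // Dynamic arrays have a public reserve:
    int[] arr;
    arr.reserve(1_000_000);   // preallocates; appends up to capacity won't reallocate
    writeln(arr.capacity >= 1_000_000);

    // There is no equivalent for associative arrays:
    size_t[string] aa;
    // aa.reserve(1_000_000);  // no such method -- this is the gap
}
```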
Feb 14 2016
next sibling parent reply sigod <sigod.mail gmail.com> writes:
On Monday, 15 February 2016 at 03:22:44 UTC, Jon D wrote:
 Is there a way to reserve capacity in associative arrays? In 
 some programs I've been writing I've been getting reasonable 
 performance up to about 10 million entries, but beyond that 
 performance is impacted considerably (say, 30 million or 50 
 million entries). GC stats (via the "--DRT-gcopt=profile:1" 
 option) indicate dramatic increases in gc time, which I'm 
 assuming comes from resizing the underlying hash table. I'm 
 guessing that by preallocating a large size the performance 
 degradation would not be quite so dramatic. The underlying 
 implementation of associative arrays appears to take an initial 
 number of buckets, and there's a private resize() method, but 
 it's not clear if there's a public way to use these.

 --Jon
Maybe try using this: http://code.dlang.org/packages/aammm
Feb 14 2016
parent Jon D <jond noreply.com> writes:
On Monday, 15 February 2016 at 05:29:23 UTC, sigod wrote:
 On Monday, 15 February 2016 at 03:22:44 UTC, Jon D wrote:
 Is there a way to reserve capacity in associative arrays?
 [snip]
Maybe try using this: http://code.dlang.org/packages/aammm
Thanks, I wasn't aware of this package. I'll give it a try. --Jon
Feb 14 2016
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 2/14/16 10:22 PM, Jon D wrote:
 Is there a way to reserve capacity in associative arrays? In some
 programs I've been writing I've been getting reasonable performance up
 to about 10 million entries, but beyond that performance is impacted
 considerably (say, 30 million or 50 million entries). GC stats (via the
 "--DRT-gcopt=profile:1" option) indicate dramatic increases in gc time,
 which I'm assuming comes from resizing the underlying hash table. I'm
 guessing that by preallocating a large size the performance degradation
 would not be quite so dramatic. The underlying implementation of
 associative arrays appears to take an initial number of buckets, and
 there's a private resize() method, but it's not clear if there's a
 public way to use these.
There is not a public way to access these methods unfortunately.

It would be a good addition to druntime I believe.

Recently, I added a clear method to the AA, which does not reduce capacity. So if you frequently build large AAs, and then throw them away, you could instead reuse the memory.

I would caution to be sure of this cause, however, before thinking it would solve the problem. The AA not only uses an array for buckets, but allocates a memory location for each element as well. I'm often wrong when I assume what the problem is when it comes to GC issues...

-Steve
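A minimal sketch of the reuse pattern described above, assuming a druntime recent enough to have the AA clear method (batch sizes and key choices are illustrative only):

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    size_t[string] counts;

    foreach (batch; 0 .. 3)
    {
        // Fill the AA; the first batch is what grows the bucket array.
        foreach (i; 0 .. 100_000)
            counts[i.to!string]++;

        writeln("batch entries: ", counts.length);

        // clear removes all entries but does not reduce capacity,
        // so the next batch reuses the table instead of growing
        // it again from scratch.
        counts.clear();
        assert(counts.length == 0);
    }
}
```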
Feb 16 2016
next sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven 
Schveighoffer wrote:
 There is not a public way to access these methods unfortunately.

 It would be a good addition to druntime I believe.

 Recently, I added a clear method to the AA, which does not 
 reduce capacity. So if you frequently build large AAs, and then 
 throw them away, you could instead reuse the memory.

 I would caution to be sure of this cause, however, before 
 thinking it would solve the problem. The AA not only uses an 
 array for buckets, but allocates a memory location for each 
 element as well. I'm often wrong when I assume what the problem 
 is when it comes to GC issues...

 -Steve
After reading the topic I've added this enhancement proposal, though I'm not sure it's feasible: https://issues.dlang.org/show_bug.cgi?id=15682

The idea is to concatenate smaller AAs into the destination.
Feb 16 2016
parent Jon D <jond noreply.com> writes:
On Tuesday, 16 February 2016 at 17:05:11 UTC, Basile B. wrote:
 On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven 
 Schveighoffer wrote:
 There is not a public way to access these methods 
 unfortunately.

 It would be a good addition to druntime I believe.

 -Steve
 After reading the topic I've added this enhancement proposal, 
 though I'm not sure it's feasible: 
 https://issues.dlang.org/show_bug.cgi?id=15682

 The idea is to concatenate smaller AAs into the destination.
There is also this: https://issues.dlang.org/show_bug.cgi?id=2504
Feb 16 2016
prev sibling parent reply Jon D <jond noreply.com> writes:
On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven 
Schveighoffer wrote:
 On 2/14/16 10:22 PM, Jon D wrote:
 Is there a way to reserve capacity in associative arrays?
 [snip]
 The underlying implementation of associative arrays
 appears to take an initial number of buckets, and there's
 a private resize() method, but it's not clear if there's a
 public way to use these.
 There is not a public way to access these methods 
 unfortunately. It would be a good addition to druntime I 
 believe.

 Recently, I added a clear method to the AA, which does not 
 reduce capacity. So if you frequently build large AAs, and then 
 throw them away, you could instead reuse the memory.
My programs build AAs lasting the lifetime of the program.
 I would caution to be sure of this cause, however, before 
 thinking it would solve the problem. The AA not only uses an 
 array for buckets, but allocates a memory location for each 
 element as well. I'm often wrong when I assume what the problem 
 is when it comes to GC issues...
Completely agree. After posting I decided to take a more methodical look. Not finished yet, but I can share part of it. The key thing so far is a noticeable step function in GC costs related to AA size (likely not a surprise).

My programs work with large data sets. Size is open-ended, so what I'm trying to do is get an idea of the data set sizes they will handle reasonably. For purposes of illustration, word-count is a reasonable proxy for what I'm doing. It was in this context that I saw a significant performance drop-off after 'size_t[string]' AAs reached about 10 million entries.

I've started measuring with a simple program. Basically:

    StopWatch sw;
    sw.start;
    size_t[size_t] counts;
    foreach (i; 0..iterations)
        counts[uniform(0, uniqMax)]++;
    sw.stop;

Same thing with string as key ('size_t[string]') AAs. 'iterations' and 'uniqMax' are varied between runs. GC stats are printed (via "--DRT-gcopt=profile:1"), plus timing and AA size. (Runs use LDC 17, release mode compiles, a fast 16GB MacBook.)

For the integer-as-key case ('size_t[size_t]'), there are notable jumps in GC total time and GC max pause time as the AA size crosses specific thresholds. This makes sense, as the AA needs to grow. Approximate steps:

| entries | gc_total (ms) | gc_max_pause (ms) |
|---------+---------------+-------------------|
|      2M |            30 |                60 |
|      4M |           200 |               100 |
|     12M |           650 |               330 |
|     22M |          1650 |               750 |
|     44M |          5300 |              3200 |

Iterations didn't matter, and GC total time and GC max pause time were largely flat between these jumps. This suggests AA resize is the likely driver, and that preallocating a large size might address it.

To the point about being sure about the cause - my programs use strings as keys, not integers. The performance drop-off with strings was quite a bit more significant than with integers. That analysis seems a bit trickier, and I'm not done with it yet. Different memory allocation, perhaps effects from creating short-lived, temporary strings to test AA membership. Could easily be that string use, or the combination of AAs with strings as keys, is a larger effect.

The other thing that jumps out from the table is that the GC max pause time gets to be multiple seconds. Not an issue for my tools, which aren't interactive at those points, but it would be a significant issue for many interactive apps.

--Jon
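A self-contained version of that benchmark sketch might look like the following; 'iterations' and 'uniqMax' are taken from the description above, and it uses the current std.datetime.stopwatch module rather than the older std.datetime StopWatch:

```d
import std.datetime.stopwatch : StopWatch;
import std.random : uniform;
import std.stdio : writefln;

void main()
{
    enum iterations = 10_000_000;  // total insert/increment operations; tweak per run
    enum uniqMax = 4_000_000;      // upper bound on distinct keys; tweak per run

    StopWatch sw;
    sw.start;

    size_t[size_t] counts;
    foreach (i; 0 .. iterations)
        counts[uniform(0, uniqMax)]++;   // count occurrences per random key

    sw.stop;
    writefln("entries: %s, elapsed: %s ms",
             counts.length, sw.peek.total!"msecs");
}
```

Run with "--DRT-gcopt=profile:1" to get the GC totals alongside the wall-clock time.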
Feb 16 2016
parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Tue, Feb 16, 2016 at 07:34:07PM +0000, Jon D via Digitalmars-d-learn wrote:
 On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven Schveighoffer wrote:
On 2/14/16 10:22 PM, Jon D wrote:
Is there a way to reserve capacity in associative arrays?
[snip]
The underlying implementation of associative arrays appears to take
an initial number of buckets, and there's a private resize() method,
but it's not clear if there's a public way to use these.
Rehashing (aa.rehash) would resize the number of buckets, but if you don't already have the requisite number of keys, it wouldn't help. [...]
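For completeness, rehash is the one related knob that is public today; it reorganizes the table for the keys already present rather than reserving room ahead of time (the word list here is just for illustration):

```d
import std.stdio : writeln;

void main()
{
    size_t[string] aa;
    foreach (word; ["the", "quick", "brown", "fox", "the", "quick"])
        aa[word]++;

    // rehash reorganizes the table in place for the current key set.
    // It can speed up subsequent lookups, but it does not reserve
    // space for keys that haven't been inserted yet.
    aa.rehash;
    writeln(aa["the"]);
}
```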
I would caution to be sure of this cause, however, before thinking it
would solve the problem. The AA not only uses an array for buckets,
but allocates a memory location for each element as well. I'm often
wrong when I assume what the problem is when it comes to GC issues...
Completely agree. After posting I decided to take a more methodical look. Not finished yet, but I can share part of it. Key thing so far is noticeable step function related to GC costs related to AA size (likely not a surprise). My programs work with large data sets. Size is open-ended, what I'm trying to do is get an idea of the data set sizes they will handle reasonably. For purposes of illustration, word-count is a reasonable proxy for what I'm doing. It was in this context that I saw significant performance drop-off after 'size_t[string]' AAs reached about 10 million entries.
I also have a program that builds very large AAs (also up to millions of keys, sometimes more) that essentially last for the entire duration of the program, and I've found that a lot of the performance degradation can be attributed to overly-frequent GC collection runs. As a workaround, I did something like this:

    size_t counter = 0;

    void main()
    {
        GC.disable();
        doWork();
        GC.enable();  // optional
    }

    void doWork()
    {
        Data[Key] aa = ...;
        mainloop: while (...)
        {
            ... // lots of stuff, including adding things to the AA

            // Manually schedule GC collections.
            // The 10_000 is just an example number; tweak as you see fit.
            if (++counter % 10_000 == 0)
                GC.collect();
        }
    }

Basically, I run GC collections on my own schedule, with the frequency controlled by tweaking the counter check. (Obviously, how you implement this depends on whether there are regular or semi-regular units of work that your program does, and whether they are a good basis for scheduling collections.)

By tweaking the frequency of the manual GC collections, I've been able to obtain 30-40% speedups in my program (in some cases, it could get as high as 50%). YMMV.

Of course, as with all performance-related questions, the only way to be sure that the GC (or anything else) is the cause of your problem is to profile. More often than not, I've found that the profiler has proven me wrong about where the real bottleneck is, so I try not to assume anything until I see the profile data.

T

-- 
If Java had true garbage collection, most programs would delete themselves upon execution. -- Robert Sewell
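A compilable sketch of that pattern, with the program-specific parts replaced by a hypothetical insert-heavy loop (the collection interval of 100_000 is an assumed starting point, not a recommendation):

```d
import core.memory : GC;
import std.stdio : writeln;

size_t counter = 0;

// Hypothetical unit of work: one AA insertion stands in for
// "lots of stuff, including adding things to the AA".
void addEntry(ref size_t[size_t] aa, size_t key)
{
    aa[key]++;

    // Run a collection every N units of work instead of letting
    // the runtime decide when to collect.
    if (++counter % 100_000 == 0)
        GC.collect();
}

void main()
{
    GC.disable();             // take over collection scheduling
    scope(exit) GC.enable();  // restore automatic collection on exit

    size_t[size_t] aa;
    foreach (i; 0 .. 1_000_000)
        addEntry(aa, i % 250_000);

    writeln("entries: ", aa.length);
}
```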
Feb 16 2016
parent Jon D <jond noreply.com> writes:
On Tuesday, 16 February 2016 at 19:49:55 UTC, H. S. Teoh wrote:
 On Tue, Feb 16, 2016 at 07:34:07PM +0000, Jon D via 
 Digitalmars-d-learn wrote:
 On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven 
 Schveighoffer wrote:
On 2/14/16 10:22 PM, Jon D wrote:
Is there a way to reserve capacity in associative arrays?
[snip]
The underlying implementation of associative arrays appears 
to take
an initial number of buckets, and there's a private resize() 
method,
but it's not clear if there's a public way to use these.
Rehashing (aa.rehash) would resize the number of buckets, but if you don't already have the requisite number of keys, it wouldn't help.
Thanks for the reply and the detailed example for manually controlling GC. I haven't experimented with taking control over GC that way.

Regarding reserving capacity, the relevant method is aa.resize(), not aa.rehash(). See: https://github.com/D-Programming-Language/druntime/blob/master/src/rt/aaA.d#L141. This allocates space for the buckets; it doesn't matter whether the keys are known yet. Note that every time the bucket array is resized, the old bucket array is walked and the elements are reinserted. Preallocating a large bucket array would avoid this.

See also the private constructor in the same file (line 51). It takes an initial size.

--Jon
Feb 16 2016