
digitalmars.D.learn - Reserving capacity in associative arrays

reply Jon D <jond noreply.com> writes:
Is there a way to reserve capacity in associative arrays? In some 
programs I've been writing I've been getting reasonable 
performance up to about 10 million entries, but beyond that 
performance is impacted considerably (say, 30 million or 50 
million entries). GC stats (via the "--DRT-gcopt=profile:1" 
option) indicate dramatic increases in gc time, which I'm 
assuming comes from resizing the underlying hash table. I'm 
guessing that by preallocating a large size the performance 
degradation would not be quite so dramatic. The underlying 
implementation of associative arrays appears to take an initial 
number of buckets, and there's a private resize() method, but 
it's not clear if there's a public way to use these.

--Jon
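For contrast, dynamic arrays do expose this publicly via reserve and capacity; the sketch below shows the array calls (which are real) next to the missing AA equivalent (the commented-out line is hypothetical, exactly the gap being asked about):

```d
import std.stdio : writeln;

void main()
{
    // Dynamic arrays have a public reserve:
    int[] arr;
    arr.reserve(1_000_000);   // preallocates; appends up to capacity won't reallocate
    writeln(arr.capacity >= 1_000_000);

    // There is no equivalent for associative arrays:
    size_t[string] aa;
    // aa.reserve(1_000_000);  // no such method -- this is the gap
}
```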
Feb 14 2016
next sibling parent reply sigod <sigod.mail gmail.com> writes:
On Monday, 15 February 2016 at 03:22:44 UTC, Jon D wrote:
 Is there a way to reserve capacity in associative arrays? In 
 some programs I've been writing I've been getting reasonable 
 performance up to about 10 million entries, but beyond that 
 performance is impacted considerably (say, 30 million or 50 
 million entries). GC stats (via the "--DRT-gcopt=profile:1" 
 option) indicate dramatic increases in gc time, which I'm 
 assuming comes from resizing the underlying hash table. I'm 
 guessing that by preallocating a large size the performance 
 degradation would not be quite so dramatic. The underlying 
 implementation of associative arrays appears to take an initial 
 number of buckets, and there's a private resize() method, but 
 it's not clear if there's a public way to use these.

 --Jon
Maybe try using this: http://code.dlang.org/packages/aammm
Feb 14 2016
parent Jon D <jond noreply.com> writes:
On Monday, 15 February 2016 at 05:29:23 UTC, sigod wrote:
 On Monday, 15 February 2016 at 03:22:44 UTC, Jon D wrote:
 Is there a way to reserve capacity in associative arrays?
 [snip]
Maybe try using this: http://code.dlang.org/packages/aammm
Thanks, I wasn't aware of this package. I'll give it a try. --Jon
Feb 14 2016
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 2/14/16 10:22 PM, Jon D wrote:
 Is there a way to reserve capacity in associative arrays? In some
 programs I've been writing I've been getting reasonable performance up
 to about 10 million entries, but beyond that performance is impacted
 considerably (say, 30 million or 50 million entries). GC stats (via the
 "--DRT-gcopt=profile:1" option) indicate dramatic increases in gc time,
 which I'm assuming comes from resizing the underlying hash table. I'm
 guessing that by preallocating a large size the performance degradation
 would not be quite so dramatic. The underlying implementation of
 associative arrays appears to take an initial number of buckets, and
 there's a private resize() method, but it's not clear if there's a
 public way to use these.
There is not a public way to access these methods unfortunately.

It would be a good addition to druntime I believe.

Recently, I added a clear method to the AA, which does not reduce capacity. So if you frequently build large AAs, and then throw them away, you could instead reuse the memory.

I would caution to be sure of this cause, however, before thinking it would solve the problem. The AA not only uses an array for buckets, but allocates a memory location for each element as well. I'm often wrong when I assume what the problem is when it comes to GC issues...

-Steve
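A minimal sketch of the reuse pattern described above, assuming a druntime recent enough to have the AA clear method (batch sizes and key choices are illustrative only):

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    size_t[string] counts;

    foreach (batch; 0 .. 3)
    {
        // Fill the AA; the first batch is what grows the bucket array.
        foreach (i; 0 .. 100_000)
            counts[i.to!string]++;

        writeln("batch entries: ", counts.length);

        // clear removes all entries but does not reduce capacity,
        // so the next batch reuses the table instead of growing
        // it again from scratch.
        counts.clear();
        assert(counts.length == 0);
    }
}
```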
Feb 16 2016
next sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven 
Schveighoffer wrote:
 There is not a public way to access these methods unfortunately.

 It would be a good addition to druntime I believe.

 Recently, I added a clear method to the AA, which does not 
 reduce capacity. So if you frequently build large AAs, and then 
 throw them away, you could instead reuse the memory.

 I would caution to be sure of this cause, however, before 
 thinking it would solve the problem. The AA not only uses an 
 array for buckets, but allocates a memory location for each 
 element as well. I'm often wrong when I assume what the problem 
 is when it comes to GC issues...

 -Steve
After reading the topic I've added this enhancement proposal, though I'm not sure it's feasible: https://issues.dlang.org/show_bug.cgi?id=15682

The idea is to concatenate smaller AAs into the destination.
Feb 16 2016
parent Jon D <jond noreply.com> writes:
On Tuesday, 16 February 2016 at 17:05:11 UTC, Basile B. wrote:
 On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven 
 Schveighoffer wrote:
 There is not a public way to access these methods 
 unfortunately.

 It would be a good addition to druntime I believe.

 -Steve
 After reading the topic I've added this enhancement proposal, 
 though I'm not sure it's feasible: 
 https://issues.dlang.org/show_bug.cgi?id=15682

 The idea is to concatenate smaller AAs into the destination.
There is also this: https://issues.dlang.org/show_bug.cgi?id=2504
Feb 16 2016
prev sibling parent reply Jon D <jond noreply.com> writes:
On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven 
Schveighoffer wrote:
 On 2/14/16 10:22 PM, Jon D wrote:
 Is there a way to reserve capacity in associative arrays?
 [snip]
 The underlying implementation of associative arrays
 appears to take an initial number of buckets, and there's
 a private resize() method, but it's not clear if there's a
 public way to use these.
 There is not a public way to access these methods 
 unfortunately. It would be a good addition to druntime I 
 believe.

 Recently, I added a clear method to the AA, which does not 
 reduce capacity. So if you frequently build large AAs, and then 
 throw them away, you could instead reuse the memory.
My programs build AAs lasting the lifetime of the program.
 I would caution to be sure of this cause, however, before 
 thinking it would solve the problem. The AA not only uses an 
 array for buckets, but allocates a memory location for each 
 element as well. I'm often wrong when I assume what the problem 
 is when it comes to GC issues...
Completely agree. After posting I decided to take a more methodical look. Not finished yet, but I can share part of it. The key thing so far is a noticeable step function in GC costs related to AA size (likely not a surprise).

My programs work with large data sets. Size is open-ended, so what I'm trying to do is get an idea of the data set sizes they will handle reasonably. For purposes of illustration, word-count is a reasonable proxy for what I'm doing. It was in this context that I saw a significant performance drop-off after 'size_t[string]' AAs reached about 10 million entries.

I've started measuring with a simple program. Basically:

    StopWatch sw;
    sw.start;
    size_t[size_t] counts;
    foreach (i; 0..iterations)
        counts[uniform(0, uniqMax)]++;
    sw.stop;

Same thing with string as key ('size_t[string]') AAs. 'iterations' and 'uniqMax' are varied between runs. GC stats are printed (via "--DRT-gcopt=profile:1"), plus timing and AA size. (Runs use LDC 17, release mode compiles, a fast 16GB MacBook.)

For the integer-as-key case ('size_t[size_t]'), there are notable jumps in GC total time and GC max pause time as the AA size crosses specific thresholds. This makes sense, as the AA needs to grow. Approximate steps:

| entries | gc_total (ms) | gc_max_pause (ms) |
|---------+---------------+-------------------|
|      2M |            30 |                60 |
|      4M |           200 |               100 |
|     12M |           650 |               330 |
|     22M |          1650 |               750 |
|     44M |          5300 |              3200 |

Iterations didn't matter, and GC total time and GC max pause time were largely flat between these jumps. This suggests AA resize is the likely driver, and that preallocating a large size might address it.

To the point about being sure about the cause - my programs use strings as keys, not integers. The performance drop-off with strings was quite a bit more significant than with integers. That analysis seems a bit trickier, and I'm not done with it yet. Different memory allocation, perhaps effects from creating short-lived, temporary strings to test AA membership. Could easily be that string use, or the combination of AAs with strings as keys, is a larger effect.

The other thing that jumps out from the table is that the GC max pause time gets to be multiple seconds. Not an issue for my tools, which aren't interactive at those points, but it would be a significant issue for many interactive apps.

--Jon
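A self-contained version of that benchmark sketch might look like the following; 'iterations' and 'uniqMax' are taken from the description above, and it uses the current std.datetime.stopwatch module rather than the older std.datetime StopWatch:

```d
import std.datetime.stopwatch : StopWatch;
import std.random : uniform;
import std.stdio : writefln;

void main()
{
    enum iterations = 10_000_000;  // total insert/increment operations; tweak per run
    enum uniqMax = 4_000_000;      // upper bound on distinct keys; tweak per run

    StopWatch sw;
    sw.start;

    size_t[size_t] counts;
    foreach (i; 0 .. iterations)
        counts[uniform(0, uniqMax)]++;   // count occurrences per random key

    sw.stop;
    writefln("entries: %s, elapsed: %s ms",
             counts.length, sw.peek.total!"msecs");
}
```

Run with "--DRT-gcopt=profile:1" to get the GC totals alongside the wall-clock time.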
Feb 16 2016
parent reply "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Tue, Feb 16, 2016 at 07:34:07PM +0000, Jon D via Digitalmars-d-learn wrote:
 On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven Schveighoffer wrote:
On 2/14/16 10:22 PM, Jon D wrote:
Is there a way to reserve capacity in associative arrays?
[snip]
The underlying implementation of associative arrays appears to take
an initial number of buckets, and there's a private resize() method,
but it's not clear if there's a public way to use these.
Rehashing (aa.rehash) would resize the number of buckets, but if you don't already have the requisite number of keys, it wouldn't help. [...]
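For completeness, rehash is the one related knob that is public today; it reorganizes the table for the keys already present rather than reserving room ahead of time (the word list here is just for illustration):

```d
import std.stdio : writeln;

void main()
{
    size_t[string] aa;
    foreach (word; ["the", "quick", "brown", "fox", "the", "quick"])
        aa[word]++;

    // rehash reorganizes the table in place for the current key set.
    // It can speed up subsequent lookups, but it does not reserve
    // space for keys that haven't been inserted yet.
    aa.rehash;
    writeln(aa["the"]);
}
```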
I would caution to be sure of this cause, however, before thinking it
would solve the problem. The AA not only uses an array for buckets,
but allocates a memory location for each element as well. I'm often
wrong when I assume what the problem is when it comes to GC issues...
Completely agree. After posting I decided to take a more methodical look. Not finished yet, but I can share part of it. Key thing so far is noticeable step function related to GC costs related to AA size (likely not a surprise). My programs work with large data sets. Size is open-ended, what I'm trying to do is get an idea of the data set sizes they will handle reasonably. For purposes of illustration, word-count is a reasonable proxy for what I'm doing. It was in this context that I saw significant performance drop-off after 'size_t[string]' AAs reached about 10 million entries.
I also have a program that builds very large AAs (also up to millions of keys, sometimes more) that essentially last for the entire duration of the program, and I've found that a lot of the performance degradation can be attributed to overly-frequent GC collection runs. As a workaround, I did something like this:

    size_t counter = 0;

    void main()
    {
        GC.disable();
        doWork();
        GC.enable();  // optional
    }

    void doWork()
    {
        Data[Key] aa = ...;
        mainloop: while (...)
        {
            ... // lots of stuff, including adding things to the AA

            // Manually schedule GC collections.
            // The 10_000 is just an example number; tweak as you see fit.
            if (++counter % 10_000 == 0)
                GC.collect();
        }
    }

Basically, I run GC collections on my own schedule, with the frequency controlled by tweaking the counter check. (Obviously, how you implement this depends on whether there are regular or semi-regular units of work that your program does, and whether they are a good basis for scheduling collections.)

By tweaking the frequency of the manual GC collections, I've been able to obtain 30-40% speedups in my program (in some cases, it could get as high as 50%). YMMV.

Of course, as with all performance-related questions, the only way to be sure that the GC (or anything else) is the cause of your problem is to profile. More often than not, I've found that the profiler has proven me wrong about where the real bottleneck is, so I try not to assume anything until I see the profile data.

T

-- 
If Java had true garbage collection, most programs would delete themselves upon execution. -- Robert Sewell
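A compilable sketch of that pattern, with the program-specific parts replaced by a hypothetical insert-heavy loop (the collection interval of 100_000 is an assumed starting point, not a recommendation):

```d
import core.memory : GC;
import std.stdio : writeln;

size_t counter = 0;

// Hypothetical unit of work: one AA insertion stands in for
// "lots of stuff, including adding things to the AA".
void addEntry(ref size_t[size_t] aa, size_t key)
{
    aa[key]++;

    // Run a collection every N units of work instead of letting
    // the runtime decide when to collect.
    if (++counter % 100_000 == 0)
        GC.collect();
}

void main()
{
    GC.disable();             // take over collection scheduling
    scope(exit) GC.enable();  // restore automatic collection on exit

    size_t[size_t] aa;
    foreach (i; 0 .. 1_000_000)
        addEntry(aa, i % 250_000);

    writeln("entries: ", aa.length);
}
```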
Feb 16 2016
parent Jon D <jond noreply.com> writes:
On Tuesday, 16 February 2016 at 19:49:55 UTC, H. S. Teoh wrote:
 On Tue, Feb 16, 2016 at 07:34:07PM +0000, Jon D via 
 Digitalmars-d-learn wrote:
 On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven 
 Schveighoffer wrote:
On 2/14/16 10:22 PM, Jon D wrote:
Is there a way to reserve capacity in associative arrays?
[snip]
The underlying implementation of associative arrays appears 
to take
an initial number of buckets, and there's a private resize() 
method,
but it's not clear if there's a public way to use these.
Rehashing (aa.rehash) would resize the number of buckets, but if you don't already have the requisite number of keys, it wouldn't help.
Thanks for the reply and the detailed example for manually controlling GC. I haven't experimented with taking control over GC that way.

Regarding reserving capacity, the relevant method is aa.resize(), not aa.rehash(). See: https://github.com/D-Programming-Language/druntime/blob/master/src/rt/aaA.d#L141. This allocates space for the buckets; it doesn't matter whether the keys are known yet. Note that every time the bucket array is resized, the old bucket array is walked and the elements are reinserted. Preallocating a large bucket array would avoid this.

See also the private constructor in the same file (line 51). It takes an initial size.

--Jon
Feb 16 2016