www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Asking about performance of std concatenation vs. Appender vs. custom

reply ludo <fakeaddress gmail.com> writes:
Hi fellow Dlangers,

I am working on an old codebase, and found a re-implementation of 
standard dynamic arrays as a struct called [ArrayBuilder, visible 
on my github](https://tinyurl.com/2nr25ytt)

The comment says:
```d
/**
  * Behaves the same as built-in arrays, except about 6x faster 
with concatenation at the expense of the base pointer being 4 
system words instead of two (16 instead of 8 bytes on a 32-bit 
system).
  */
```

So I wrote a unittest to verify that claim, using for benchmark 
10_000_000 concatenations :

```d
trace("*** ArrayBuilder Concatenation benchmark ***");
import std.datetime.stopwatch : benchmark, Duration;
import std.array : appender;

		void concat_withStd()
		{
			int[] array;
			for (int j = 0; j < 1000; j++)
				array ~= j;
		}

		void concat_withStdReserve()
		{
			int[] array;
			array.reserve(1000);
			for (int j = 0; j < 1000; j++)
				array ~= j;
		}

		void concat_withStdLength()
		{
			int[] array;
			array.length = 1000;
			for (int j = 0; j < 1000; j++)
				array[j] = j;
		}

		void concat_withAppenderReserve()
		{
			auto array = appender!(int[]);
			//trace ("array is ", array);
			array.reserve(1000);
			for (int j = 0; j < 1000; j++)
				array ~= j;
		}

		void concat_withArrayBuilder()
		{
			ArrayBuilder!(int) array;
			for (int j = 0; j < 1000; j++)
				array ~= j;
		}

		auto r = benchmark!(concat_withStd,
		(...)//stuff
		concat_withArrayBuilder)(10_000);

		tracef("with std: %s", bench1);
		tracef("with stdReserve: %s", bench2);
		tracef("with stdLength: %s", bench3);
		tracef("with AppenderReserve: %s", bench4);
		tracef("with Arraybuilder: %s \n", bench5);
```

The results are below (you can also git clone the repo + dub 
test):

| Concatenation method | benchmark in ms|
|---------------------|---------------------|
|with std:            | 385 ms|
|with stdReserve:     | 327 ms|
|with stdLength:      | 29 ms|
|with AppenderReserve:| 256 ms|
|with Arraybuilder:   | 118 ms|

- The use of reserve does not seem to improve much the standard 
concatenation performance. **Would you know why?**
- Increasing the length of the array before-hand is by far the 
best solution, as expected.
- AppenderReserve outperforms a basic concatenation, but the gain 
is not completely obvious, even with a call to reserve. **Is it 
expected behavior?**

- ArrayBuilder is 3 times faster than standard concatenation, and 
twice as fast as AppenderReserve! **Which explanation would you 
give?** I looked at the ArrayBuilder struct, it's a very basic 
code with no magic trick. I also looked at the [phobos Appender 
code on 
github](https://github.com/dlang/phobos/blob/master/std/array.d), 
in particular the code for put() line 3460, it's obviously more 
complex:

```d
            import core.internal.lifetime : emplaceRef;

             ensureAddable(1);
             immutable len = _data.arr.length;

             auto bigData = (()  trusted => _data.arr.ptr[0 .. len 
+ 1])();
             emplaceRef!(Unqual!T)(bigData[len], cast() item);
             //We do this at the end, in case of exceptions
             _data.arr = bigData;

```

There are some functions calls... Maybe the drag comes from 
there. I am really out of my league here, any help appreciated. I 
am not even sure which part of the Appender code is activated.

cheers
Apr 01
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/1/21 10:53 AM, ludo wrote:

 The results are below (you can also git clone the repo + dub test):
 
 | Concatenation method | benchmark in ms|
 |---------------------|---------------------|
 |with std:            | 385 ms|
 |with stdReserve:     | 327 ms|
 |with stdLength:      | 29 ms|
 |with AppenderReserve:| 256 ms|
 |with Arraybuilder:   | 118 ms|
 
 - The use of reserve does not seem to improve much the standard 
 concatenation performance. **Would you know why?**
My guess: reserving cuts down on the reallocations, but that only takes some of the time. Appending a 1000-element int array is going to go from a 16-byte block, to a 32-byte block, etc. up to a 4096 byte block. This involves roughly 8 reallocations per test. But every append requires an opaque function call into the runtime to check if the allocated length is big enough, and reallocate if necessary. So if the reallocations aren't too expensive, then appending performance would not be too much different.
 - AppenderReserve outperforms a basic concatenation, but the gain is not 
 completely obvious, even with a call to reserve. **Is it expected 
 behavior?**
Yes, because Appender has direct access to the allocated length, rather than having to go through an opaque call.
 
 - ArrayBuilder is 3 times faster than standard concatenation, and twice 
 as fast as AppenderReserve! **Which explanation would you give?** 
This I have no idea on. There are probably a lot of explanations that could be true. Have you tried profiling the code?
 There are some functions calls... Maybe the drag comes from there. I am 
 really out of my league here, any help appreciated. I am not even sure 
 which part of the Appender code is activated.
Are you compiling with -inline -O? -Steve
Apr 01
parent ludo <fakeaddress gmail.com> writes:
 reserving cuts down on the reallocations, but that only takes 
 some of the time. Appending a 1000-element int array is going 
 to go from a 16-byte block, to a 32-byte block, etc. up to a 
 4096 byte block. This involves roughly 8 reallocations per test.

 But every append requires an opaque function call into the 
 runtime to check if the allocated length is big enough, and 
 reallocate if necessary.

 So if the reallocations aren't too expensive, then appending 
 performance would not be too much different.
For ArrayBuilder, the function call to check length is there too, it is the grow() function. Sure reallocations take time. But the comment of the class talks about an increased performance thanks to *the base pointer being 4 system words instead of two* What did the dev mean by that?
 Yes, because Appender has direct access to the allocated 
 length, rather than having to go through an opaque call.
Alright, ArrayBuilder definitely benefits from the same.
 This I have no idea on. There are probably a lot of 
 explanations that could be true. Have you tried profiling the 
 code?
I just tried to run dub test --build=profile on dterrent, but I get some obscure error message. Not working. Also tried to dmd -unittest -profile arraybuilder.d with an empty main(), not working, I don't get a trace.log. Am I supposed to create a main() test program to have access to profiling, that is to say there is no possible profiling of unittest?
 Are you compiling with -inline -O?

 -Steve
-inline does not work (segfault) but optimize does not change anything except for stdLength (10ms) and ... ArrayBuilder, which gain even more performance (90ms). Conclusion, ArrayBuilder is still much more efficient, 15 years after this code was written :) It behaves closer to a regular array, maybe because of space reallocation strategy? Any comment appreciated
Apr 09