digitalmars.D.learn - Asking about performance of std concatenation vs. Appender vs. custom

ludo (100/100) Apr 01 2021 Hi fellow Dlangers,

Steven Schveighoffer (16/37) Apr 01 2021 My guess:

ludo (19/35) Apr 09 2021 For ArrayBuilder, the function call to check length is there too,

ludo <fakeaddress gmail.com> writes:

Hi fellow Dlangers,

I am working on an old codebase, and found a re-implementation of 
standard dynamic arrays as a struct called [ArrayBuilder, visible 
on my github](https://tinyurl.com/2nr25ytt)

The comment says:
```d
/**
  * Behaves the same as built-in arrays, except about 6x faster 
with concatenation at the expense of the base pointer being 4 
system words instead of two (16 instead of 8 bytes on a 32-bit 
system).
  */
```

So I wrote a unittest to verify that claim, using for benchmark 
10_000_000 concatenations :

```d
trace("*** ArrayBuilder Concatenation benchmark ***");
import std.datetime.stopwatch : benchmark, Duration;
import std.array : appender;

		void concat_withStd()
		{
			int[] array;
			for (int j = 0; j < 1000; j++)
				array ~= j;
		}

		void concat_withStdReserve()
		{
			int[] array;
			array.reserve(1000);
			for (int j = 0; j < 1000; j++)
				array ~= j;
		}

		void concat_withStdLength()
		{
			int[] array;
			array.length = 1000;
			for (int j = 0; j < 1000; j++)
				array[j] = j;
		}

		void concat_withAppenderReserve()
		{
			auto array = appender!(int[]);
			//trace ("array is ", array);
			array.reserve(1000);
			for (int j = 0; j < 1000; j++)
				array ~= j;
		}

		void concat_withArrayBuilder()
		{
			ArrayBuilder!(int) array;
			for (int j = 0; j < 1000; j++)
				array ~= j;
		}

		auto r = benchmark!(concat_withStd,
		(...)//stuff
		concat_withArrayBuilder)(10_000);

		tracef("with std: %s", bench1);
		tracef("with stdReserve: %s", bench2);
		tracef("with stdLength: %s", bench3);
		tracef("with AppenderReserve: %s", bench4);
		tracef("with Arraybuilder: %s \n", bench5);
```

The results are below (you can also git clone the repo + dub 
test):

| Concatenation method | benchmark in ms|
|---------------------|---------------------|
|with std:            | 385 ms|
|with stdReserve:     | 327 ms|
|with stdLength:      | 29 ms|
|with AppenderReserve:| 256 ms|
|with Arraybuilder:   | 118 ms|

- The use of reserve does not seem to improve much the standard 
concatenation performance. **Would you know why?**
- Increasing the length of the array before-hand is by far the 
best solution, as expected.
- AppenderReserve outperforms a basic concatenation, but the gain 
is not completely obvious, even with a call to reserve. **Is it 
expected behavior?**

- ArrayBuilder is 3 times faster than standard concatenation, and 
twice as fast as AppenderReserve! **Which explanation would you 
give?** I looked at the ArrayBuilder struct, it's a very basic 
code with no magic trick. I also looked at the [phobos Appender 
code on 
github](https://github.com/dlang/phobos/blob/master/std/array.d), 
in particular the code for put() line 3460, it's obviously more 
complex:

```d
            import core.internal.lifetime : emplaceRef;

             ensureAddable(1);
             immutable len = _data.arr.length;

             auto bigData = (()  trusted => _data.arr.ptr[0 .. len 
+ 1])();
             emplaceRef!(Unqual!T)(bigData[len], cast() item);
             //We do this at the end, in case of exceptions
             _data.arr = bigData;

```

There are some functions calls... Maybe the drag comes from 
there. I am really out of my league here, any help appreciated. I 
am not even sure which part of the Appender code is activated.

cheers

Apr 01 2021

Steven Schveighoffer <schveiguy gmail.com> writes:

On 4/1/21 10:53 AM, ludo wrote:

 The results are below (you can also git clone the repo + dub test):
 
 | Concatenation method | benchmark in ms|
 |---------------------|---------------------|
 |with std:            | 385 ms|
 |with stdReserve:     | 327 ms|
 |with stdLength:      | 29 ms|
 |with AppenderReserve:| 256 ms|
 |with Arraybuilder:   | 118 ms|
 
 - The use of reserve does not seem to improve much the standard 
 concatenation performance. **Would you know why?**

My guess:

reserving cuts down on the reallocations, but that only takes some of 
the time. Appending a 1000-element int array is going to go from a 
16-byte block, to a 32-byte block, etc. up to a 4096 byte block. This 
involves roughly 8 reallocations per test.

But every append requires an opaque function call into the runtime to 
check if the allocated length is big enough, and reallocate if necessary.

So if the reallocations aren't too expensive, then appending performance 
would not be too much different.

 - AppenderReserve outperforms a basic concatenation, but the gain is not 
 completely obvious, even with a call to reserve. **Is it expected 
 behavior?**

Yes, because Appender has direct access to the allocated length, rather 
than having to go through an opaque call.

 
 - ArrayBuilder is 3 times faster than standard concatenation, and twice 
 as fast as AppenderReserve! **Which explanation would you give?** 

This I have no idea on. There are probably a lot of explanations that 
could be true. Have you tried profiling the code?


 There are some functions calls... Maybe the drag comes from there. I am 
 really out of my league here, any help appreciated. I am not even sure 
 which part of the Appender code is activated.

Are you compiling with -inline -O?

-Steve

Apr 01 2021

ludo <fakeaddress gmail.com> writes:

 reserving cuts down on the reallocations, but that only takes 
 some of the time. Appending a 1000-element int array is going 
 to go from a 16-byte block, to a 32-byte block, etc. up to a 
 4096 byte block. This involves roughly 8 reallocations per test.

 But every append requires an opaque function call into the 
 runtime to check if the allocated length is big enough, and 
 reallocate if necessary.

 So if the reallocations aren't too expensive, then appending 
 performance would not be too much different.

For ArrayBuilder, the function call to check length is there too, 
it is the grow() function. Sure reallocations take time. But the 
comment of the class talks about an increased performance thanks 
to *the base pointer being 4 system words instead of two*

What did the dev mean by that?

 Yes, because Appender has direct access to the allocated 
 length, rather than having to go through an opaque call.

Alright, ArrayBuilder definitely benefits from the same.

 This I have no idea on. There are probably a lot of 
 explanations that could be true. Have you tried profiling the 
 code?

I just tried to run dub test --build=profile on dterrent, but I 
get some obscure error message. Not working. Also tried to dmd 
-unittest -profile arraybuilder.d with an empty
main(), not working, I don't get a trace.log. Am I supposed to 
create a main() test program to have access to profiling, that is 
to say there is no possible profiling of unittest?

 Are you compiling with -inline -O?

 -Steve

-inline does not work (segfault) but optimize does not change 
anything except for stdLength (10ms) and ... ArrayBuilder, which 
gain even more performance (90ms).

Conclusion, ArrayBuilder is still much more efficient, 15 years 
after this code was written :) It behaves closer to a regular 
array, maybe because of space reallocation strategy?

Any comment appreciated

Apr 09 2021

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Asking about performance of std concatenation vs. Appender vs. custom