www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Sometimes 100 lines of code can fix your ctfe perf

reply Stefan Koch <uplink.coder googlemail.com> writes:
Good Morning,

I just saw the need to concatenate a long-ish string at ctfe.

TLDR; writing your own specialized chunking appender is way 
faster than naive CTFE
using the phobos provided appender seems to be a bad idea ;) at 
least at CTFE.

As you may know when you try to so this the naive way:
```d
enum x = () { string result;
foreach(_; 0 ..n_iterations)
   result ~= "whatever"
return result; } ()
```

it scales quite poorly as `n_iterations` grows.
and eventually you'll run out of memory and time.

it turns out ctfe really doesn't like appending because it 
doesn't preallocate buffers or keep capacity around.

So what if we made our own string-builder that uses whole chunks 
to do the appending in?
in about 70 lines of code we can build our own string build that 
appends strings.

```d
module stringBuilder;
struct StringBuilder
{
     enum CHUNK_SIZE = 4096;
     char[CHUNK_SIZE] fixedBuffer;
     size_t fixedBufferUsed;

     char[CHUNK_SIZE][] chunks;
     size_t[] chunk_lengths;

     ushort last_chunk_used;

     void appendString(const(char)[] s)
     {
         // can we used our preallocated fixed buffer?
         if (!last_chunk_used)
         {
             auto fixedBufferEnd = fixedBufferUsed + s.length;
             if (fixedBufferEnd < CHUNK_SIZE)
             {
                 fixedBuffer[fixedBufferUsed .. fixedBufferEnd] = 
s;
                 fixedBufferUsed = fixedBufferEnd;
             }
         }
         else
         {
         LappendToLastChunk:
             auto chunkBegin = chunk_lengths[last_chunk_used - 1];
             auto chunkEnd = chunkBegin + s.length;
             if (chunkEnd < CHUNK_SIZE)
             {
                 chunks[last_chunk_used - 1][chunkBegin .. 
chunkEnd] = s;
                 chunk_lengths[last_chunk_used - 1] = chunkEnd;
             }
             else if (s.length <= CHUNK_SIZE)
             {
                 chunks.length = last_chunk_used + 1;
                 last_chunk_used += 1;
                 goto LappendToLastChunk;
             }
             else
             {
                 assert(0,
                         "string to be appended is longer than "
                         ~ CHUNK_SIZE.stringof ~ " this is not 
supported currently");
             }
         }
     }

     string releaseBuffer()
     {
         size_t result_length = fixedBufferUsed;
         foreach (i; 0 .. last_chunk_used - 1)
         {
             result_length += chunk_lengths[i];
         }

         char[] result;
         result.length = result_length;
         result[0 .. fixedBufferUsed] = fixedBuffer[0 .. 
fixedBufferUsed];
         size_t result_begin_pos = fixedBufferUsed;

         foreach (i; 0 .. last_chunk_used - 1)
         {
             const chunk_used = chunk_lengths[i];
             result[result_begin_pos .. result_begin_pos + 
chunk_used] = chunks[i][0 .. chunk_used];
             result_begin_pos += chunk_used;
         }
         return cast(string) result;
     }
}
```

that wasn't too bad was it?
But will such a simple piece of code really be able to speed us 
up?

lets do a quick perf test.

```d
import stringBuilder;
static if (is(typeof(import ("N_ITERATIONS"))))
{
     enum n_iterations = mixin(import("N_ITERATIONS"));
}
else
{
     enum n_iterations = 4096;
}

pragma(msg, "N: ", n_iterations);

enum big_string =
() {
     version (StringBuilder)
     {
         StringBuilder b;
     }
     else version (Appender)
     {
         import std.array : Appender;
         Appender!(string) result;
     }
     else
     {
         string result;
     }

     foreach(_; 0 ..n_iterations)
     {
         version(StringBuilder)
         {
             b.appendString("Let us see ... how meta can you 
go?\n");
         }
         else version (Appender)
         {
             result ~=     ("Let us see ... how meta can you 
go?\n");
         }
         else
         {
             result ~=     ("Let us see ... how meta can you 
go?\n");
         }
     }
     version (StringBuilder)
     {
         return b.releaseBuffer();
     }
     else version (Appender)
     {
         return result.opSlice();
     }
     else
         return result;
} ();

pragma(msg, big_string);
```

And here are the results for various values of N_ITERATIONS:
(times are in milliseconds and are the time) `dmd -c takes`
```
|   N   |  stringBuilder  | naive   | Appender                    
|
|-------|-----------------|---------|-----------------------------|
| 32    | 28.1            | 28      | 53                          
|
| 128   | 29.0            | 27.3    | 58                          
|
| 512   | 29.4            | 29.7    | 120.4                       
|
| 1024  | 30.6            | 33.7    | 304.4                       
|
| 2048  | 31.8            | 51.2    | 1055                        
|
| 4096  | 35.1            | 120.8   | 4338                        
|
| 8192  | 41.0            | 368.1   | 18146                       
|
| 16384 | 52.5            | 1353.1  | inf (it ran out of memeory) 
|
```
Aug 24
next sibling parent Bienlein <ffm2002 web.de> writes:
This is nice. Would be good if your StringBuilder class makes it 
into Phobos somehow as it is of general use :-). Java's 
StringBuilder class was first added to JDK1.5 which was released 
in 2004 ...
Aug 24
prev sibling next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:
 Good Morning,

 I just saw the need to concatenate a long-ish string at ctfe.

 TLDR; writing your own specialized chunking appender is way 
 faster than naive CTFE
 using the phobos provided appender seems to be a bad idea ;) at 
 least at CTFE.
I should have tested the code more thoroughly! The reason why the string builder had so a flat performance curve was A BUG. It just did not concatenate anything to the buffer after the initial was full. This issue is now fixed and had the pleasant side effect of removing the `goto`s from the code the fixed version looks like this: ```d struct StringBuilder { enum CHUNK_SIZE = 4096; char[CHUNK_SIZE] fixedBuffer; size_t fixedBufferUsed; char[CHUNK_SIZE][] chunks; size_t[] chunk_lengths; ushort last_chunk_used; void addToNewChunk(const(char)[] s) { if (s.length <= CHUNK_SIZE) { chunks.length = last_chunk_used + 1; chunk_lengths.length = last_chunk_used + 1; last_chunk_used += 1; appendToLastChunk(s); } else { assert(0, "string to be appended is longer than " ~ CHUNK_SIZE.stringof ~ " this is not supported currently"); } } void appendToLastChunk(const(char)[] s) { auto chunkBegin = chunk_lengths[last_chunk_used - 1]; auto chunkEnd = chunkBegin + s.length; if (chunkEnd < CHUNK_SIZE) { chunks[last_chunk_used - 1][chunkBegin .. chunkEnd] = s; chunk_lengths[last_chunk_used - 1] = chunkEnd; } else addToNewChunk(s); } void appendString(const(char)[] s) { // can we used our preallocated fixed buffer? if (!last_chunk_used) { auto fixedBufferEnd = fixedBufferUsed + s.length; if (fixedBufferEnd < CHUNK_SIZE) { fixedBuffer[fixedBufferUsed .. fixedBufferEnd] = s; fixedBufferUsed = fixedBufferEnd; } else { addToNewChunk(s); } } else appendToLastChunk(s); } string releaseBuffer() { size_t result_length = fixedBufferUsed; foreach (i; 0 .. last_chunk_used) { result_length += chunk_lengths[i]; } char[] result; result.length = result_length; result[0 .. fixedBufferUsed] = fixedBuffer[0 .. fixedBufferUsed]; size_t result_begin_pos = fixedBufferUsed; foreach (i; 0 .. last_chunk_used - 1) { const chunk_used = chunk_lengths[i]; result[result_begin_pos .. result_begin_pos + chunk_used] = chunks[i][0 .. chunk_used]; result_begin_pos += chunk_used; } return cast(string) result; } } ``` And now the resulting table is somewhat different as well: ``` | N | stringBuilder | naive | Appender | |-------|-----------------|---------|-----------------------------| | 32 | 27.8 | 27.8 | 53 | | 128 | 29.8 | 27.2 | 54.1 | | 512 | 33.7 | 29.8 | 121 | | 1024 | 39.8 | 34.4 | 313.9 | | 2048 | 52.0 | 52.9 | 1047 | | 4096 | 77.0 | 123.9 | 4350 | | 8192 | 127.3 | 378.2 | 18146 | | 10000 | 151.1 | 548.6 | 26796 | | 16384 | 233.7 | 1386 | inf (it ran out of memory) | ``` I inserted another row which is the N the appender can barely do before I run out of memory. or memeory which is special memory to be used in memes. We can see now the string build behaves linear rather than almost constant which is much more what I would have expected. Now that were results for CTFE; Let's now look at how it looks at regular runtime. We compile with `ldc -O3` we also only care for the table starting at 10000 we also add N == 0 to account for compile time as we are measuring compiletime and runtime together I will write out only the difference from N0 ``` | N | stringBuilder | naive | Appender | |-------|-----------------|---------|----------| | 0 |-147.0 |-100.0 |-228.7 | | 16384 | 12.5 | 6.2 | 3.7 | | 32768 | 23.5 | 7.9 | 5.0 | | 65536 | 19.6 | 7.6 | 5.6 | | 131072| 37.0 | 15.2 | 14.7 | | 262144| 44.4 | 24.3 | 16.3 | | 524288| 108.8 | 30.6 | 21.1 | |1048576| 236.5 | 55.4 | 43.7 | |2097152| 369.4 | 101.1 | 72.3 | |4194304| 823.2 | 198.0 | 150.1 | ``` We can see that the runtime performance of the string builder is quite bad as we scale. and that the naive version is quite good even under extreme stress. The Appender is somewhat faster at runtime than the naive version but not a whole lot. And it takes twice the compile time. So for code that is supposed to be at runtime a quick `~=` in the code is not bad at all. If it's very heavily used the Appender can make a difference, at runtime as well. But it should absolutely be avoided if you are going to use the same code at CTFE. The real lesson though is to be very skeptical of extermely good results and verify behavior as well as performance! :) Cheers, Stefan
Aug 24
parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 8/24/21 9:06 AM, Stefan Koch wrote:

 The reason why the string builder had so a flat performance curve was A
 BUG.
I suspected that as soon as saw your initial post. :) Happens to me too.
 results for CTFE;
I know it's not the same use case but I've standardized on the following style for string formatting both for CTFE and run time: auto s = format!"hello %s world %s"(a, b); Not that I will change my style but I always wonder how it compares to the following horrible one. :) string s; s ~= "hello "; s ~= a; // Let's assume 'a' and 'b' are strings s ~= " world "; s ~= b; Thank you, Ali
Aug 24
next sibling parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:
 I know it's not the same use case but I've standardized on the 
 following style for string formatting both for CTFE and run 
 time:

   auto s = format!"hello %s world %s"(a, b);
yikes that's like the worst thing in the whole D world. brutally slow by all metrics.
Aug 24
parent reply Alexandru Ermicioi <alexandru.ermicioi gmail.com> writes:
On Tuesday, 24 August 2021 at 16:58:26 UTC, Adam D Ruppe wrote:
 On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:
 I know it's not the same use case but I've standardized on the 
 following style for string formatting both for CTFE and run 
 time:

   auto s = format!"hello %s world %s"(a, b);
yikes that's like the worst thing in the whole D world. brutally slow by all metrics.
Best here from use point of view is text method for simple use cases: text("Hello ", a, " world ", b); Hope it is better on perf at ctfe than format. Regards, Alexandru.
Aug 24
parent reply Paul Backus <snarwin gmail.com> writes:
On Tuesday, 24 August 2021 at 19:35:39 UTC, Alexandru Ermicioi 
wrote:
 On Tuesday, 24 August 2021 at 16:58:26 UTC, Adam D Ruppe wrote:
 On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:
 I know it's not the same use case but I've standardized on 
 the following style for string formatting both for CTFE and 
 run time:

     auto s = format!"hello %s world %s"(a, b);
yikes that's like the worst thing in the whole D world. brutally slow by all metrics.
Best here from use point of view is text method for simple use cases: text("Hello ", a, " world ", b); Hope it is better on perf at ctfe than format.
Looks like it'd be worse. Internally, `format` uses `formattedWrite` with an `Appender` as the output range [1], which means that (in the best case) it can write its result directly into the output buffer without allocating any memory for intermediate results. `text`, on the other hand, converts each of its non-string arguments to a string separately *before* writing them to its output buffer [2], so even in the best case it will require O(arguments) additional allocations compared to `format`. I am honestly kind of surprised that `text` works like this--before looking, I'd have expected both to use the same mechanism under the hood. Maybe it's necessary to avoid some edge-case inconsistency between `value.to!string` and `value.format!"%s"`? [1] https://phobos.dpldocs.info/source/std.format.write.d.html#L681 [2] https://dpldocs.info/experimental-docs/source/std.conv.d.html#L4233
Aug 24
parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 8/24/21 1:14 PM, Paul Backus wrote:

 On Tuesday, 24 August 2021 at 16:58:26 UTC, Adam D Ruppe wrote:
 On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali =C3=87ehreli wrote:
     auto s =3D format!"hello %s world %s"(a, b);
yikes that's like the worst thing in the whole D world. brutally slo=
w
 by all metrics.
[...]
 Internally, `format` uses `formattedWrite`
 with an `Appender` as the output range [1]
That's what I thought. Additionally, some of my structs define=20 toString() functions that work with a "sink" so they would be written=20 into the buffer of an appender without extra string creation: https://dlang.org/library/std/format/write.html To do even better, we could write the following function that reuses a=20 static Appender object: auto appendingFormatter(string fmt, string file =3D __FILE__, size_t line= =20 =3D __LINE__, Args...)(Args args) { import std.array : Appender; import std.format : formattedWrite; static Appender!(char[]) buffer; buffer.clear(); buffer.formattedWrite!fmt(args); return buffer.data; } import std.stdio; void main() { size_t total; enum N =3D 100; foreach (i; 0 .. N) { foreach (d; 0 .. N) { total +=3D appendingFormatter!"hello %s world %s"(i, d).length; total +=3D appendingFormatter!"goodbye %s moon %s"(i * 10, d /=20 7).length; } } writeln(appendingFormatter!"An arbitrary value: %s"(total)); const a =3D appendingFormatter!"%s"(42); const b =3D=20 appendingFormatter!"%s"(43); // BUG: Fails. assert((a =3D=3D "42") && (b =3D=3D "43")); } Performance is awesome because there is almost no memory allocation. Unfortunately, there is that bug on the last line of main because there=20 is no __COLUMN__ or __ID__ or __COUNTER__ or etc. that would differ per=20 __LINE__. Or, is there? Please? :) Do you know a trick that would pass that assertion? Ali
Aug 24
parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 8/24/21 3:33 PM, Ali =C3=87ehreli wrote:

 there is no __COLUMN__ or __ID__ or __COUNTER__ or etc. that would=20
differ per __LINE__. I found the following two closed PRs that tried to implement C's=20 __COUNTER__ macro: https://github.com/dlang/dmd/pull/10093 https://github.com/dlang/dmd/pull/10131 As mentioned in the second PR, the semantics can be confusing especially = with separate compilation. (I wonder how C works out the semantics.)=20 (And 'static foreach' may be sufficient to handle C's __COUNTER__ for us = in a way that leaves the semantics to the user. (I am not sure about=20 that...)) Actually, for this particular use case, I would be happy with a=20 __COLUMN__ counter that would resolve to the location on the line e.g.=20 exactly where the first '_' character of __COLUMN__ appeared. Combined=20 with __FILE_FULL_PATH__ and __LINE__, this would allow us produce unique = positions. Does that make sense? Ali
Aug 25
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Thursday, 26 August 2021 at 05:41:17 UTC, Ali Çehreli wrote:
 On 8/24/21 3:33 PM, Ali Çehreli wrote:

 there is no __COLUMN__ or __ID__ or __COUNTER__ or etc. that
would differ per __LINE__. I found the following two closed PRs that tried to implement C's __COUNTER__ macro: https://github.com/dlang/dmd/pull/10093 https://github.com/dlang/dmd/pull/10131 As mentioned in the second PR, the semantics can be confusing especially with separate compilation. (I wonder how C works out the semantics.) (And 'static foreach' may be sufficient to handle C's __COUNTER__ for us in a way that leaves the semantics to the user. (I am not sure about that...)) Actually, for this particular use case, I would be happy with a __COLUMN__ counter that would resolve to the location on the line e.g. exactly where the first '_' character of __COLUMN__ appeared. Combined with __FILE_FULL_PATH__ and __LINE__, this would allow us produce unique positions. Does that make sense? Ali
The linked `__COUNTER__` (10131) is actually fine. I think it does to work inside a static foreach as well. whereas `__COLUMN__` would not give you unique identifiers in a static foreach.
Aug 26
prev sibling parent user1234 <user1234 12.de> writes:
On Thursday, 26 August 2021 at 05:41:17 UTC, Ali Çehreli wrote:
 On 8/24/21 3:33 PM, Ali Çehreli wrote:

 there is no __COLUMN__ or __ID__ or __COUNTER__ or etc. that
would differ per __LINE__. I found the following two closed PRs that tried to implement C's __COUNTER__ macro: https://github.com/dlang/dmd/pull/10093 https://github.com/dlang/dmd/pull/10131 As mentioned in the second PR, the semantics can be confusing especially with separate compilation. (I wonder how C works out the semantics.) (And 'static foreach' may be sufficient to handle C's __COUNTER__ for us in a way that leaves the semantics to the user. (I am not sure about that...)) Actually, for this particular use case, I would be happy with a __COLUMN__ counter that would resolve to the location on the line e.g. exactly where the first '_' character of __COLUMN__ appeared. Combined with __FILE_FULL_PATH__ and __LINE__, this would allow us produce unique positions. Does that make sense?
Yes, dmd uses something similar to generate unittests names. __COUNTER__ would have been fine to name symbols because the mangles start with the module name so separate compilation would not have caused multiple definitions errors.
Aug 26
prev sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:
 On 8/24/21 9:06 AM, Stefan Koch wrote:

 The reason why the string builder had so a flat performance
curve was A
 BUG.
I suspected that as soon as saw your initial post. :) Happens to me too.
 results for CTFE;
I know it's not the same use case but I've standardized on the following style for string formatting both for CTFE and run time: auto s = format!"hello %s world %s"(a, b); Not that I will change my style but I always wonder how it compares to the following horrible one. :) string s; s ~= "hello "; s ~= a; // Let's assume 'a' and 'b' are strings s ~= " world "; s ~= b; Thank you, Ali
The compile-time performance of std.format is bad. Just importing the module costs you a lot. Instantiating some amount of templates per format string and parameter type-sets, is also not going to make you particularity happy. As for the runtime performance I couldn't comment. I assume it depends on your usecase.
Aug 24
prev sibling parent reply Per =?UTF-8?B?Tm9yZGzDtnc=?= <per.nordlow gmail.com> writes:
On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:
 Good Morning,
Would it be motivated to have the compiler deduce that `result ~= "whatever"` can be lowered to a reallocating append because `result` isn't aliased: ```d enum x = () { string result; foreach(_; 0 ..n_iterations) result ~= "whatever" return result; } () ```
Aug 24
parent Stefan Koch <uplink.coder googlemail.com> writes:
On Tuesday, 24 August 2021 at 16:16:50 UTC, Per Nordlöw wrote:
 On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:
 Good Morning,
Would it be motivated to have the compiler deduce that `result ~= "whatever"` can be lowered to a reallocating append because `result` isn't aliased: ```d enum x = () { string result; foreach(_; 0 ..n_iterations) result ~= "whatever" return result; } () ```
Of course you could try to recognize certain patterns and treat them specially, but that complicates the compiler and now your compile time performance is dependent on a bunch of heuristics, which in my experience have a habit of failing in the most inconvenient ways and situations.
Aug 24