digitalmars.D - Sometimes 100 lines of code can fix your ctfe perf
- Stefan Koch (177/177) Aug 24 2021 Good Morning,
- Bienlein (4/4) Aug 24 2021 This is nice. Would be good if your StringBuilder class makes it
- Stefan Koch (158/164) Aug 24 2021 I should have tested the code more thoroughly!
- =?UTF-8?Q?Ali_=c3=87ehreli?= (14/17) Aug 24 2021 I know it's not the same use case but I've standardized on the following...
- Adam D Ruppe (3/7) Aug 24 2021 yikes that's like the worst thing in the whole D world. brutally
- Alexandru Ermicioi (7/15) Aug 24 2021 Best here from use point of view is text method for simple use
- Paul Backus (19/33) Aug 24 2021 Looks like it'd be worse. Internally, `format` uses
- =?UTF-8?Q?Ali_=c3=87ehreli?= (42/50) Aug 24 2021 [...]
- =?UTF-8?Q?Ali_=c3=87ehreli?= (18/19) Aug 25 2021 differ per __LINE__.
- Stefan Koch (5/24) Aug 26 2021 The linked `__COUNTER__` (10131) is actually fine.
- user1234 (5/23) Aug 26 2021 Yes, dmd uses something similar to generate unittests names.
- Stefan Koch (8/28) Aug 24 2021 The compile-time performance of std.format is bad.
- Per =?UTF-8?B?Tm9yZGzDtnc=?= (10/11) Aug 24 2021 Would it be motivated to have the compiler deduce that `result ~=
- Stefan Koch (7/18) Aug 24 2021 Of course you could try to recognize certain patterns and treat
Good Morning, I just saw the need to concatenate a long-ish string at ctfe. TLDR; writing your own specialized chunking appender is way faster than naive CTFE using the phobos provided appender seems to be a bad idea ;) at least at CTFE. As you may know when you try to so this the naive way: ```d enum x = () { string result; foreach(_; 0 ..n_iterations) result ~= "whatever" return result; } () ``` it scales quite poorly as `n_iterations` grows. and eventually you'll run out of memory and time. it turns out ctfe really doesn't like appending because it doesn't preallocate buffers or keep capacity around. So what if we made our own string-builder that uses whole chunks to do the appending in? in about 70 lines of code we can build our own string build that appends strings. ```d module stringBuilder; struct StringBuilder { enum CHUNK_SIZE = 4096; char[CHUNK_SIZE] fixedBuffer; size_t fixedBufferUsed; char[CHUNK_SIZE][] chunks; size_t[] chunk_lengths; ushort last_chunk_used; void appendString(const(char)[] s) { // can we used our preallocated fixed buffer? if (!last_chunk_used) { auto fixedBufferEnd = fixedBufferUsed + s.length; if (fixedBufferEnd < CHUNK_SIZE) { fixedBuffer[fixedBufferUsed .. fixedBufferEnd] = s; fixedBufferUsed = fixedBufferEnd; } } else { LappendToLastChunk: auto chunkBegin = chunk_lengths[last_chunk_used - 1]; auto chunkEnd = chunkBegin + s.length; if (chunkEnd < CHUNK_SIZE) { chunks[last_chunk_used - 1][chunkBegin .. chunkEnd] = s; chunk_lengths[last_chunk_used - 1] = chunkEnd; } else if (s.length <= CHUNK_SIZE) { chunks.length = last_chunk_used + 1; last_chunk_used += 1; goto LappendToLastChunk; } else { assert(0, "string to be appended is longer than " ~ CHUNK_SIZE.stringof ~ " this is not supported currently"); } } } string releaseBuffer() { size_t result_length = fixedBufferUsed; foreach (i; 0 .. last_chunk_used - 1) { result_length += chunk_lengths[i]; } char[] result; result.length = result_length; result[0 .. fixedBufferUsed] = fixedBuffer[0 .. fixedBufferUsed]; size_t result_begin_pos = fixedBufferUsed; foreach (i; 0 .. last_chunk_used - 1) { const chunk_used = chunk_lengths[i]; result[result_begin_pos .. result_begin_pos + chunk_used] = chunks[i][0 .. chunk_used]; result_begin_pos += chunk_used; } return cast(string) result; } } ``` that wasn't too bad was it? But will such a simple piece of code really be able to speed us up? lets do a quick perf test. ```d import stringBuilder; static if (is(typeof(import ("N_ITERATIONS")))) { enum n_iterations = mixin(import("N_ITERATIONS")); } else { enum n_iterations = 4096; } pragma(msg, "N: ", n_iterations); enum big_string = () { version (StringBuilder) { StringBuilder b; } else version (Appender) { import std.array : Appender; Appender!(string) result; } else { string result; } foreach(_; 0 ..n_iterations) { version(StringBuilder) { b.appendString("Let us see ... how meta can you go?\n"); } else version (Appender) { result ~= ("Let us see ... how meta can you go?\n"); } else { result ~= ("Let us see ... how meta can you go?\n"); } } version (StringBuilder) { return b.releaseBuffer(); } else version (Appender) { return result.opSlice(); } else return result; } (); pragma(msg, big_string); ``` And here are the results for various values of N_ITERATIONS: (times are in milliseconds and are the time) `dmd -c takes` ``` | N | stringBuilder | naive | Appender | |-------|-----------------|---------|-----------------------------| | 32 | 28.1 | 28 | 53 | | 128 | 29.0 | 27.3 | 58 | | 512 | 29.4 | 29.7 | 120.4 | | 1024 | 30.6 | 33.7 | 304.4 | | 2048 | 31.8 | 51.2 | 1055 | | 4096 | 35.1 | 120.8 | 4338 | | 8192 | 41.0 | 368.1 | 18146 | | 16384 | 52.5 | 1353.1 | inf (it ran out of memeory) | ```
Aug 24 2021
This is nice. Would be good if your StringBuilder class makes it into Phobos somehow as it is of general use :-). Java's StringBuilder class was first added to JDK1.5 which was released in 2004 ...
Aug 24 2021
On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:Good Morning, I just saw the need to concatenate a long-ish string at ctfe. TLDR; writing your own specialized chunking appender is way faster than naive CTFE using the phobos provided appender seems to be a bad idea ;) at least at CTFE.I should have tested the code more thoroughly! The reason why the string builder had so a flat performance curve was A BUG. It just did not concatenate anything to the buffer after the initial was full. This issue is now fixed and had the pleasant side effect of removing the `goto`s from the code the fixed version looks like this: ```d struct StringBuilder { enum CHUNK_SIZE = 4096; char[CHUNK_SIZE] fixedBuffer; size_t fixedBufferUsed; char[CHUNK_SIZE][] chunks; size_t[] chunk_lengths; ushort last_chunk_used; void addToNewChunk(const(char)[] s) { if (s.length <= CHUNK_SIZE) { chunks.length = last_chunk_used + 1; chunk_lengths.length = last_chunk_used + 1; last_chunk_used += 1; appendToLastChunk(s); } else { assert(0, "string to be appended is longer than " ~ CHUNK_SIZE.stringof ~ " this is not supported currently"); } } void appendToLastChunk(const(char)[] s) { auto chunkBegin = chunk_lengths[last_chunk_used - 1]; auto chunkEnd = chunkBegin + s.length; if (chunkEnd < CHUNK_SIZE) { chunks[last_chunk_used - 1][chunkBegin .. chunkEnd] = s; chunk_lengths[last_chunk_used - 1] = chunkEnd; } else addToNewChunk(s); } void appendString(const(char)[] s) { // can we used our preallocated fixed buffer? if (!last_chunk_used) { auto fixedBufferEnd = fixedBufferUsed + s.length; if (fixedBufferEnd < CHUNK_SIZE) { fixedBuffer[fixedBufferUsed .. fixedBufferEnd] = s; fixedBufferUsed = fixedBufferEnd; } else { addToNewChunk(s); } } else appendToLastChunk(s); } string releaseBuffer() { size_t result_length = fixedBufferUsed; foreach (i; 0 .. last_chunk_used) { result_length += chunk_lengths[i]; } char[] result; result.length = result_length; result[0 .. fixedBufferUsed] = fixedBuffer[0 .. fixedBufferUsed]; size_t result_begin_pos = fixedBufferUsed; foreach (i; 0 .. last_chunk_used - 1) { const chunk_used = chunk_lengths[i]; result[result_begin_pos .. result_begin_pos + chunk_used] = chunks[i][0 .. chunk_used]; result_begin_pos += chunk_used; } return cast(string) result; } } ``` And now the resulting table is somewhat different as well: ``` | N | stringBuilder | naive | Appender | |-------|-----------------|---------|-----------------------------| | 32 | 27.8 | 27.8 | 53 | | 128 | 29.8 | 27.2 | 54.1 | | 512 | 33.7 | 29.8 | 121 | | 1024 | 39.8 | 34.4 | 313.9 | | 2048 | 52.0 | 52.9 | 1047 | | 4096 | 77.0 | 123.9 | 4350 | | 8192 | 127.3 | 378.2 | 18146 | | 10000 | 151.1 | 548.6 | 26796 | | 16384 | 233.7 | 1386 | inf (it ran out of memory) | ``` I inserted another row which is the N the appender can barely do before I run out of memory. or memeory which is special memory to be used in memes. We can see now the string build behaves linear rather than almost constant which is much more what I would have expected. Now that were results for CTFE; Let's now look at how it looks at regular runtime. We compile with `ldc -O3` we also only care for the table starting at 10000 we also add N == 0 to account for compile time as we are measuring compiletime and runtime together I will write out only the difference from N0 ``` | N | stringBuilder | naive | Appender | |-------|-----------------|---------|----------| | 0 |-147.0 |-100.0 |-228.7 | | 16384 | 12.5 | 6.2 | 3.7 | | 32768 | 23.5 | 7.9 | 5.0 | | 65536 | 19.6 | 7.6 | 5.6 | | 131072| 37.0 | 15.2 | 14.7 | | 262144| 44.4 | 24.3 | 16.3 | | 524288| 108.8 | 30.6 | 21.1 | |1048576| 236.5 | 55.4 | 43.7 | |2097152| 369.4 | 101.1 | 72.3 | |4194304| 823.2 | 198.0 | 150.1 | ``` We can see that the runtime performance of the string builder is quite bad as we scale. and that the naive version is quite good even under extreme stress. The Appender is somewhat faster at runtime than the naive version but not a whole lot. And it takes twice the compile time. So for code that is supposed to be at runtime a quick `~=` in the code is not bad at all. If it's very heavily used the Appender can make a difference, at runtime as well. But it should absolutely be avoided if you are going to use the same code at CTFE. The real lesson though is to be very skeptical of extermely good results and verify behavior as well as performance! :) Cheers, Stefan
Aug 24 2021
On 8/24/21 9:06 AM, Stefan Koch wrote:The reason why the string builder had so a flat performance curve was A BUG.I suspected that as soon as saw your initial post. :) Happens to me too.results for CTFE;I know it's not the same use case but I've standardized on the following style for string formatting both for CTFE and run time: auto s = format!"hello %s world %s"(a, b); Not that I will change my style but I always wonder how it compares to the following horrible one. :) string s; s ~= "hello "; s ~= a; // Let's assume 'a' and 'b' are strings s ~= " world "; s ~= b; Thank you, Ali
Aug 24 2021
On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:I know it's not the same use case but I've standardized on the following style for string formatting both for CTFE and run time: auto s = format!"hello %s world %s"(a, b);yikes that's like the worst thing in the whole D world. brutally slow by all metrics.
Aug 24 2021
On Tuesday, 24 August 2021 at 16:58:26 UTC, Adam D Ruppe wrote:On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:Best here from use point of view is text method for simple use cases: text("Hello ", a, " world ", b); Hope it is better on perf at ctfe than format. Regards, Alexandru.I know it's not the same use case but I've standardized on the following style for string formatting both for CTFE and run time: auto s = format!"hello %s world %s"(a, b);yikes that's like the worst thing in the whole D world. brutally slow by all metrics.
Aug 24 2021
On Tuesday, 24 August 2021 at 19:35:39 UTC, Alexandru Ermicioi wrote:On Tuesday, 24 August 2021 at 16:58:26 UTC, Adam D Ruppe wrote:Looks like it'd be worse. Internally, `format` uses `formattedWrite` with an `Appender` as the output range [1], which means that (in the best case) it can write its result directly into the output buffer without allocating any memory for intermediate results. `text`, on the other hand, converts each of its non-string arguments to a string separately *before* writing them to its output buffer [2], so even in the best case it will require O(arguments) additional allocations compared to `format`. I am honestly kind of surprised that `text` works like this--before looking, I'd have expected both to use the same mechanism under the hood. Maybe it's necessary to avoid some edge-case inconsistency between `value.to!string` and `value.format!"%s"`? [1] https://phobos.dpldocs.info/source/std.format.write.d.html#L681 [2] https://dpldocs.info/experimental-docs/source/std.conv.d.html#L4233On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:Best here from use point of view is text method for simple use cases: text("Hello ", a, " world ", b); Hope it is better on perf at ctfe than format.I know it's not the same use case but I've standardized on the following style for string formatting both for CTFE and run time: auto s = format!"hello %s world %s"(a, b);yikes that's like the worst thing in the whole D world. brutally slow by all metrics.
Aug 24 2021
On 8/24/21 1:14 PM, Paul Backus wrote:On Tuesday, 24 August 2021 at 16:58:26 UTC, Adam D Ruppe wrote:On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali =C3=87ehreli wrote:wauto s =3D format!"hello %s world %s"(a, b);yikes that's like the worst thing in the whole D world. brutally slo=[...]by all metrics.Internally, `format` uses `formattedWrite` with an `Appender` as the output range [1]That's what I thought. Additionally, some of my structs define=20 toString() functions that work with a "sink" so they would be written=20 into the buffer of an appender without extra string creation: https://dlang.org/library/std/format/write.html To do even better, we could write the following function that reuses a=20 static Appender object: auto appendingFormatter(string fmt, string file =3D __FILE__, size_t line= =20 =3D __LINE__, Args...)(Args args) { import std.array : Appender; import std.format : formattedWrite; static Appender!(char[]) buffer; buffer.clear(); buffer.formattedWrite!fmt(args); return buffer.data; } import std.stdio; void main() { size_t total; enum N =3D 100; foreach (i; 0 .. N) { foreach (d; 0 .. N) { total +=3D appendingFormatter!"hello %s world %s"(i, d).length; total +=3D appendingFormatter!"goodbye %s moon %s"(i * 10, d /=20 7).length; } } writeln(appendingFormatter!"An arbitrary value: %s"(total)); const a =3D appendingFormatter!"%s"(42); const b =3D=20 appendingFormatter!"%s"(43); // BUG: Fails. assert((a =3D=3D "42") && (b =3D=3D "43")); } Performance is awesome because there is almost no memory allocation. Unfortunately, there is that bug on the last line of main because there=20 is no __COLUMN__ or __ID__ or __COUNTER__ or etc. that would differ per=20 __LINE__. Or, is there? Please? :) Do you know a trick that would pass that assertion? Ali
Aug 24 2021
On 8/24/21 3:33 PM, Ali =C3=87ehreli wrote:there is no __COLUMN__ or __ID__ or __COUNTER__ or etc. that would=20differ per __LINE__. I found the following two closed PRs that tried to implement C's=20 __COUNTER__ macro: https://github.com/dlang/dmd/pull/10093 https://github.com/dlang/dmd/pull/10131 As mentioned in the second PR, the semantics can be confusing especially = with separate compilation. (I wonder how C works out the semantics.)=20 (And 'static foreach' may be sufficient to handle C's __COUNTER__ for us = in a way that leaves the semantics to the user. (I am not sure about=20 that...)) Actually, for this particular use case, I would be happy with a=20 __COLUMN__ counter that would resolve to the location on the line e.g.=20 exactly where the first '_' character of __COLUMN__ appeared. Combined=20 with __FILE_FULL_PATH__ and __LINE__, this would allow us produce unique = positions. Does that make sense? Ali
Aug 25 2021
On Thursday, 26 August 2021 at 05:41:17 UTC, Ali Çehreli wrote:On 8/24/21 3:33 PM, Ali Çehreli wrote:The linked `__COUNTER__` (10131) is actually fine. I think it does to work inside a static foreach as well. whereas `__COLUMN__` would not give you unique identifiers in a static foreach.there is no __COLUMN__ or __ID__ or __COUNTER__ or etc. thatwould differ per __LINE__. I found the following two closed PRs that tried to implement C's __COUNTER__ macro: https://github.com/dlang/dmd/pull/10093 https://github.com/dlang/dmd/pull/10131 As mentioned in the second PR, the semantics can be confusing especially with separate compilation. (I wonder how C works out the semantics.) (And 'static foreach' may be sufficient to handle C's __COUNTER__ for us in a way that leaves the semantics to the user. (I am not sure about that...)) Actually, for this particular use case, I would be happy with a __COLUMN__ counter that would resolve to the location on the line e.g. exactly where the first '_' character of __COLUMN__ appeared. Combined with __FILE_FULL_PATH__ and __LINE__, this would allow us produce unique positions. Does that make sense? Ali
Aug 26 2021
On Thursday, 26 August 2021 at 05:41:17 UTC, Ali Çehreli wrote:On 8/24/21 3:33 PM, Ali Çehreli wrote:Yes, dmd uses something similar to generate unittests names. __COUNTER__ would have been fine to name symbols because the mangles start with the module name so separate compilation would not have caused multiple definitions errors.there is no __COLUMN__ or __ID__ or __COUNTER__ or etc. thatwould differ per __LINE__. I found the following two closed PRs that tried to implement C's __COUNTER__ macro: https://github.com/dlang/dmd/pull/10093 https://github.com/dlang/dmd/pull/10131 As mentioned in the second PR, the semantics can be confusing especially with separate compilation. (I wonder how C works out the semantics.) (And 'static foreach' may be sufficient to handle C's __COUNTER__ for us in a way that leaves the semantics to the user. (I am not sure about that...)) Actually, for this particular use case, I would be happy with a __COLUMN__ counter that would resolve to the location on the line e.g. exactly where the first '_' character of __COLUMN__ appeared. Combined with __FILE_FULL_PATH__ and __LINE__, this would allow us produce unique positions. Does that make sense?
Aug 26 2021
On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:On 8/24/21 9:06 AM, Stefan Koch wrote:The compile-time performance of std.format is bad. Just importing the module costs you a lot. Instantiating some amount of templates per format string and parameter type-sets, is also not going to make you particularity happy. As for the runtime performance I couldn't comment. I assume it depends on your usecase.The reason why the string builder had so a flat performancecurve was ABUG.I suspected that as soon as saw your initial post. :) Happens to me too.results for CTFE;I know it's not the same use case but I've standardized on the following style for string formatting both for CTFE and run time: auto s = format!"hello %s world %s"(a, b); Not that I will change my style but I always wonder how it compares to the following horrible one. :) string s; s ~= "hello "; s ~= a; // Let's assume 'a' and 'b' are strings s ~= " world "; s ~= b; Thank you, Ali
Aug 24 2021
On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:Good Morning,Would it be motivated to have the compiler deduce that `result ~= "whatever"` can be lowered to a reallocating append because `result` isn't aliased: ```d enum x = () { string result; foreach(_; 0 ..n_iterations) result ~= "whatever" return result; } () ```
Aug 24 2021
On Tuesday, 24 August 2021 at 16:16:50 UTC, Per Nordlöw wrote:On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:Of course you could try to recognize certain patterns and treat them specially, but that complicates the compiler and now your compile time performance is dependent on a bunch of heuristics, which in my experience have a habit of failing in the most inconvenient ways and situations.Good Morning,Would it be motivated to have the compiler deduce that `result ~= "whatever"` can be lowered to a reallocating append because `result` isn't aliased: ```d enum x = () { string result; foreach(_; 0 ..n_iterations) result ~= "whatever" return result; } () ```
Aug 24 2021