digitalmars.D - Sometimes 100 lines of code can fix your ctfe perf

Stefan Koch (177/177) Aug 24 2021 Good Morning,

Bienlein (4/4) Aug 24 2021 This is nice. Would be good if your StringBuilder class makes it
Stefan Koch (158/164) Aug 24 2021 I should have tested the code more thoroughly!

=?UTF-8?Q?Ali_=c3=87ehreli?= (14/17) Aug 24 2021 I know it's not the same use case but I've standardized on the following...

Adam D Ruppe (3/7) Aug 24 2021 yikes that's like the worst thing in the whole D world. brutally

Alexandru Ermicioi (7/15) Aug 24 2021 Best here from use point of view is text method for simple use

Paul Backus (19/33) Aug 24 2021 Looks like it'd be worse. Internally, `format` uses

=?UTF-8?Q?Ali_=c3=87ehreli?= (42/50) Aug 24 2021 [...]

=?UTF-8?Q?Ali_=c3=87ehreli?= (18/19) Aug 25 2021 differ per __LINE__.

Stefan Koch (5/24) Aug 26 2021 The linked `__COUNTER__` (10131) is actually fine.
user1234 (5/23) Aug 26 2021 Yes, dmd uses something similar to generate unittests names.

Stefan Koch (8/28) Aug 24 2021 The compile-time performance of std.format is bad.

Per =?UTF-8?B?Tm9yZGzDtnc=?= (10/11) Aug 24 2021 Would it be motivated to have the compiler deduce that `result ~=

Stefan Koch (7/18) Aug 24 2021 Of course you could try to recognize certain patterns and treat

Stefan Koch <uplink.coder googlemail.com> writes:

Good Morning,

I just saw the need to concatenate a long-ish string at ctfe.

TLDR; writing your own specialized chunking appender is way 
faster than naive CTFE
using the phobos provided appender seems to be a bad idea ;) at 
least at CTFE.

As you may know when you try to so this the naive way:
```d
enum x = () { string result;
foreach(_; 0 ..n_iterations)
   result ~= "whatever"
return result; } ()
```

it scales quite poorly as `n_iterations` grows.
and eventually you'll run out of memory and time.

it turns out ctfe really doesn't like appending because it 
doesn't preallocate buffers or keep capacity around.

So what if we made our own string-builder that uses whole chunks 
to do the appending in?
in about 70 lines of code we can build our own string build that 
appends strings.

```d
module stringBuilder;
struct StringBuilder
{
     enum CHUNK_SIZE = 4096;
     char[CHUNK_SIZE] fixedBuffer;
     size_t fixedBufferUsed;

     char[CHUNK_SIZE][] chunks;
     size_t[] chunk_lengths;

     ushort last_chunk_used;

     void appendString(const(char)[] s)
     {
         // can we used our preallocated fixed buffer?
         if (!last_chunk_used)
         {
             auto fixedBufferEnd = fixedBufferUsed + s.length;
             if (fixedBufferEnd < CHUNK_SIZE)
             {
                 fixedBuffer[fixedBufferUsed .. fixedBufferEnd] = 
s;
                 fixedBufferUsed = fixedBufferEnd;
             }
         }
         else
         {
         LappendToLastChunk:
             auto chunkBegin = chunk_lengths[last_chunk_used - 1];
             auto chunkEnd = chunkBegin + s.length;
             if (chunkEnd < CHUNK_SIZE)
             {
                 chunks[last_chunk_used - 1][chunkBegin .. 
chunkEnd] = s;
                 chunk_lengths[last_chunk_used - 1] = chunkEnd;
             }
             else if (s.length <= CHUNK_SIZE)
             {
                 chunks.length = last_chunk_used + 1;
                 last_chunk_used += 1;
                 goto LappendToLastChunk;
             }
             else
             {
                 assert(0,
                         "string to be appended is longer than "
                         ~ CHUNK_SIZE.stringof ~ " this is not 
supported currently");
             }
         }
     }

     string releaseBuffer()
     {
         size_t result_length = fixedBufferUsed;
         foreach (i; 0 .. last_chunk_used - 1)
         {
             result_length += chunk_lengths[i];
         }

         char[] result;
         result.length = result_length;
         result[0 .. fixedBufferUsed] = fixedBuffer[0 .. 
fixedBufferUsed];
         size_t result_begin_pos = fixedBufferUsed;

         foreach (i; 0 .. last_chunk_used - 1)
         {
             const chunk_used = chunk_lengths[i];
             result[result_begin_pos .. result_begin_pos + 
chunk_used] = chunks[i][0 .. chunk_used];
             result_begin_pos += chunk_used;
         }
         return cast(string) result;
     }
}
```

that wasn't too bad was it?
But will such a simple piece of code really be able to speed us 
up?

lets do a quick perf test.

```d
import stringBuilder;
static if (is(typeof(import ("N_ITERATIONS"))))
{
     enum n_iterations = mixin(import("N_ITERATIONS"));
}
else
{
     enum n_iterations = 4096;
}

pragma(msg, "N: ", n_iterations);

enum big_string =
() {
     version (StringBuilder)
     {
         StringBuilder b;
     }
     else version (Appender)
     {
         import std.array : Appender;
         Appender!(string) result;
     }
     else
     {
         string result;
     }

     foreach(_; 0 ..n_iterations)
     {
         version(StringBuilder)
         {
             b.appendString("Let us see ... how meta can you 
go?\n");
         }
         else version (Appender)
         {
             result ~=     ("Let us see ... how meta can you 
go?\n");
         }
         else
         {
             result ~=     ("Let us see ... how meta can you 
go?\n");
         }
     }
     version (StringBuilder)
     {
         return b.releaseBuffer();
     }
     else version (Appender)
     {
         return result.opSlice();
     }
     else
         return result;
} ();

pragma(msg, big_string);
```

And here are the results for various values of N_ITERATIONS:
(times are in milliseconds and are the time) `dmd -c takes`
```
|   N   |  stringBuilder  | naive   | Appender                    
|
|-------|-----------------|---------|-----------------------------|
| 32    | 28.1            | 28      | 53                          
|
| 128   | 29.0            | 27.3    | 58                          
|
| 512   | 29.4            | 29.7    | 120.4                       
|
| 1024  | 30.6            | 33.7    | 304.4                       
|
| 2048  | 31.8            | 51.2    | 1055                        
|
| 4096  | 35.1            | 120.8   | 4338                        
|
| 8192  | 41.0            | 368.1   | 18146                       
|
| 16384 | 52.5            | 1353.1  | inf (it ran out of memeory) 
|
```

Aug 24 2021

Bienlein <ffm2002 web.de> writes:

This is nice. Would be good if your StringBuilder class makes it 
into Phobos somehow as it is of general use :-). Java's 
StringBuilder class was first added to JDK1.5 which was released 
in 2004 ...

Aug 24 2021

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:
 Good Morning,

 I just saw the need to concatenate a long-ish string at ctfe.

 TLDR; writing your own specialized chunking appender is way 
 faster than naive CTFE
 using the phobos provided appender seems to be a bad idea ;) at 
 least at CTFE.

I should have tested the code more thoroughly!
The reason why the string builder had so a flat performance curve 
was A BUG.
It just did not concatenate anything to the buffer after the 
initial was full.

This issue is now fixed and had the pleasant side effect of 
removing the `goto`s from the code

the fixed version looks like this:
```d
struct StringBuilder
{
     enum CHUNK_SIZE = 4096;
     char[CHUNK_SIZE] fixedBuffer;
     size_t fixedBufferUsed;

     char[CHUNK_SIZE][] chunks;
     size_t[] chunk_lengths;

     ushort last_chunk_used;

     void addToNewChunk(const(char)[] s)
     {
             if (s.length <= CHUNK_SIZE)
             {
                 chunks.length = last_chunk_used + 1;
                 chunk_lengths.length = last_chunk_used + 1;
                 last_chunk_used += 1;
                 appendToLastChunk(s);
             }
             else
             {
                 assert(0,
                         "string to be appended is longer than "
                         ~ CHUNK_SIZE.stringof ~ " this is not 
supported currently");
             }
     }

     void appendToLastChunk(const(char)[] s)
     {
             auto chunkBegin = chunk_lengths[last_chunk_used - 1];
             auto chunkEnd = chunkBegin + s.length;
             if (chunkEnd < CHUNK_SIZE)
             {
                 chunks[last_chunk_used - 1][chunkBegin .. 
chunkEnd] = s;
                 chunk_lengths[last_chunk_used - 1] = chunkEnd;
             }
             else
                 addToNewChunk(s);

     }

     void appendString(const(char)[] s)
     {

         // can we used our preallocated fixed buffer?
         if (!last_chunk_used)
         {
             auto fixedBufferEnd = fixedBufferUsed + s.length;
             if (fixedBufferEnd < CHUNK_SIZE)
             {
                 fixedBuffer[fixedBufferUsed .. fixedBufferEnd] = 
s;
                 fixedBufferUsed = fixedBufferEnd;
             }
             else
             {
                 addToNewChunk(s);
             }
         }
         else
             appendToLastChunk(s);
     }

     string releaseBuffer()
     {
         size_t result_length = fixedBufferUsed;
         foreach (i; 0 .. last_chunk_used)
         {
             result_length += chunk_lengths[i];
         }

         char[] result;
         result.length = result_length;
         result[0 .. fixedBufferUsed] = fixedBuffer[0 .. 
fixedBufferUsed];
         size_t result_begin_pos = fixedBufferUsed;

         foreach (i; 0 .. last_chunk_used - 1)
         {
             const chunk_used = chunk_lengths[i];
             result[result_begin_pos .. result_begin_pos + 
chunk_used] = chunks[i][0 .. chunk_used];
             result_begin_pos += chunk_used;
         }

         return cast(string) result;
     }

}
```

And now the resulting table is somewhat different as well:

```
|   N   |  stringBuilder  | naive   | Appender                    
|
|-------|-----------------|---------|-----------------------------|
| 32    | 27.8            | 27.8    | 53                          
|
| 128   | 29.8            | 27.2    | 54.1                        
|
| 512   | 33.7            | 29.8    | 121                         
|
| 1024  | 39.8            | 34.4    | 313.9                       
|
| 2048  | 52.0            | 52.9    | 1047                        
|
| 4096  | 77.0            | 123.9   | 4350                        
|
| 8192  | 127.3           | 378.2   | 18146                       
|
| 10000 | 151.1           | 548.6   | 26796                       
|
| 16384 | 233.7           | 1386    | inf (it ran out of memory)  
|
```

I inserted another row which is the N the appender can barely do 
before I run out of memory.
or memeory which is special memory to be used in memes.

We can see now the string build behaves linear rather than almost 
constant which is much more what I would have expected.

Now that were results for CTFE;
Let's now look at how it looks at regular runtime.
We compile with `ldc -O3` we also only care for the table 
starting at 10000
we also add N == 0 to account for compile time as we are 
measuring compiletime and runtime together
I will write out only the difference from N0

```
|   N   |  stringBuilder  | naive   | Appender |
|-------|-----------------|---------|----------|
| 0     |-147.0           |-100.0   |-228.7    |
| 16384 | 12.5            | 6.2     | 3.7      |
| 32768 | 23.5            | 7.9     | 5.0      |
| 65536 | 19.6            | 7.6     | 5.6      |
| 131072| 37.0            | 15.2    | 14.7     |
| 262144| 44.4            | 24.3    | 16.3     |
| 524288| 108.8           | 30.6    | 21.1     |
|1048576| 236.5           | 55.4    | 43.7     |
|2097152| 369.4           | 101.1   | 72.3     |
|4194304| 823.2           | 198.0   | 150.1    |
```

We can see that the runtime performance of the string builder is 
quite bad as we scale.
and that the naive version is quite good even under extreme 
stress.
The Appender is somewhat faster at runtime than the naive version 
but not a whole lot.
And it takes twice the compile time.

So for code that is supposed to be at runtime a quick `~=` in the 
code is not bad at all.
If it's very heavily used the Appender can make a difference,  at 
runtime as well.
But it should absolutely be avoided if you are going to use the 
same code at CTFE.

The real lesson though is to be very skeptical of extermely good 
results and verify behavior as well as performance! :)

Cheers,

Stefan

Aug 24 2021

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 8/24/21 9:06 AM, Stefan Koch wrote:

 The reason why the string builder had so a flat performance curve was A
 BUG.

I suspected that as soon as saw your initial post. :) Happens to me too.

 results for CTFE;

I know it's not the same use case but I've standardized on the following 
style for string formatting both for CTFE and run time:

   auto s = format!"hello %s world %s"(a, b);

Not that I will change my style but I always wonder how it compares to 
the following horrible one. :)

   string s;
   s ~= "hello ";
   s ~= a;        // Let's assume 'a' and 'b' are strings
   s ~= " world ";
   s ~= b;

Thank you,
Ali

Aug 24 2021

Adam D Ruppe <destructionator gmail.com> writes:

On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:
 I know it's not the same use case but I've standardized on the 
 following style for string formatting both for CTFE and run 
 time:

   auto s = format!"hello %s world %s"(a, b);

yikes that's like the worst thing in the whole D world. brutally 
slow by all metrics.

Aug 24 2021

Alexandru Ermicioi <alexandru.ermicioi gmail.com> writes:

On Tuesday, 24 August 2021 at 16:58:26 UTC, Adam D Ruppe wrote:
 On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:
 I know it's not the same use case but I've standardized on the 
 following style for string formatting both for CTFE and run 
 time:

   auto s = format!"hello %s world %s"(a, b);

 yikes that's like the worst thing in the whole D world. 
 brutally slow by all metrics.

Best here from use point of view is text method for simple use 
cases:

text("Hello ", a, " world ", b);

Hope it is better on perf at ctfe than format.

Regards,
Alexandru.

Aug 24 2021

Paul Backus <snarwin gmail.com> writes:

On Tuesday, 24 August 2021 at 19:35:39 UTC, Alexandru Ermicioi 
wrote:
 On Tuesday, 24 August 2021 at 16:58:26 UTC, Adam D Ruppe wrote:
 On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:
 I know it's not the same use case but I've standardized on 
 the following style for string formatting both for CTFE and 
 run time:

     auto s = format!"hello %s world %s"(a, b);

 yikes that's like the worst thing in the whole D world. 
 brutally slow by all metrics.

 Best here from use point of view is text method for simple use 
 cases:

     text("Hello ", a, " world ", b);

 Hope it is better on perf at ctfe than format.

Looks like it'd be worse. Internally, `format` uses 
`formattedWrite` with an `Appender` as the output range [1], 
which means that (in the best case) it can write its result 
directly into the output buffer without allocating any memory for 
intermediate results. `text`, on the other hand, converts each of 
its non-string arguments to a string separately *before* writing 
them to its output buffer [2], so even in the best case it will 
require O(arguments) additional allocations compared to `format`.

I am honestly kind of surprised that `text` works like 
this--before looking, I'd have expected both to use the same 
mechanism under the hood. Maybe it's necessary to avoid some 
edge-case inconsistency between `value.to!string` and 
`value.format!"%s"`?

[1] 
https://phobos.dpldocs.info/source/std.format.write.d.html#L681
[2] 
https://dpldocs.info/experimental-docs/source/std.conv.d.html#L4233

Aug 24 2021

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 8/24/21 1:14 PM, Paul Backus wrote:

 On Tuesday, 24 August 2021 at 16:58:26 UTC, Adam D Ruppe wrote:
 On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali =C3=87ehreli wrote:



     auto s =3D format!"hello %s world %s"(a, b);

 yikes that's like the worst thing in the whole D world. brutally slo=



w
 by all metrics.



[...]

 Internally, `format` uses `formattedWrite`
 with an `Appender` as the output range [1]

That's what I thought. Additionally, some of my structs define=20
toString() functions that work with a "sink" so they would be written=20
into the buffer of an appender without extra string creation:

   https://dlang.org/library/std/format/write.html

To do even better, we could write the following function that reuses a=20
static Appender object:

auto appendingFormatter(string fmt, string file =3D __FILE__, size_t line=
=20
=3D __LINE__, Args...)(Args args) {
   import std.array : Appender;
   import std.format : formattedWrite;

   static Appender!(char[]) buffer;
   buffer.clear();
   buffer.formattedWrite!fmt(args);
   return buffer.data;
}

import std.stdio;

void main() {
   size_t total;

   enum N =3D 100;

   foreach (i; 0 .. N) {
     foreach (d; 0 .. N) {
       total +=3D appendingFormatter!"hello %s world %s"(i, d).length;
       total +=3D appendingFormatter!"goodbye %s moon %s"(i * 10, d /=20
7).length;
     }
   }

   writeln(appendingFormatter!"An arbitrary value: %s"(total));

   const a =3D appendingFormatter!"%s"(42); const b =3D=20
appendingFormatter!"%s"(43);
   // BUG: Fails.
   assert((a =3D=3D "42") && (b =3D=3D "43"));
}

Performance is awesome because there is almost no memory allocation.

Unfortunately, there is that bug on the last line of main because there=20
is no __COLUMN__ or __ID__ or __COUNTER__ or etc. that would differ per=20
__LINE__. Or, is there? Please? :)

Do you know a trick that would pass that assertion?

Ali

Aug 24 2021

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 8/24/21 3:33 PM, Ali =C3=87ehreli wrote:

 there is no __COLUMN__ or __ID__ or __COUNTER__ or etc. that would=20

differ per __LINE__.

I found the following two closed PRs that tried to implement C's=20
__COUNTER__ macro:

   https://github.com/dlang/dmd/pull/10093

   https://github.com/dlang/dmd/pull/10131

As mentioned in the second PR, the semantics can be confusing especially =

with separate compilation. (I wonder how C works out the semantics.)=20
(And 'static foreach' may be sufficient to handle C's __COUNTER__ for us =

in a way that leaves the semantics to the user. (I am not sure about=20
that...))

Actually, for this particular use case, I would be happy with a=20
__COLUMN__ counter that would resolve to the location on the line e.g.=20
exactly where the first '_' character of __COLUMN__ appeared. Combined=20
with __FILE_FULL_PATH__ and __LINE__, this would allow us produce unique =

positions.

Does that make sense?

Ali

Aug 25 2021

Stefan Koch <uplink.coder googlemail.com> writes:

On Thursday, 26 August 2021 at 05:41:17 UTC, Ali Çehreli wrote:
 On 8/24/21 3:33 PM, Ali Çehreli wrote:

 there is no __COLUMN__ or __ID__ or __COUNTER__ or etc. that

 would differ per __LINE__.

 I found the following two closed PRs that tried to implement 
 C's __COUNTER__ macro:

   https://github.com/dlang/dmd/pull/10093

   https://github.com/dlang/dmd/pull/10131

 As mentioned in the second PR, the semantics can be confusing 
 especially with separate compilation. (I wonder how C works out 
 the semantics.) (And 'static foreach' may be sufficient to 
 handle C's __COUNTER__ for us in a way that leaves the 
 semantics to the user. (I am not sure about that...))

 Actually, for this particular use case, I would be happy with a 
 __COLUMN__ counter that would resolve to the location on the 
 line e.g. exactly where the first '_' character of __COLUMN__ 
 appeared. Combined with __FILE_FULL_PATH__ and __LINE__, this 
 would allow us produce unique positions.

 Does that make sense?

 Ali

The linked `__COUNTER__` (10131)  is actually fine.

I think it does to work inside a static  foreach as well.
whereas `__COLUMN__`   would not give you unique identifiers in a 
static foreach.

Aug 26 2021

user1234 <user1234 12.de> writes:

On Thursday, 26 August 2021 at 05:41:17 UTC, Ali Çehreli wrote:
 On 8/24/21 3:33 PM, Ali Çehreli wrote:

 there is no __COLUMN__ or __ID__ or __COUNTER__ or etc. that

 would differ per __LINE__.

 I found the following two closed PRs that tried to implement 
 C's __COUNTER__ macro:

   https://github.com/dlang/dmd/pull/10093

   https://github.com/dlang/dmd/pull/10131

 As mentioned in the second PR, the semantics can be confusing 
 especially with separate compilation. (I wonder how C works out 
 the semantics.) (And 'static foreach' may be sufficient to 
 handle C's __COUNTER__ for us in a way that leaves the 
 semantics to the user. (I am not sure about that...))

 Actually, for this particular use case, I would be happy with a 
 __COLUMN__ counter that would resolve to the location on the 
 line e.g. exactly where the first '_' character of __COLUMN__ 
 appeared. Combined with __FILE_FULL_PATH__ and __LINE__, this 
 would allow us produce unique positions.

 Does that make sense?

Yes, dmd uses something similar to generate unittests names.

__COUNTER__ would have been fine to name symbols because the 
mangles start with the module name so separate compilation would 
not have caused multiple definitions errors.

Aug 26 2021

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:
 On 8/24/21 9:06 AM, Stefan Koch wrote:

 The reason why the string builder had so a flat performance

 curve was A
 BUG.

 I suspected that as soon as saw your initial post. :) Happens 
 to me too.

 results for CTFE;

 I know it's not the same use case but I've standardized on the 
 following style for string formatting both for CTFE and run 
 time:

   auto s = format!"hello %s world %s"(a, b);

 Not that I will change my style but I always wonder how it 
 compares to the following horrible one. :)

   string s;
   s ~= "hello ";
   s ~= a;        // Let's assume 'a' and 'b' are strings
   s ~= " world ";
   s ~= b;

 Thank you,
 Ali

The compile-time performance of std.format is bad.
Just importing the module costs you a lot.

Instantiating some amount of templates per format string and 
parameter type-sets, is also not going to make you particularity 
happy.

As for the runtime performance I couldn't comment.
I assume it depends on your usecase.

Aug 24 2021

Per =?UTF-8?B?Tm9yZGzDtnc=?= <per.nordlow gmail.com> writes:

On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:
 Good Morning,

Would it be motivated to have the compiler deduce that `result ~= 
"whatever"` can be lowered to a reallocating append because 
`result` isn't aliased:

```d
enum x = () { string result;
foreach(_; 0 ..n_iterations)
   result ~= "whatever"
return result; } ()
```

Aug 24 2021

Stefan Koch <uplink.coder googlemail.com> writes:

On Tuesday, 24 August 2021 at 16:16:50 UTC, Per Nordlöw wrote:
 On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:
 Good Morning,

 Would it be motivated to have the compiler deduce that `result 
 ~= "whatever"` can be lowered to a reallocating append because 
 `result` isn't aliased:

 ```d
 enum x = () { string result;
 foreach(_; 0 ..n_iterations)
   result ~= "whatever"
 return result; } ()
 ```

Of course you could try to recognize certain patterns and treat 
them specially,
but that complicates the compiler and now your compile time 
performance is dependent on a bunch of heuristics, which in my 
experience have a habit of failing in the most inconvenient ways 
and situations.

Aug 24 2021

D Programming

C/C++ Programming

Other

digitalmars.D - Sometimes 100 lines of code can fix your ctfe perf