
digitalmars.D - Optimization fun

reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
So today, I was playing around with profiling and optimizing my sliding
block puzzle solver, and found some interesting things:

1) The GC could use some serious improvement: it just so happens that
the solver's algorithm only ever needs to allocate memory, never release
it (it keeps a hash of visited states that only grows, never shrinks).
The profiler indicated that one of the hotspots is the GC. So I added an
option to the program to call GC.disable() if a command line option is
given. The result is a 40%-50% reduction in running time.
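The allocate-only pattern that makes GC.disable() safe here can be sketched in C as a bump allocator: every allocation just advances a pointer into one big block, and nothing is ever freed individually, which is exactly the situation where a collector's scan cycles are pure overhead. (`Arena`, `arena_init`, and `arena_alloc` are made-up names for illustration, not anything from the solver.)

```c
#include <stdlib.h>

/* Minimal bump allocator: allocation is a pointer increment, and
   nothing is ever freed individually -- the pattern under which a
   garbage collector's collection cycles reclaim nothing and are
   pure overhead. */
typedef struct {
    char  *base;
    size_t used, cap;
} Arena;

int arena_init(Arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->used = 0;
    a->cap  = cap;
    return a->base != NULL;
}

void *arena_alloc(Arena *a, size_t n)
{
    if (a->used + n > a->cap)
        return NULL;               /* out of reserved space */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}
```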

2) Auto-decoding is evil: the solver currently uses a naïve char[]
representation for the puzzle board, and uses countUntil() extensively
to locate things on the board. The original code had:

	auto offset = currentBoard.countUntil(ch);

which, of course, triggers autodecoding (even though there is actually
no reason to do so: ch is a char, and therefore countUntil could've used
strchr instead). Simply changing that to:

	auto offset = currentBoard.representation.countUntil(ch);

gave an additional 20%-30% performance boost.
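For comparison, the byte-wise search that .representation buys you is essentially what C's memchr does: scan raw bytes for a single byte value, no decoding anywhere. A minimal sketch (`find_byte` is a made-up helper name):

```c
#include <stddef.h>
#include <string.h>

/* Search a raw byte buffer for a single byte, like countUntil over
   .representation: one byte comparison per step, no UTF-8 decoding
   (and memchr is typically vectorized by the C library). */
ptrdiff_t find_byte(const char *board, size_t len, char ch)
{
    const char *p = memchr(board, ch, len);
    return p ? p - board : -1;
}
```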


3) However, countUntil remains at the top of the profiler's list of
hotspots. DMD's optimizer was rather disappointing here, so I looked at
gdc's output instead. Digging into the disassembly, I saw that it was
using a foreach loop over the char[], but for some reason even gdc -O3
failed to simplify that loop significantly (I'm not sure why -- maybe
what threw off the optimizer is the array indexing + length check
operation, where in C one would normally just bump the pointer
instead?). Eventually, I tried writing a manual loop:

            auto ptr = current.ptr;
            auto c = current.length;
            while (c > 0 && *ptr != m.id)
                ptr++;
            cert.headOffset = ptr - current.ptr;

This pushed gdc far enough in the right direction that it finally
produced a 3-instruction inner loop. Strangely enough, the assembly had
no check for (c > 0)... could it be because there is an assert following
the loop claiming that that will never happen, therefore gdc elided the
check? I thought Walter hasn't gotten around to implementing
optimizations based on assert yet??  Anyway, with this optimization, I
managed to shave off another 3%-5% of the total running time.
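One detail worth noting about the loop as quoted: c is never decremented inside it, so (c > 0) is loop-invariant — a compiler is entitled to test it once before the loop and drop it from the body, assert or no assert. A bounded C version of the same pointer-bumping idiom would decrement the counter each step (`scan_for` is an invented name):

```c
#include <stddef.h>

/* Pointer-bumping scan, the C idiom the loop above is compared
   against.  Unlike the quoted version, c is decremented each
   iteration, so the scan cannot run past the end of the buffer. */
ptrdiff_t scan_for(const char *buf, size_t len, char id)
{
    const char *ptr = buf;
    size_t c = len;
    while (c > 0 && *ptr != id) {
        ptr++;
        c--;
    }
    return ptr - buf;   /* == len when id was not found */
}
```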

On that note, at one point I was wondering why gdc -O3 didn't generate a
"rep scasb" for the inner loop instead of manually incrementing the loop
pointer; but a little research revealed that I'm about 6-7 years out of
date: since about 2008 gcc's optimizer has not used "rep scasb" because
on modern hardware it has atrocious performance -- much worse than a
manually-written C loop! So much for "built-in" string processing that
used to be touted in the old days. Yet another proof of the rule that
overly-complex CPU instruction sets rarely actually perform better.


Anyway, after everything is said and done, a puzzle that used to take
about 20 seconds to solve now only takes about 6 seconds. W00t!


T

-- 
Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert
Nov 06 2014
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/6/2014 2:58 PM, H. S. Teoh via Digitalmars-d wrote:
 2) Auto-decoding is evil: the solver currently uses a naïve char[]
 representation for the puzzle board, and uses countUntil() extensively
 to locate things on the board. The original code had:

 	auto offset = currentBoard.countUntil(ch);

 which, of course, triggers autodecoding (even though there is actually
 no reason to do so: ch is a char, and therefore countUntil could've used
 strchr instead). Simply changing that to:

 	auto offset = currentBoard.representation.countUntil(ch);

 gave an additional 20%-30% performance boost.
I've argued this point for over a year, apparently without convincing anyone :-( BTW, use .byChar. instead of .representation. because the former leaves it typed as 'char' and the latter converts it to 'ubyte'.
 3) However, countUntil remains at the top of the profiler's list of
 hotspots. DMD's optimizer was rather disappointing here, so I looked at
 gdc's output instead. Digging into the disassembly, I saw that it was
 using a foreach loop over the char[], but for some reason even gdc -O3
 failed to simplify that loop significantly (I'm not sure why -- maybe
 what threw off the optimizer is the array indexing + length check
 operation, where in C one would normally just bump the pointer
 instead?). Eventually, I tried writing a manual loop:

              auto ptr = current.ptr;
              auto c = current.length;
              while (c > 0 && *ptr != m.id)
                  ptr++;
              cert.headOffset = ptr - current.ptr;

 This pushed gdc far enough in the right direction that it finally
 produced a 3-instruction inner loop. Strangely enough, the assembly had
 no check for (c > 0)... could it be because there is an assert following
 the loop claiming that that will never happen, therefore gdc elided the
 check? I thought Walter hasn't gotten around to implementing
 optimizations based on assert yet??  Anyway, with this optimization, I
 managed to shave off another 3%-5% of the total running time.
Walter hasn't gotten around to implementing optimizations in GDC, that's fer sure! :-)
Nov 06 2014
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Nov 06, 2014 at 03:15:04PM -0800, Walter Bright via Digitalmars-d wrote:
 On 11/6/2014 2:58 PM, H. S. Teoh via Digitalmars-d wrote:
2) Auto-decoding is evil: the solver currently uses a naïve char[]
representation for the puzzle board, and uses countUntil()
extensively to locate things on the board. The original code had:

	auto offset = currentBoard.countUntil(ch);

which, of course, triggers autodecoding (even though there is
actually no reason to do so: ch is a char, and therefore countUntil
could've used strchr instead). Simply changing that to:

	auto offset = currentBoard.representation.countUntil(ch);

gave an additional 20%-30% performance boost.
I've argued this point for over a year, apparently without convincing anyone :-(
I dunno, my impression is that some people agree with it, but consider that the cost of changing it now a little too prohibitive due to the extensive code breakage it will cause. But the primary obstacle is that Andrei is not convinced, so he's going to veto any change in this direction regardless. You two will have to tussle it out to resolve this. ;-)
 BTW, use .byChar. instead of .representation. because the former
 leaves it typed as 'char' and the latter converts it to 'ubyte'.
I was going to use .byChar, but since I'm doing performance testing on gdc, and (my version of) gdc hasn't caught up to the latest Phobos yet, I have to settle with .representation for now. For my purposes it's harmless, since I'm not actually doing anything with the ubyte representation once countUntil is done; the return value is used to index the original char[].
3) However, countUntil remains at the top of the profiler's list of
hotspots. DMD's optimizer was rather disappointing here, so I looked
at gdc's output instead. Digging into the disassembly, I saw that it
was using a foreach loop over the char[], but for some reason even
gdc -O3 failed to simplify that loop significantly (I'm not sure why
-- maybe what threw off the optimizer is the array indexing + length
check operation, where in C one would normally just bump the pointer
instead?). Eventually, I tried writing a manual loop:

             auto ptr = current.ptr;
             auto c = current.length;
             while (c > 0 && *ptr != m.id)
                 ptr++;
             cert.headOffset = ptr - current.ptr;

This pushed gdc far enough in the right direction that it finally
produced a 3-instruction inner loop. Strangely enough, the assembly
had no check for (c > 0)... could it be because there is an assert
following the loop claiming that that will never happen, therefore
gdc elided the check? I thought Walter hasn't gotten around to
implementing optimizations based on assert yet??  Anyway, with this
optimization, I managed to shave off another 3%-5% of the total
running time.
Walter hasn't gotten around to implementing optimizations in GDC, that's fer sure! :-)
LOL... I wasn't sure whether said optimizations were going to be in the front end, or in the backend only. It might not be a bad idea to put it in the front end where applicable, then all compilers will benefit from it.

T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher
Nov 06 2014
prev sibling next sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via
Digitalmars-d wrote:
 So today, I was playing around with profiling and optimizing my sliding
 block puzzle solver, and found some interesting things:

 1) The GC could use some serious improvement: it just so happens that
 the solver's algorithm only ever needs to allocate memory, never release
 it (it keeps a hash of visited states that only grows, never shrinks).
 The profiler indicated that one of the hotspots is the GC. So I added an
 option to the program to call GC.disable() if a command line option is
 given. The result is a 40%-50% reduction in running time.
We've talked before about keeping statistics of how much memory has been reclaimed by collections, and use that to influence whether a collection should be run before going to the OS for more. Something like this should definitely get a bugzilla ticket. For now, you might want to consider guessing at how much memory you'll need and calling GC.reserve() as well. This should improve the performance of your app even further.
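The statistics idea can be sketched as a tiny decision function: track how much the last collection reclaimed, and only collect again before growing the heap if that fraction was worth it. The name, threshold, and signature below are purely hypothetical, not anything in druntime:

```c
#include <stddef.h>

/* Hypothetical heuristic: a collection is worth running before asking
   the OS for more memory only if the previous one reclaimed at least
   10% of the heap.  In an allocate-only program, reclaimed_last stays
   near zero, so collections would stop being triggered. */
int should_collect(size_t reclaimed_last, size_t heap_size)
{
    return reclaimed_last * 10 >= heap_size;
}
```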
 2) Auto-decoding is evil
Yep, it's the worst. Most string operations don't even need auto-decoding to work properly. I honestly have no idea why we have this feature.
Nov 06 2014
prev sibling next sibling parent reply "thedeemon" <dlang thedeemon.com> writes:
On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via 
Digitalmars-d wrote:
 1) The GC could use some serious improvement: it just so happens that
 the solver's algorithm only ever needs to allocate memory, never release
 it (it keeps a hash of visited states that only grows, never shrinks).
When you grow a hash table it periodically reallocates bucket arrays it uses internally, so some garbage to collect appears anyway even if you only add elements.
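That growth pattern is easy to see in miniature: with a typical doubling policy at 75% load, inserting n elements discards on the order of log2(n) old bucket arrays, each of which becomes garbage even though the program never removes an element. A sketch with made-up constants (real hash table implementations vary):

```c
#include <stddef.h>

/* Count how many times a doubling hash table would discard its old
   bucket array while inserting `inserts` elements, assuming growth
   whenever the load factor exceeds 0.75.  Each rehash leaves the old
   bucket array behind as garbage for the collector. */
size_t count_rehashes(size_t inserts)
{
    size_t buckets = 8, live = 0, rehashes = 0;
    for (size_t i = 0; i < inserts; i++) {
        if (++live > buckets * 3 / 4) {
            buckets *= 2;          /* old bucket array becomes garbage */
            rehashes++;
        }
    }
    return rehashes;
}
```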
Nov 07 2014
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Nov 07, 2014 at 09:28:56AM +0000, thedeemon via Digitalmars-d wrote:
 On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via Digitalmars-d
 wrote:
1) The GC could use some serious improvement: it just so happens that
the solver's algorithm only ever needs to allocate memory, never
release it (it keeps a hash of visited states that only grows, never
shrinks).
When you grow a hash table it periodically reallocates bucket arrays it uses internally, so some garbage to collect appears anyway even if you only add elements.
I realize that, it's just that even after accounting for that, the memory consumption rate is much higher than expected, and sure enough further investigation revealed unnecessary GC load caused by reallocating arrays where (re)using an existing one would suffice.

In the case of the function that used to return an array, which would allocate a new array every call (and there are hundreds of thousands of calls for this particular test case), after implementing buffer reuse, even though it does need to reallocate the buffer sometimes when the array runs out of space, it only reallocates 3-4 times for the entire run. I'm not sure what the overhead for the hash table would be, but I'm expecting it shouldn't be more than, say, 50-100 reallocations of the bucket array (I didn't measure this). So it shouldn't add up to *that* much.

T

-- 
There is no gravity. The earth sucks.
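The buffer-reuse pattern described above looks roughly like this in C (`Buf` and `buf_fit` are invented names, not the solver's actual code): one buffer lives across calls and is reallocated only when the needed size exceeds its capacity, so hundreds of thousands of calls cost only a handful of reallocations.

```c
#include <stdlib.h>

/* Reusable output buffer: instead of allocating a fresh array per
   call, the caller keeps one Buf and the function grows it (by
   doubling) only when the result no longer fits.  realloc_count
   records how rarely that actually happens. */
typedef struct {
    char  *data;
    size_t cap;
    size_t realloc_count;
} Buf;

char *buf_fit(Buf *b, size_t need)
{
    if (need > b->cap) {
        size_t newcap = b->cap ? b->cap : 16;
        while (newcap < need)
            newcap *= 2;
        char *p = realloc(b->data, newcap);
        if (!p)
            return NULL;           /* original buffer still valid */
        b->data = p;
        b->cap = newcap;
        b->realloc_count++;
    }
    return b->data;
}
```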
Nov 07 2014
prev sibling parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
On 11/6/14, 7:58 PM, H. S. Teoh via Digitalmars-d wrote:
 So today, I was playing around with profiling and optimizing my sliding
 block puzzle solver, and found some interesting things:

 1) The GC could use some serious improvement: it just so happens that
 the solver's algorithm only ever needs to allocate memory, never release
 it (it keeps a hash of visited states that only grows, never shrinks).
 The profiler indicated that one of the hotspots is the GC. So I added an
 option to the program to call GC.disable() if a command line option is
 given. The result is a 40%-50% reduction in running time.

 2) Auto-decoding is evil: the solver currently uses a naïve char[]
 representation for the puzzle board, and uses countUntil() extensively
 to locate things on the board. The original code had:

 	auto offset = currentBoard.countUntil(ch);

 which, of course, triggers autodecoding (even though there is actually
 no reason to do so: ch is a char, and therefore countUntil could've used
 strchr instead). Simply changing that to:

 	auto offset = currentBoard.representation.countUntil(ch);

 gave an additional 20%-30% performance boost.


 3) However, countUntil remains at the top of the profiler's list of
 hotspots. DMD's optimizer was rather disappointing here, so I looked at
 gdc's output instead. Digging into the disassembly, I saw that it was
 using a foreach loop over the char[], but for some reason even gdc -O3
 failed to simplify that loop significantly (I'm not sure why -- maybe
 what threw off the optimizer is the array indexing + length check
 operation, where in C one would normally just bump the pointer
 instead?). Eventually, I tried writing a manual loop:

              auto ptr = current.ptr;
              auto c = current.length;
              while (c > 0 && *ptr != m.id)
                  ptr++;
              cert.headOffset = ptr - current.ptr;

 This pushed gdc far enough in the right direction that it finally
 produced a 3-instruction inner loop. Strangely enough, the assembly had
 no check for (c > 0)... could it be because there is an assert following
 the loop claiming that that will never happen, therefore gdc elided the
 check? I thought Walter hasn't gotten around to implementing
 optimizations based on assert yet??  Anyway, with this optimization, I
 managed to shave off another 3%-5% of the total running time.

 On that note, at one point I was wondering why gdc -O3 didn't generate a
 "rep scasb" for the inner loop instead of manually incrementing the loop
 pointer; but a little research revealed that I'm about 6-7 years out of
 date: since about 2008 gcc's optimizer has not used "rep scasb" because
 on modern hardware it has atrocious performance -- much worse than a
 manually-written C loop! So much for "built-in" string processing that
 used to be touted in the old days. Yet another proof of the rule that
 overly-complex CPU instruction sets rarely actually perform better.


 Anyway, after everything is said and done, a puzzle that used to take
 about 20 seconds to solve now only takes about 6 seconds. W00t!


 T
Hi,

Is the code public? I'd like to port it to other languages and see how they behave, see if this is a general problem or just something specific to D.

Thanks!
Nov 07 2014
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Nov 07, 2014 at 04:06:44PM -0300, Ary Borenszweig via Digitalmars-d
wrote:
[...]
 Is the code public? I'd like to port it to other languages and see how
 they behave, see if this is a general problem or just something
 specific to D.
[...]

I haven't posted the code anywhere yet, but I could certainly send you a copy if you want. However, the test input I'm using is not public, so you'll have to craft your own input files for testing purposes. The public input files I do have are too simple to pose a significant challenge to the solver (plus, they default to BFS instead of A*, so you might get different results from what I described here).

T

-- 
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash, / The day and hour soon are coming / When all the IT folks say "Gosh!" / It isn't from a clever lawsuit / That Windowsland will finally fall, / But thousands writing open source code / Like mice who nibble through a wall. -- The Linux-nationale by Greg Baker
Nov 07 2014
parent Ary Borenszweig <ary esperanto.org.ar> writes:
On 11/7/14, 4:16 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Nov 07, 2014 at 04:06:44PM -0300, Ary Borenszweig via Digitalmars-d
wrote:
 [...]
 Is the code public? I'd like to port it to other languages and see how
 they behave, see if this is a general problem or just something
 specific to D.
[...]

 I haven't posted the code anywhere yet, but I could certainly send you a
 copy if you want. However, the test input I'm using is not public, so
 you'll have to craft your own input files for testing purposes. The
 public input files I do have are too simple to pose a significant
 challenge to the solver (plus, they default to BFS instead of A*, so you
 might get different results from what I described here).

 T
Then I don't think I'll be able to do any useful benchmark/profile with that. Thanks anyway :-)
Nov 07 2014