
digitalmars.D - Optimization fun

reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
So today, I was playing around with profiling and optimizing my sliding
block puzzle solver, and found some interesting things:

1) The GC could use some serious improvement: it just so happens that
the solver's algorithm only ever needs to allocate memory, never release
it (it keeps a hash of visited states that only grows, never shrinks).
The profiler indicated that one of the hotspots is the GC. So I added an
option to the program to call GC.disable() if a command line option is
given. The result is a 40%-50% reduction in running time.
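The allocate-only pattern that makes GC.disable() safe here can be sketched in C as a bump allocator: every allocation just advances a pointer into one big block, and nothing is ever freed individually, which is exactly the situation where a collector's scan cycles are pure overhead. (`Arena`, `arena_init`, and `arena_alloc` are made-up names for illustration, not anything from the solver.)

```c
#include <stdlib.h>

/* Minimal bump allocator: allocation is a pointer increment, and
   nothing is ever freed individually -- the pattern under which a
   garbage collector's collection cycles reclaim nothing and are
   pure overhead. */
typedef struct {
    char  *base;
    size_t used, cap;
} Arena;

int arena_init(Arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->used = 0;
    a->cap  = cap;
    return a->base != NULL;
}

void *arena_alloc(Arena *a, size_t n)
{
    if (a->used + n > a->cap)
        return NULL;               /* out of reserved space */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}
```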

2) Auto-decoding is evil: the solver currently uses a naïve char[]
representation for the puzzle board, and uses countUntil() extensively
to locate things on the board. The original code had:

	auto offset = currentBoard.countUntil(ch);

which, of course, triggers autodecoding (even though there is actually
no reason to do so: ch is a char, and therefore countUntil could've used
strchr instead). Simply changing that to:

	auto offset = currentBoard.representation.countUntil(ch);

gave an additional 20%-30% performance boost.
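For comparison, the byte-wise search that .representation buys you is essentially what C's memchr does: scan raw bytes for a single byte value, no decoding anywhere. A minimal sketch (`find_byte` is a made-up helper name):

```c
#include <stddef.h>
#include <string.h>

/* Search a raw byte buffer for a single byte, like countUntil over
   .representation: one byte comparison per step, no UTF-8 decoding
   (and memchr is typically vectorized by the C library). */
ptrdiff_t find_byte(const char *board, size_t len, char ch)
{
    const char *p = memchr(board, ch, len);
    return p ? p - board : -1;
}
```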


3) However, countUntil remains at the top of the profiler's list of
hotspots. DMD's optimizer was rather disappointing here, so I looked at
gdc's output instead. Digging into the disassembly, I saw that it was
using a foreach loop over the char[], but for some reason even gdc -O3
failed to simplify that loop significantly (I'm not sure why -- maybe
what threw off the optimizer is the array indexing + length check
operation, where in C one would normally just bump the pointer
instead?). Eventually, I tried writing a manual loop:

            auto ptr = current.ptr;
            auto c = current.length;
            while (c > 0 && *ptr != m.id)
                ptr++;
            cert.headOffset = ptr - current.ptr;

This pushed gdc far enough in the right direction that it finally
produced a 3-instruction inner loop. Strangely enough, the assembly had
no check for (c > 0)... could it be because there is an assert following
the loop claiming that that will never happen, therefore gdc elided the
check? I thought Walter hasn't gotten around to implementing
optimizations based on assert yet??  Anyway, with this optimization, I
managed to shave off another 3%-5% of the total running time.
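One detail worth noting about the loop as quoted: c is never decremented inside it, so (c > 0) is loop-invariant — a compiler is entitled to test it once before the loop and drop it from the body, assert or no assert. A bounded C version of the same pointer-bumping idiom would decrement the counter each step (`scan_for` is an invented name):

```c
#include <stddef.h>

/* Pointer-bumping scan, the C idiom the loop above is compared
   against.  Unlike the quoted version, c is decremented each
   iteration, so the scan cannot run past the end of the buffer. */
ptrdiff_t scan_for(const char *buf, size_t len, char id)
{
    const char *ptr = buf;
    size_t c = len;
    while (c > 0 && *ptr != id) {
        ptr++;
        c--;
    }
    return ptr - buf;   /* == len when id was not found */
}
```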

On that note, at one point I was wondering why gdc -O3 didn't generate a
"rep scasb" for the inner loop instead of manually incrementing the loop
pointer; but a little research revealed that I'm about 6-7 years out of
date: since about 2008 gcc's optimizer has not used "rep scasb" because
on modern hardware it has atrocious performance -- much worse than a
manually-written C loop! So much for "built-in" string processing that
used to be touted in the old days. Yet another proof of the rule that
overly-complex CPU instruction sets rarely actually perform better.


Anyway, after everything is said and done, a puzzle that used to take
about 20 seconds to solve now only takes about 6 seconds. W00t!


T

-- 
Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert
Nov 06 2014
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/6/2014 2:58 PM, H. S. Teoh via Digitalmars-d wrote:
 2) Auto-decoding is evil: the solver currently uses a naïve char[]
 representation for the puzzle board, and uses countUntil() extensively
 to locate things on the board. The original code had:

 	auto offset = currentBoard.countUntil(ch);

 which, of course, triggers autodecoding (even though there is actually
 no reason to do so: ch is a char, and therefore countUntil could've used
 strchr instead). Simply changing that to:

 	auto offset = currentBoard.representation.countUntil(ch);

 gave an additional 20%-30% performance boost.
I've argued this point for over a year, apparently without convincing anyone :-( BTW, use .byChar. instead of .representation. because the former leaves it typed as 'char' and the latter converts it to 'ubyte'.
 3) However, countUntil remains at the top of the profiler's list of
 hotspots. DMD's optimizer was rather disappointing here, so I looked at
 gdc's output instead. Digging into the disassembly, I saw that it was
 using a foreach loop over the char[], but for some reason even gdc -O3
 failed to simplify that loop significantly (I'm not sure why -- maybe
 what threw off the optimizer is the array indexing + length check
 operation, where in C one would normally just bump the pointer
 instead?). Eventually, I tried writing a manual loop:

              auto ptr = current.ptr;
              auto c = current.length;
              while (c > 0 && *ptr != m.id)
                  ptr++;
              cert.headOffset = ptr - current.ptr;

 This pushed gdc far enough in the right direction that it finally
 produced a 3-instruction inner loop. Strangely enough, the assembly had
 no check for (c > 0)... could it be because there is an assert following
 the loop claiming that that will never happen, therefore gdc elided the
 check? I thought Walter hasn't gotten around to implementing
 optimizations based on assert yet??  Anyway, with this optimization, I
 managed to shave off another 3%-5% of the total running time.
Walter hasn't gotten around to implementing optimizations in GDC, that's fer sure! :-)
Nov 06 2014
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Nov 06, 2014 at 03:15:04PM -0800, Walter Bright via Digitalmars-d wrote:
 On 11/6/2014 2:58 PM, H. S. Teoh via Digitalmars-d wrote:
2) Auto-decoding is evil: the solver currently uses a naïve char[]
representation for the puzzle board, and uses countUntil()
extensively to locate things on the board. The original code had:

	auto offset = currentBoard.countUntil(ch);

which, of course, triggers autodecoding (even though there is
actually no reason to do so: ch is a char, and therefore countUntil
could've used strchr instead). Simply changing that to:

	auto offset = currentBoard.representation.countUntil(ch);

gave an additional 20%-30% performance boost.
I've argued this point for over a year, apparently without convincing anyone :-(
I dunno, my impression is that some people agree with it, but consider that the cost of changing it now a little too prohibitive due to the extensive code breakage it will cause. But the primary obstacle is that Andrei is not convinced, so he's going to veto any change in this direction regardless. You two will have to tussle it out to resolve this. ;-)
 BTW, use .byChar. instead of .representation. because the former
 leaves it typed as 'char' and the latter converts it to 'ubyte'.
I was going to use .byChar, but since I'm doing performance testing on gdc, and (my version of) gdc hasn't caught up to the latest Phobos yet, I have to settle with .representation for now. For my purposes it's harmless, since I'm not actually doing anything with the ubyte representation once countUntil is done; the return value is used to index the original char[].
3) However, countUntil remains at the top of the profiler's list of
hotspots. DMD's optimizer was rather disappointing here, so I looked
at gdc's output instead. Digging into the disassembly, I saw that it
was using a foreach loop over the char[], but for some reason even
gdc -O3 failed to simplify that loop significantly (I'm not sure why
-- maybe what threw off the optimizer is the array indexing + length
check operation, where in C one would normally just bump the pointer
instead?). Eventually, I tried writing a manual loop:

             auto ptr = current.ptr;
             auto c = current.length;
             while (c > 0 && *ptr != m.id)
                 ptr++;
             cert.headOffset = ptr - current.ptr;

This pushed gdc far enough in the right direction that it finally
produced a 3-instruction inner loop. Strangely enough, the assembly
had no check for (c > 0)... could it be because there is an assert
following the loop claiming that that will never happen, therefore
gdc elided the check? I thought Walter hasn't gotten around to
implementing optimizations based on assert yet??  Anyway, with this
optimization, I managed to shave off another 3%-5% of the total
running time.
Walter hasn't gotten around to implementing optimizations in GDC, that's fer sure! :-)
LOL... I wasn't sure whether said optimizations were going to be in the front end, or in the backend only. It might not be a bad idea to put it in the front end where applicable, then all compilers will benefit from it.

T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher
Nov 06 2014
prev sibling next sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via
Digitalmars-d wrote:
 So today, I was playing around with profiling and optimizing my sliding
 block puzzle solver, and found some interesting things:

 1) The GC could use some serious improvement: it just so happens that
 the solver's algorithm only ever needs to allocate memory, never release
 it (it keeps a hash of visited states that only grows, never shrinks).
 The profiler indicated that one of the hotspots is the GC. So I added an
 option to the program to call GC.disable() if a command line option is
 given. The result is a 40%-50% reduction in running time.
We've talked before about keeping statistics of how much memory has been reclaimed by collections, and use that to influence whether a collection should be run before going to the OS for more. Something like this should definitely get a bugzilla ticket. For now, you might want to consider guessing at how much memory you'll need and calling GC.reserve() as well. This should improve the performance of your app even further.
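The statistics idea can be sketched as a tiny decision function: track how much the last collection reclaimed, and only collect again before growing the heap if that fraction was worth it. The name, threshold, and signature below are purely hypothetical, not anything in druntime:

```c
#include <stddef.h>

/* Hypothetical heuristic: a collection is worth running before asking
   the OS for more memory only if the previous one reclaimed at least
   10% of the heap.  In an allocate-only program, reclaimed_last stays
   near zero, so collections would stop being triggered. */
int should_collect(size_t reclaimed_last, size_t heap_size)
{
    return reclaimed_last * 10 >= heap_size;
}
```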
 2) Auto-decoding is evil
Yep, it's the worst. Most string operations don't even need auto-decoding to work properly. I honestly have no idea why we have this feature.
Nov 06 2014
prev sibling next sibling parent reply "thedeemon" <dlang thedeemon.com> writes:
On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via 
Digitalmars-d wrote:
 1) The GC could use some serious improvement: it just so happens that
 the solver's algorithm only ever needs to allocate memory, never release
 it (it keeps a hash of visited states that only grows, never shrinks).
When you grow a hash table it periodically reallocates bucket arrays it uses internally, so some garbage to collect appears anyway even if you only add elements.
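That growth pattern is easy to see in miniature: with a typical doubling policy at 75% load, inserting n elements discards on the order of log2(n) old bucket arrays, each of which becomes garbage even though the program never removes an element. A sketch with made-up constants (real hash table implementations vary):

```c
#include <stddef.h>

/* Count how many times a doubling hash table would discard its old
   bucket array while inserting `inserts` elements, assuming growth
   whenever the load factor exceeds 0.75.  Each rehash leaves the old
   bucket array behind as garbage for the collector. */
size_t count_rehashes(size_t inserts)
{
    size_t buckets = 8, live = 0, rehashes = 0;
    for (size_t i = 0; i < inserts; i++) {
        if (++live > buckets * 3 / 4) {
            buckets *= 2;          /* old bucket array becomes garbage */
            rehashes++;
        }
    }
    return rehashes;
}
```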
Nov 07 2014
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Nov 07, 2014 at 09:28:56AM +0000, thedeemon via Digitalmars-d wrote:
 On Thursday, 6 November 2014 at 23:00:19 UTC, H. S. Teoh via Digitalmars-d
 wrote:
1) The GC could use some serious improvement: it just so happens that
the solver's algorithm only ever needs to allocate memory, never
release it (it keeps a hash of visited states that only grows, never
shrinks).
When you grow a hash table it periodically reallocates bucket arrays it uses internally, so some garbage to collect appears anyway even if you only add elements.
I realize that, it's just that even after accounting for that, the memory consumption rate is much higher than expected, and sure enough further investigation revealed unnecessary GC load caused by reallocating arrays where (re)using an existing one would suffice.

In the case of the function that used to return an array, which would allocate a new array every call (and there are hundreds of thousands of calls for this particular test case), after implementing buffer reuse, even though it does need to reallocate the buffer sometimes when the array runs out of space, it only reallocates 3-4 times for the entire run. I'm not sure what the overhead for the hash table would be, but I'm expecting it shouldn't be more than, say, 50-100 reallocations of the bucket array (I didn't measure this). So it shouldn't add up to *that* much.

T

-- 
There is no gravity. The earth sucks.
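The buffer-reuse pattern described above looks roughly like this in C (`Buf` and `buf_fit` are invented names, not the solver's actual code): one buffer lives across calls and is reallocated only when the needed size exceeds its capacity, so hundreds of thousands of calls cost only a handful of reallocations.

```c
#include <stdlib.h>

/* Reusable output buffer: instead of allocating a fresh array per
   call, the caller keeps one Buf and the function grows it (by
   doubling) only when the result no longer fits.  realloc_count
   records how rarely that actually happens. */
typedef struct {
    char  *data;
    size_t cap;
    size_t realloc_count;
} Buf;

char *buf_fit(Buf *b, size_t need)
{
    if (need > b->cap) {
        size_t newcap = b->cap ? b->cap : 16;
        while (newcap < need)
            newcap *= 2;
        char *p = realloc(b->data, newcap);
        if (!p)
            return NULL;           /* original buffer still valid */
        b->data = p;
        b->cap = newcap;
        b->realloc_count++;
    }
    return b->data;
}
```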
Nov 07 2014
prev sibling parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
On 11/6/14, 7:58 PM, H. S. Teoh via Digitalmars-d wrote:
 So today, I was playing around with profiling and optimizing my sliding
 block puzzle solver, and found some interesting things:

 1) The GC could use some serious improvement: it just so happens that
 the solver's algorithm only ever needs to allocate memory, never release
 it (it keeps a hash of visited states that only grows, never shrinks).
 The profiler indicated that one of the hotspots is the GC. So I added an
 option to the program to call GC.disable() if a command line option is
 given. The result is a 40%-50% reduction in running time.

 2) Auto-decoding is evil: the solver currently uses a naïve char[]
 representation for the puzzle board, and uses countUntil() extensively
 to locate things on the board. The original code had:

 	auto offset = currentBoard.countUntil(ch);

 which, of course, triggers autodecoding (even though there is actually
 no reason to do so: ch is a char, and therefore countUntil could've used
 strchr instead). Simply changing that to:

 	auto offset = currentBoard.representation.countUntil(ch);

 gave an additional 20%-30% performance boost.


 3) However, countUntil remains at the top of the profiler's list of
 hotspots. DMD's optimizer was rather disappointing here, so I looked at
 gdc's output instead. Digging into the disassembly, I saw that it was
 using a foreach loop over the char[], but for some reason even gdc -O3
 failed to simplify that loop significantly (I'm not sure why -- maybe
 what threw off the optimizer is the array indexing + length check
 operation, where in C one would normally just bump the pointer
 instead?). Eventually, I tried writing a manual loop:

              auto ptr = current.ptr;
              auto c = current.length;
              while (c > 0 && *ptr != m.id)
                  ptr++;
              cert.headOffset = ptr - current.ptr;

 This pushed gdc far enough in the right direction that it finally
 produced a 3-instruction inner loop. Strangely enough, the assembly had
 no check for (c > 0)... could it be because there is an assert following
 the loop claiming that that will never happen, therefore gdc elided the
 check? I thought Walter hasn't gotten around to implementing
 optimizations based on assert yet??  Anyway, with this optimization, I
 managed to shave off another 3%-5% of the total running time.

 On that note, at one point I was wondering why gdc -O3 didn't generate a
 "rep scasb" for the inner loop instead of manually incrementing the loop
 pointer; but a little research revealed that I'm about 6-7 years out of
 date: since about 2008 gcc's optimizer has not used "rep scasb" because
 on modern hardware it has atrocious performance -- much worse than a
 manually-written C loop! So much for "built-in" string processing that
 used to be touted in the old days. Yet another proof of the rule that
 overly-complex CPU instruction sets rarely actually perform better.


 Anyway, after everything is said and done, a puzzle that used to take
 about 20 seconds to solve now only takes about 6 seconds. W00t!


 T
Hi,

Is the code public? I'd like to port it to other languages and see how they behave, see if this is a general problem or just something specific to D.

Thanks!
Nov 07 2014
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Nov 07, 2014 at 04:06:44PM -0300, Ary Borenszweig via Digitalmars-d
wrote:
[...]
 Is the code public? I'd like to port it to other languages and see how
 they behave, see if this is a general problem or just something
 specific to D.
[...]

I haven't posted the code anywhere yet, but I could certainly send you a copy if you want. However, the test input I'm using is not public, so you'll have to craft your own input files for testing purposes. The public input files I do have are too simple to pose a significant challenge to the solver (plus, they default to BFS instead of A*, so you might get different results from what I described here).

T

-- 
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash, / The day and hour soon are coming / When all the IT folks say "Gosh!" / It isn't from a clever lawsuit / That Windowsland will finally fall, / But thousands writing open source code / Like mice who nibble through a wall. -- The Linux-nationale by Greg Baker
Nov 07 2014
parent Ary Borenszweig <ary esperanto.org.ar> writes:
On 11/7/14, 4:16 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Nov 07, 2014 at 04:06:44PM -0300, Ary Borenszweig via Digitalmars-d
wrote:
 [...]
 Is the code public? I'd like to port it to other languages and see how
 they behave, see if this is a general problem or just something
 specific to D.
[...]

 I haven't posted the code anywhere yet, but I could certainly send you a
 copy if you want. However, the test input I'm using is not public, so
 you'll have to craft your own input files for testing purposes. The
 public input files I do have are too simple to pose a significant
 challenge to the solver (plus, they default to BFS instead of A*, so you
 might get different results from what I described here).

 T
Then I don't think I'll be able to do any useful benchmark/profile with that. Thanks anyway :-)
Nov 07 2014