## digitalmars.D.learn - A significant performance difference

"bearophile" <bearophileHUGS lycos.com> writes:
```This is C++ code that solves one Euler problem:

--------------
#include <stdio.h>
#include <map>

const unsigned int H = 9, W = 12;

const int g[6][3] = {{7, 0, H - 3},
                     {1 + (1 << H) + (1 << (2 * H)), 0, H - 1},
                     {3 + (1 << H), 0, H - 2},
                     {3 + (2 << H), 0, H - 2},
                     {1 + (1 << H) + (2 << H), 0, H - 2},
                     {1 + (1 << H) + (1 << (H - 1)), 1, H - 1}};

int main() {
    unsigned long long p, i, k;
    unsigned int j, l;
    std::map<unsigned int, unsigned long long> x, y;
    x[0] = 1;

    for (i = 0; i < W; ++i) {
        y.clear();
        while (!x.empty()) {
            j = x.begin()->first;
            p = x.begin()->second;
            x.erase(x.begin());

            for (k = 0; k < H; ++k)
                if ((j & (1 << k)) == 0)
                    break;

            if (k == H)
                y[j >> H] += p;
            else
                for (l = 0; l < 6; ++l)
                    if (k >= g[l][1] && k <= g[l][2])
                        if ((j & (g[l][0] << k)) == 0)
                            x[j + (g[l][0] << k)] += p;
        }
        x = y;
    }

    printf("%lld\n", y[0]);
    return 0;
}
--------------

I have translated it to D like this (I know there are nicer ways
to write it in D, but I have tried to keep the look of the code
as similar as possible to the C++ code):

--------------
import core.stdc.stdio;

const uint H = 9, W = 12;

const uint[3][6] g = [[7, 0, H - 3],
                      [1 + (1 << H) + (1 << (2 * H)), 0, H - 1],
                      [3 + (1 << H), 0, H - 2],
                      [3 + (2 << H), 0, H - 2],
                      [1 + (1 << H) + (2 << H), 0, H - 2],
                      [1 + (1 << H) + (1 << (H - 1)), 1, H - 1]];

int main() {
    ulong p, i, k;
    uint j, l;
    ulong[uint] x, y;
    x[0] = 1;

    for (i = 0; i < W; ++i) {
        y = null;
        while (x.length) {
            j = x.byKey.front;
            p = x.byValue.front;
            x.remove(j);

            for (k = 0; k < H; ++k)
                if ((j & (1 << k)) == 0)
                    break;

            if (k == H)
                y[j >> H] += p;
            else
                for (l = 0; l < 6; ++l)
                    if (k >= g[l][1] && k <= g[l][2])
                        if ((j & (g[l][0] << k)) == 0)
                            x[j + (g[l][0] << k)] += p;
        }
        x = y;
    }

    printf("%lld\n", y[0]);
    return 0;
}
--------------

The C++ code is much faster than the D code (I see the D code run
30+ times slower with dmd and about 20 times slower with ldc2).
One difference between the C++ and D code is that the C++ map uses
a search tree (probably a red-black tree), while the D code uses a
hash table.

To test that algorithmic difference, if I replace the map in the
C++ code with a std::unordered_map (C++11):

#include <unordered_map>
...
std::unordered_map<unsigned int, unsigned long long> x, y;

then the run-time increases (by more than a factor of two), but
it's still much faster than the D code.

Is it possible to fix the D code to increase its performance?
(There are associative array libraries for D, but I have not
tried them in this program.)

Bye,
bearophile
```
Aug 31 2014
Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
```On Sun, 31 Aug 2014 10:55:31 +0000,
bearophile via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote:

This is C++ code that solves one Euler problem:
[...]
Is it possible to fix the D code to increase its performance
(there are associative array libraries for D, but I have not
tried them in this program)?

Bye,
bearophile

I think the main problem is in calling delegates (x.byKey.front
and x.byValue.front). If I change x.byValue.front to x[j], it
makes the program a lot faster.
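
The suggested change in the hot loop, as a sketch (only the value
read differs from the original D code):

    j = x.byKey.front;   // unchanged: still takes the "first" key
    p = x[j];            // direct lookup instead of x.byValue.front
    x.remove(j);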
```
Sep 01 2014
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
```On Mon, 1 Sep 2014 09:21:03 +0200
Daniel Kozak via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

I think main problem is in calling delegates (x.byKey.front and
x.byValue.front;). If I change x.byValue.front with x[j], It makes
program a lot faster

yes. replacing `p = x.byValue.front;` by `p = x[j];` cut down
execution time from 3.2 seconds to 1.8 seconds with my gdc.

i always dreamt about an official 'get any key from AA' API. and,
btw, a 'get value and remove key' API too. they are useful sometimes.

seems that i have to write another silly patch for this. ;-)
```
Sep 01 2014
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
```On Mon, 1 Sep 2014 09:21:03 +0200
Daniel Kozak via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

and this version executes in 0.2 seconds:

import core.stdc.stdio;

immutable H = 9, W = 12;

immutable uint[3][6] g = [
    [7, 0, H - 3],
    [1 + (1 << H) + (1 << (2 * H)), 0, H - 1],
    [3 + (1 << H), 0, H - 2],
    [3 + (2 << H), 0, H - 2],
    [1 + (1 << H) + (2 << H), 0, H - 2],
    [1 + (1 << H) + (1 << (H - 1)), 1, H - 1]
];

int main () {
    ulong p, i, k;
    uint j, l;
    ulong[uint] x, y;
    x[0] = 1;

    for (i = 0; i < W; ++i) {
        y = null;
        while (x.length) {
            foreach (key; x.keys) {
                auto pp = key in x;
                if (pp is null) continue;
                j = key;
                p = *pp;
                x.remove(j);

                for (k = 0; k < H; ++k) {
                    if ((j & (1 << k)) == 0) break;
                }

                if (k == H) {
                    y[j >> H] += p;
                } else {
                    for (l = 0; l < 6; ++l) {
                        if (k >= g[l][1] && k <= g[l][2]) {
                            if ((j & (g[l][0] << k)) == 0) {
                                x[j + (g[l][0] << k)] += p;
                            }
                        }
                    }
                }
            }
        }
        x = y;
    }

    printf("%lld\n", y[0]);
    return 0;
}

so yes, creating delegates is SLOW. even taking into account that
we're creating dynamic arrays with keys in this version, it's rockin'
fast.
```
Sep 01 2014
"bearophile" <bearophileHUGS lycos.com> writes:
```ketmar:

Thank you for your analysis and code.

while (x.length) {
    foreach (key; x.keys) {
        auto pp = key in x;
        if (pp is null) continue;
        j = key;
        p = *pp;
        x.remove(j);

This is a little cleaner:

while (x.length) {
    foreach (immutable j; x.keys) {
        p = x[j];
        x.remove(j);

so yes, creating delegates is SLOW. even taking into account
that we're creating dynamic arrays with keys in this version,
it's rockin' fast.

This D version is much faster: it runs in about 0.26 seconds
(while the C++ version runs in about 0.21 seconds). This is
acceptable, because x.keys allocates a dynamic array of keys in
the middle of the code.

But I think the "j = x.byKey.front;  p = x.byValue.front;" code
looks sufficiently idiomatic, and its performance is terrible. It
seems fit for a performance bug report. I have to create a
synthetic benchmark that shows just the problem.
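
A minimal synthetic test along these lines should show just that
(a hypothetical sketch; it only drains the AA through the slow
idiom and does nothing else):

    import core.stdc.stdio;

    int main() {
        long[int] aa;
        for (int i = 0; i < 50_000; i++)
            aa[i] = i;
        long total = 0;
        while (aa.length) {
            immutable k = aa.byKey.front; // the suspect idiom
            total += aa[k];               // (aa.byValue.front is analogous)
            aa.remove(k);
        }
        printf("%lld\n", total);
        return 0;
    }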

Bye,
bearophile
```
Sep 01 2014
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
```On Mon, 01 Sep 2014 08:53:45 +0000
bearophile via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote:

ketmar:
Thank you for your analysis and code.

actually, it's based on Daniel Kozak's observation, so credit him

But I think the "j = x.byKey.front;  p = x.byValue.front;" code
looks sufficiently idiomatic, and its performance is terrible.

that's why i suggest extending the AA API, adding x.firstKey and
x.firstValue functions. i'll try to make the patch soon.

ah, and x.getAndRemove() too, it's handy sometimes. but this will
need a new druntime hook, i think. ;-)
```
Sep 01 2014
"bearophile" <bearophileHUGS lycos.com> writes:
```ketmar:

But I think the "j = x.byKey.front;  p = x.byValue.front;"
code looks sufficiently idiomatic, and its performance is
terrible.

that's why i suggest extending the AA API, adding x.firstKey and
x.firstValue functions. i'll try to make the patch soon.

In theory the best solution is to improve the performance of the
"byKey.front" and "byValue.front" idioms.

If that's not possible and you need to add the
firstKey/firstValue functions, then it's a good idea to make the
D front-end rewrite "byKey.front" as a call to "firstKey" and
"byValue.front" as a call to "firstValue".

ah, and x.getAndRemove() too, it's handy sometimes. but this
will need a new druntime hook, i think. ;-)

x.pop() sounds nicer :-)
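
For illustration, a user-side sketch of such a pop (a hypothetical
helper, not in druntime; it still pays the find-first cost):

    V pop(K, V)(ref V[K] aa) {
        immutable k = aa.byKey.front; // linear scan for the first bucket
        auto v = aa[k];
        aa.remove(k);
        return v;
    }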

Bye,
bearophile
```
Sep 01 2014
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
```On Mon, 01 Sep 2014 09:22:50 +0000
bearophile via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote:

In theory the best solution is to improve the performance of the
"byKey.front" and "byValue.front" idioms.

i found that the slowdown is from _aaRange(), not from delegates.
the following code is slow even w/o delegate creation:

import core.stdc.stdio;

ref K firstKey(T : V[K], V, K)(T aa) @property
{
    return *cast(K*)_aaRangeFrontKey(_aaRange(cast(void*)aa));
}

ref V firstVal(T : V[K], V, K)(T aa) @property
{
    return *cast(V*)_aaRangeFrontValue(_aaRange(cast(void*)aa));
}

int main() {
    long[int] aa;
    for (int i = 0; i < 50000; i++)
        aa[i] = i;

    long total = 0;

    while (aa.length) {
        //int k = aa.byKey.front;
        int k = aa.firstKey;
        //long v = aa.byValue.front;
        long v = aa.firstVal;
        aa.remove(k);
        total += k + v;
    }

    printf("%lld\n", total);
    return 0;
}

seems that we need two more hooks for AAs: `_aaFrontKey()` and
`_aaFrontValue()`.

x.pop() sounds nicer :-)

ah, sure. i'm not very good in inventing names. ;-)
```
Sep 01 2014
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
```On Mon, 1 Sep 2014 12:38:52 +0300
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote:

i found that slowdown is from _aaRange()

and i'm wrong again. adding `_aaFrontKey()` and `_aaFrontValue()` makes
no difference at all.

that's 'cause both hooks need to go thru the bucket array to find the
first entry. unless we add some pointer to the 'first used bucket',
getting the 'first' key and value will be slow.

so, to make this fast, we need to cache info about the 'first used
bucket'.

i hacked in VERY simple caching, and... voila: execution time for the
'byKey/byValue' variant drops from 3 seconds to 1.5 seconds. seems that
it's as far as i can get without hurting AA performance in other cases.
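
the rough idea of the cache, as a sketch (not the actual patch):

    // struct Impl {
    //     Bucket[] buckets;
    //     size_t firstUsed; // cached index of the first non-empty bucket
    // }
    // on "first entry" lookup: start scanning at firstUsed, not at 0;
    // on insert into slot i:   if (i < firstUsed) firstUsed = i;
    // on remove: if the bucket at firstUsed became empty, advance
    //            firstUsed lazily to the next non-empty bucket.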
```
Sep 01 2014
Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
```On Mon, 1 Sep 2014 12:38:52 +0300,
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote:

i found that slowdown is from _aaRange(), not from delegates.
the following code is slow even w/o delegate creation:
[...]
seems that we need two more hooks for AAs: `_aaFrontKey()` and
`_aaFrontValue()`.

Yep, I found out something similar, with this ugly code:

for (i = 0; i < W; ++i) {
    y = null;
    while (x.length) {
        foreach (lj, lp; x) {
            j = lj;
            p = lp;
            x.remove(j);
            break;
        }
        ...
    }
}

The problem is with searching for the first element.
```
Sep 01 2014
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
```On Mon, 1 Sep 2014 12:51:58 +0200
Daniel Kozak via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

The problem is with searching for the first element

yes. AAs aren't very good at "give me some so-called 'first'
element", especially the current druntime AAs. i'm not sure it's
worth heavy fixing though. i sped it up a little (see bugzilla), but
it will remain painfully slow, alas. AAs just don't fit such use cases.
```
Sep 01 2014
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
```On Mon, 01 Sep 2014 08:53:45 +0000
bearophile via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote:

This is a little cleaner:

yes, i just did a 'quickhack' without even analyzing what the code
does. ;-)
```
Sep 01 2014
"bearophile" <bearophileHUGS lycos.com> writes:
``` It seems fit for a performance bug report. I have to create a
synthetic benchmark that shows just the problem.

https://issues.dlang.org/show_bug.cgi?id=13410

(Perhaps I have credited the wrong person there, sorry, I will
fix).

Bye,
bearophile
```
Sep 01 2014
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
```On Mon, 01 Sep 2014 09:16:04 +0000
bearophile via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote:

https://issues.dlang.org/show_bug.cgi?id=13410

i added a 'quick-hack-patch' to your report which halves execution
time without significantly hurting performance in other AA uses.

now there is no difference between `long v = aa.byValue.front;` and
`long v = aa[k];`.

yet aa.remove() still hurts badly, and i don't think we can do any
better with it.
```
Sep 01 2014
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
```On Mon, 01 Sep 2014 09:16:04 +0000
bearophile via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote:

https://issues.dlang.org/show_bug.cgi?id=13410

added another version of the patch. it doesn't significantly improve
your original code's execution time, but makes synthetic tests run
at the same speed.

aa.remove() is a little slower now, but i don't think it's even
noticeable. but now we can use AAs in 'take first item and immediately
remove it' scenarios without a performance hit.
```
Sep 01 2014
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
```On Mon, 01 Sep 2014 09:16:04 +0000
bearophile via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote:

https://issues.dlang.org/show_bug.cgi?id=13410

ah, silly me. i should stop publishing patches until i've rewritten
'em at least three times.

we have a winner now: the new patch should not hurt AA performance in
normal use cases, yet our cases are now significantly faster (3 to 3000
times, yeah! ;-).
```
Sep 01 2014
"safety0ff" <safety0ff.dev gmail.com> writes:
The following D code runs over 2x faster than the C++ code
(comparing dmd with no options to g++ with no options). It's not a
fair comparison, because it changes the order of operations.

import core.stdc.stdio;

const uint H = 9, W = 12;

const uint[3][6] g = [[7, 0, H - 3],
                      [1 + (1 << H) + (1 << (2 * H)), 0, H - 1],
                      [3 + (1 << H), 0, H - 2],
                      [3 + (2 << H), 0, H - 2],
                      [1 + (1 << H) + (2 << H), 0, H - 2],
                      [1 + (1 << H) + (1 << (H - 1)), 1, H - 1]];

int main() {
    ulong p, i, k;
    ulong[uint] x, y;
    uint l;
    x[0] = 1;

    for (i = 0; i < W; ++i) {
        y = null;
        while (x.length)
            foreach (j; x.keys) {
                p = x[j];
                x.remove(j);

                for (k = 0; k < H; ++k)
                    if ((j & (1 << k)) == 0)
                        break;

                if (k == H)
                    y[j >> H] += p;
                else
                    for (l = 0; l < 6; ++l)
                        if (k >= g[l][1] && k <= g[l][2])
                            if ((j & (g[l][0] << k)) == 0)
                                x[j + (g[l][0] << k)] += p;
            }
        x = y;
    }

    printf("%lld\n", y[0]);
    return 0;
}
```
Sep 01 2014
"Martin Nowak" <code dawg.eu> writes:
You're comparing front removal on an ordered vs. an unordered
container.
Anyhow, at least my C++ lib also caches the first used bucket.
```
Oct 01 2014