
digitalmars.D - Slow GC?

reply bearophile <bearophileHUGS lycos.com> writes:
While writing a program I noticed it was quite slow (about 35-40 seconds to run,
versus less than 8 seconds for the same program in Python), so I progressively
simplified it down to the following D code, which shows a similar problem:


import std.file, std.string, std.gc;
import std.stdio: putr = writefln;
static import std.c.time;
double clock() {
    auto t = std.c.time.clock();
    return t/cast(double)std.c.time.CLOCKS_PER_SEC;
}
void main() {
  //disable; // uncomment to run without the GC (std.gc.disable)
  auto t0 = clock();
  auto txt = cast(string)read("text.txt"); // 6.3 MB of text
  auto t1 = clock();
  auto words = txt.split();
  auto t2 = clock();

  putr("loading time: ", t1 - t0); // 0.08 s
  putr("splitting time: ", t2 - t1); // 3.69 s with GC, 0.66 s without
  // Total running time with GC = 10.85 s
}


As you can see, the splitting time is about 3.7 seconds, but the program then
needs about 7 more seconds to deallocate the memory it used.

I have written the same program in Python, without using Psyco, disabling the
GC, or doing anything fancy:


from time import clock
t0 = clock()
txt = file("text.txt").read() # 6.3 MB of text
t1 = clock()
words = txt.split()
t2 = clock()

print "loading time:", t1 - t0
print "splitting time:", t2 - t1

Splitting is twice as fast, but that's not the most important thing to notice:
when the program finishes, it needs only about 1.1 seconds to deallocate its
memory and terminate.
It seems there is some problem in the D GC: in my original D program, out of a
total running time of about 40 seconds, 17 are spent in the final memory
deallocation, while the same program in Python needs less than 2 seconds for
the final memory + virtual machine deallocations.

Tests performed on Win2K, 256 MB RAM, Pentium III 500 MHz, with Python 2.5.2
and DMD 1.028. My timings are approximate, but a 0.2 s resolution is more than
enough to show the situation.

Bye,
bearophile
Mar 13 2008
parent reply "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Fri, 14 Mar 2008 04:38:23 +0200, bearophile <bearophileHUGS lycos.com> wrote:

 While writing a program I have seen it's quite slow (about 35-40 seconds
 to run a program that requires less than 8 seconds to run with Python),
 I have progressively simplified it, to the following D code, able to
 show a similar problem:
Splitting a huge block of text by whitespace will create an array that's
comparable in size to the original text (a slice is 8 bytes, so if words are
6-7 characters long on average, the sizes would be about the same). The
slowdown appears because the GC needs to look up every one of those pointers,
which all point into the same large memory region.

I made a patch for Tango's GC which caches the last page during the "mark"
part of the scan. This reduces the execution time of your program tenfold for
me, but it will only help when there are many consecutive pointers to the same
GC page which is part of a big (>2048 bytes) allocation. There should be no
noticeable performance penalty for the general case (the overhead is just one
compare and sometimes an assignment), so I've created an enhancement ticket:

http://dsource.org/projects/tango/ticket/982

Interestingly, when I first compiled your program with Tango's runtime, it ran
a lot slower. I found the bottleneck, then created a ticket with an
explanation and patch here:

http://dsource.org/projects/tango/ticket/981

I also have a patch for Phobos (for expanding arrays, not the GC), but it
doesn't yield a big improvement (about 13% faster).

--
Best regards,
 Vladimir                          mailto:thecybershadow gmail.com
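[Editor's note: the slice-array size claim above can be checked with a rough
back-of-the-envelope calculation. This is a hypothetical Python sketch, not
code from the thread; it assumes a 32-bit build, where a D slice is a 4-byte
pointer plus a 4-byte length.]

```python
# Rough arithmetic behind "the slice array is comparable in size to the text".
# Assumes a 32-bit build: a D slice = 4-byte pointer + 4-byte length = 8 bytes.

TEXT_BYTES = 6_300_000   # ~6.3 MB of text, as in the benchmark
AVG_WORD = 7             # words average ~6-7 characters
SLICE_BYTES = 8          # per-slice cost in the resulting array

words = TEXT_BYTES // (AVG_WORD + 1)   # roughly one separator per word
slice_array = words * SLICE_BYTES

print(words)        # ~787,500 words
print(slice_array)  # ~6.3 MB of slices: about the same size as the text
```

So the GC must scan roughly 787,500 pointers, all of which land inside the
same big text allocation, which is exactly the case the page cache speeds up.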
Mar 14 2008
parent reply "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Fri, 14 Mar 2008 09:54:22 +0200, Vladimir Panteleev <thecybershadow gmail.com> wrote:

 I made a patch for Tango's GC which caches the last page during the
 "mark" part of the scan.
Phobos patch: http://d.puremagic.com/issues/show_bug.cgi?id=1923

Improvement is also almost tenfold :)

--
Best regards,
 Vladimir                          mailto:thecybershadow gmail.com
Mar 14 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Vladimir Panteleev:

You are very fast :-)

Splitting a huge block of text by whitespace will cause an array that's
comparable to the size of the original text (a slice is 8 bytes, so if words
are 6-7 characters long by average, the size would be the same).<
Note that in practical situations I don't use split() on such a big chunk of
text. In those situations I use xsplit(), the lazy version (which you can find
in my d libs, and which I hope to see added to both the Python and D string
methods/string libraries); it is "hugely" faster in such situations :-)
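[Editor's note: the lazy xsplit() mentioned above isn't shown in the thread.
A minimal Python sketch of the idea (a hypothetical implementation, not
bearophile's actual code) could look like this:]

```python
def xsplit(text):
    """Lazily yield whitespace-separated words one at a time.

    Unlike str.split(), this never materializes the full list of words,
    so a huge text never produces a huge array of slices all at once.
    (Hypothetical sketch of the xsplit() idea, not bearophile's code.)
    """
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                yield text[start:i]
                start = None
        elif start is None:
            start = i
    if start is not None:       # trailing word with no final separator
        yield text[start:]

print(list(xsplit("the quick  brown fox\n")))  # ['the', 'quick', 'brown', 'fox']
```

Because the words are consumed one by one, the GC never has to scan a
multi-megabyte array of slices, which sidesteps the slowdown entirely.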
Phobos patch: http://d.puremagic.com/issues/show_bug.cgi?id=1923<
I'd like to try that code; can I do it?
Improvement is also almost tenfold :)<
Sorry for the potentially stupid question: is this an improvement in the text
splitting time, in the final garbage collection time when the program is
finished, or just a blind sum of both?

General note: I think more people should compare the speed of programs written
in other languages (like Python, Java, OCaml, Clean, C with GCC, Perl, etc.)
to the same programs written in D, because this (often?) shows spots where D
may need some improvements. I know D is still a language in development, so
too much early optimization may be bad, but I think a bit of care in that
regard may help anyway.

Bye and thank you,
bearophile
Mar 14 2008
parent reply "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Fri, 14 Mar 2008 15:00:48 +0200, bearophile <bearophileHUGS lycos.com> wrote:

 Phobos patch: http://d.puremagic.com/issues/show_bug.cgi?id=1923<
I'd like to try that code; can I do it?
Yes, just apply the patch (to src/phobos/internal/gc/gcx.d) and rebuild Phobos (using the makefiles).
Improvement is also almost tenfold :)<
 Sorry for the potentially stupid question: is this an improvement in the
 text splitting time, in the final garbage collection time when the program
 is finished, or is it just a blind sum of both?
Of the total running time. The slowdown happens when D starts a garbage
collection cycle while split() is running. Here are the exact times for me:

== Before ==
loading time: 0.14
splitting time: 0.875
total execution time: 3.234 seconds.

== After ==
loading time: 0.063
splitting time: 0.156
total execution time: 0.358 seconds.
 General note: I think more people have to compare the speed of programs
 written in other languages (like Python, Java, OCaml, Clean, C with GCC,
 Perl, etc) to the same programs written in D, because this (often?) shows
 spots where D may need some improvements. I know D is a language still in
 development, so too much early optimization may be bad, but I think a bit
 of care in that regard may help anyway.
Good optimizations often require significant and complicated changes to the
code. Luckily this one doesn't :)

--
Best regards,
 Vladimir                          mailto:thecybershadow gmail.com
Mar 14 2008
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Vladimir Panteleev wrote:
 Good optimizations often require significant and complicated changes
 to the code. Luckily this one doesn't :)
What exactly does the patch do?
Mar 14 2008
parent reply "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Sat, 15 Mar 2008 04:48:27 +0200, Walter Bright <newshound1 digitalmars.com> wrote:

 Vladimir Panteleev wrote:
 Good optimizations often require significant and complicated changes
 to the code. Luckily this one doesn't :)
What exactly does the patch do?
In gcx.mark(), it remembers the last pointer it examined that pointed to a
B_PAGE or B_PAGEPLUS memory page. If the next value that looks like a pointer
(is >= minAddr && < maxAddr) also points to the same memory page
((lastPointer & ~(PAGESIZE-1)) == (newPointer & ~(PAGESIZE-1))), it skips
looking at it, thus skipping a call to findPool, looking up the page's bin
size, and testing / setting the respective bits.

This greatly optimizes the GC for scanning areas that have consecutive
pointers pointing to the same block of memory, e.g. slices of a bigger array.
Note that I also changed "if (p >= minAddr)" to
"if (p >= minAddr && p < maxAddr)" (which should also give a speed boost when
scanning objects with static byte/ubyte/char arrays or floats).

--
Best regards,
 Vladimir                          mailto:thecybershadow gmail.com
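[Editor's note: the one-entry page cache described above can be illustrated
with a small simulation. This is a hypothetical Python sketch; the real code
is D, in gcx.mark(), and `lookup_page` here merely stands in for the
findPool/bin-size work the patch skips.]

```python
PAGESIZE = 4096
PAGE_MASK = ~(PAGESIZE - 1)

def mark(pointers, lookup_page):
    """Simulate the mark loop with a one-entry 'last page' cache.

    lookup_page stands in for the expensive findPool()/bin-size lookup;
    we count how many times it actually gets called.
    """
    last_page = None
    lookups = 0
    for p in pointers:
        if last_page is not None and (p & PAGE_MASK) == (last_page & PAGE_MASK):
            continue            # same page as last time: skip the expensive lookup
        lookups += 1
        lookup_page(p)
        last_page = p
    return lookups

# Consecutive slices into one big allocation: pointers 8 bytes apart,
# 1000 of them, spanning just two 4 KB pages.
slices = [0x100000 + 8 * i for i in range(1000)]
print(mark(slices, lambda p: None))  # 2 lookups instead of 1000
```

With slices of one big array, almost every scanned pointer falls on the same
page as its predecessor, so the cache turns ~1000 lookups into a handful,
which is where the near-tenfold speedup comes from.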
Mar 14 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
Thanks, that makes sense.
Mar 14 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Vladimir Panteleev:

Yes, just apply the patch (to src/phobos/internal/gc/gcx.d) and rebuild Phobos
(using the makefiles).<
I'm sure for most of you doing this is really easy, but for me it wasn't an
immediate thing. I have succeeded. Here is Phobos of DMD 1.028 for Win,
compiled with the patch (I'll keep the files online for just a few days):

http://www.fantascienza.net/leonardo/phobos.7z

Or a bigger download (use the smaller one if possible):

http://www.fantascienza.net/leonardo/phobos.zip

I suggest interested people try this patch, to find out whether it improves
or worsens code that uses the GC.
Here are the exact times for me:<
My timings; the deallocation is about 9 times faster!

OLD TIMINGS, with GC:
loading time: 0.41 s
splitting time: 3.76 s
TOTAL time: 11.4 s

OLD TIMINGS, without GC:
loading time: 0.40 s
splitting time: 0.67 s
TOTAL time: 8.25 s

NEW TIMINGS, with GC:
loading time: 0.39 s
splitting time: 0.8 s
TOTAL time: 1.71 s

NEW TIMINGS, without GC:
loading time: 0.39 s
splitting time: 0.67 s
TOTAL time: 1.58 s

The generation of a long list of matches with a regex like the following one
goes from 7.63 s to 3.28 s:

auto txt = cast(string)read("text.txt");
auto words = RegExp("[a-z]+", "g").match(txt);

The "bintrees" benchmark of the Shootout (the version of the code that doesn't
use manual memory management; the code is online on that site, look for my
nick) goes from 42.1 s to 34.1 s with input n = 15.
Good optimizations often require significant and complicated changes to the
code. Luckily this one doesn't :)<
I know, things aren't easy, but:

- Various tests have shown me that the D GC is (was? With this patch it may be
a bit faster in some situations) up to 10-11 times slower than the GC of the
latest Java, so I presume there's room for improvement.

- Comparing the speed of the same programs in D and in other fast languages
(C with GCC, OCaml, Clean, etc.) helps you *know* the spots where performance
is bad. Then you can leave things as they are, or you can try to improve them.
Otherwise you remain ignorant of the problems and can't even start thinking
about solving them. It's like a higher-level form of code (language)
profiling :-)

And by the way, now the D program I was talking about is about 15% faster
than the Python version ;-)

Bye and thank you,
bearophile
Mar 14 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
bearophile:
 My timings, the deallocation is about 9 times faster!
But watch the RAM used too; to me it seems to use more...

Bye,
bearophile
Mar 14 2008
parent "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Sat, 15 Mar 2008 05:00:14 +0200, bearophile <bearophileHUGS lycos.com> wrote:

 bearophile:
 My timings, the deallocation is about 9 times faster!
But watch the RAM used too, to me it seems to use more...
Odd, it shouldn't. The only memory overhead is one DWORD on the stack.

--
Best regards,
 Vladimir                          mailto:thecybershadow gmail.com
Mar 14 2008