
digitalmars.D - Slow GC?

reply bearophile <bearophileHUGS lycos.com> writes:
While writing a program I noticed it was quite slow (about 35-40 seconds to run,
versus less than 8 seconds for the same program in Python), so I progressively
simplified it down to the following D code, which shows a similar problem:


import std.file, std.string, std.gc;
import std.stdio: putr = writefln;
static import std.c.time;
double clock() {
    auto t = std.c.time.clock();
    return t/cast(double)std.c.time.CLOCKS_PER_SEC;
}
void main() {
  //disable; // uncomment to run without the GC (std.gc.disable)
  auto t0 = clock();
  auto txt = cast(string)read("text.txt"); // 6.3 MB of text
  auto t1 = clock();
  auto words = txt.split();
  auto t2 = clock();

  putr("loading time: ", t1 - t0); // 0.08 s
  putr("splitting time: ", t2 - t1); // 3.69 s with GC, 0.66 s without
  // Total running time with GC = 10.85 s
}


As you can see, the splitting time is about 3.7 seconds, but the program then
needs about 7 more seconds to deallocate the memory it used.

I have written the same program in Python, without using Psyco, disabling the
GC, or doing anything fancy:


from time import clock
t0 = clock()
txt = file("text.txt").read() # 6.3 MB of text
t1 = clock()
words = txt.split()
t2 = clock()

print "loading time:", t1 - t0
print "splitting time:", t2 - t1

Splitting is twice as fast, but that's not the most important thing to notice:
when the program finishes, it needs only about 1.1 seconds to deallocate its
memory and terminate.
It seems there is some problem in the D GC: in my original D program, out of a
total running time of about 40 seconds, 17 are spent in the final memory
deallocation, while the same program in Python needs less than 2 seconds for
the final memory + virtual machine deallocations.

Tests performed on Win2K, 256 MB RAM, Pentium III 500 MHz, with Python 2.5.2
and DMD 1.028. My timings are approximate, but a 0.2 s resolution is more than
enough to show the situation.

Bye,
bearophile
Mar 13 2008
parent reply "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Fri, 14 Mar 2008 04:38:23 +0200, bearophile <bearophileHUGS lycos.com> wrote:

 While writing a program I have seen it's quite slow (about 35-40 seconds
 to run a program that requires less than 8 seconds to run with Python),
 I have progressively simplified it, to the following D code, able to
 show a similar problem:
Splitting a huge block of text by whitespace will create an array that's
comparable in size to the original text (a slice is 8 bytes, so if words are
6-7 characters long on average, the sizes would be about the same). The
slowdown appears because the GC needs to look up every one of those pointers,
which all point into the same large memory region.

I made a patch for Tango's GC which caches the last page during the "mark"
part of the scan. This reduces the execution time of your program tenfold for
me, but it will only help when there are many consecutive pointers to the same
GC page which is part of a big (>2048 bytes) allocation. There should be no
noticeable performance penalty for the general case (the overhead is just one
compare and sometimes an assignment), so I've created an enhancement ticket:

http://dsource.org/projects/tango/ticket/982

Interestingly, when I first compiled your program with Tango's runtime, it ran
a lot slower. I found the bottleneck, then created a ticket with an
explanation and patch here:

http://dsource.org/projects/tango/ticket/981

I also have a patch for Phobos (for expanding arrays, not the GC), but it
doesn't yield a big improvement (about 13% faster).

--
Best regards,
 Vladimir                          mailto:thecybershadow gmail.com
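[Editor's note: the slice-array size claim above can be checked with a rough
back-of-the-envelope calculation. This is a hypothetical Python sketch, not
code from the thread; it assumes a 32-bit build, where a D slice is a 4-byte
pointer plus a 4-byte length.]

```python
# Rough arithmetic behind "the slice array is comparable in size to the text".
# Assumes a 32-bit build: a D slice = 4-byte pointer + 4-byte length = 8 bytes.

TEXT_BYTES = 6_300_000   # ~6.3 MB of text, as in the benchmark
AVG_WORD = 7             # words average ~6-7 characters
SLICE_BYTES = 8          # per-slice cost in the resulting array

words = TEXT_BYTES // (AVG_WORD + 1)   # roughly one separator per word
slice_array = words * SLICE_BYTES

print(words)        # ~787,500 words
print(slice_array)  # ~6.3 MB of slices: about the same size as the text
```

So the GC must scan roughly 787,500 pointers, all of which land inside the
same big text allocation, which is exactly the case the page cache speeds up.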
Mar 14 2008
parent reply "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Fri, 14 Mar 2008 09:54:22 +0200, Vladimir Panteleev <thecybershadow gmail.com> wrote:

 I made a patch for Tango's GC which caches the last page during the
 "mark" part of the scan.
Phobos patch: http://d.puremagic.com/issues/show_bug.cgi?id=1923

Improvement is also almost tenfold :)

--
Best regards,
 Vladimir                          mailto:thecybershadow gmail.com
Mar 14 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Vladimir Panteleev:

You are very fast :-)

Splitting a huge block of text by whitespace will cause an array that's
comparable to the size of the original text (a slice is 8 bytes, so if words
are 6-7 characters long by average, the size would be the same).<
Note that in practical situations I don't use split() on such a big chunk of
text. In those situations I use xsplit(), the lazy version (which you can find
in my d libs, and which I hope to see added to both the Python and D string
methods/string libraries); it is "hugely" faster in such situations :-)
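[Editor's note: the lazy xsplit() mentioned above isn't shown in the thread.
A minimal Python sketch of the idea (a hypothetical implementation, not
bearophile's actual code) could look like this:]

```python
def xsplit(text):
    """Lazily yield whitespace-separated words one at a time.

    Unlike str.split(), this never materializes the full list of words,
    so a huge text never produces a huge array of slices all at once.
    (Hypothetical sketch of the xsplit() idea, not bearophile's code.)
    """
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                yield text[start:i]
                start = None
        elif start is None:
            start = i
    if start is not None:       # trailing word with no final separator
        yield text[start:]

print(list(xsplit("the quick  brown fox\n")))  # ['the', 'quick', 'brown', 'fox']
```

Because the words are consumed one by one, the GC never has to scan a
multi-megabyte array of slices, which sidesteps the slowdown entirely.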
Phobos patch: http://d.puremagic.com/issues/show_bug.cgi?id=1923<
I'd like to try that code; can I do it?
Improvement is also almost tenfold :)<
Sorry for the potentially stupid question: is this an improvement in the text
splitting time, in the final garbage collection time when the program is
finished, or just a blind sum of both?

General note: I think more people should compare the speed of programs written
in other languages (like Python, Java, OCaml, Clean, C with GCC, Perl, etc.)
to the same programs written in D, because this (often?) shows spots where D
may need some improvements. I know D is still a language in development, so
too much early optimization may be bad, but I think a bit of care in that
regard may help anyway.

Bye and thank you,
bearophile
Mar 14 2008
parent reply "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Fri, 14 Mar 2008 15:00:48 +0200, bearophile <bearophileHUGS lycos.com> wrote:

 Phobos patch: http://d.puremagic.com/issues/show_bug.cgi?id=1923<
I'd like to try that code; can I do it?
Yes, just apply the patch (to src/phobos/internal/gc/gcx.d) and rebuild Phobos (using the makefiles).
Improvement is also almost tenfold :)<
 Sorry for the potentially stupid question: is this an improvement in the
 text splitting time, in the final garbage collection time when the program
 is finished, or is it just a blind sum of both?
Of the total running time. The slowdown happens when D starts a garbage
collection cycle while split() is running. Here are the exact times for me:

== Before ==
loading time: 0.14
splitting time: 0.875
total execution time: 3.234 seconds.

== After ==
loading time: 0.063
splitting time: 0.156
total execution time: 0.358 seconds.
 General note: I think more people have to compare the speed of programs
 written in other languages (like Python, Java, OCaml, Clean, C with GCC,
 Perl, etc) to the same programs written in D, because this (often?) shows
 spots where D may need some improvements. I know D is a language still in
 development, so too much early optimization may be bad, but I think a bit
 of care in that regard may help anyway.
Good optimizations often require significant and complicated changes to the
code. Luckily this one doesn't :)

--
Best regards,
 Vladimir                          mailto:thecybershadow gmail.com
Mar 14 2008
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Vladimir Panteleev wrote:
 Good optimizations often require significant and complicated changes
 to the code. Luckily this one doesn't :)
What exactly does the patch do?
Mar 14 2008
parent reply "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Sat, 15 Mar 2008 04:48:27 +0200, Walter Bright <newshound1 digitalmars.com> wrote:

 Vladimir Panteleev wrote:
 Good optimizations often require significant and complicated changes
 to the code. Luckily this one doesn't :)
What exactly does the patch do?
In gcx.mark(), it remembers the last pointer it examined that pointed to a
B_PAGE or B_PAGEPLUS memory page. If the next value that looks like a pointer
(is >= minAddr && < maxAddr) also points to the same memory page
((lastPointer & ~(PAGESIZE-1)) == (newPointer & ~(PAGESIZE-1))), it skips
looking at it, thus skipping a call to findPool, looking up the page's bin
size, and testing / setting the respective bits.

This greatly optimizes the GC for scanning areas that have consecutive
pointers pointing to the same block of memory, e.g. slices of a bigger array.
Note that I also changed "if (p >= minAddr)" to
"if (p >= minAddr && p < maxAddr)" (which should also give a speed boost when
scanning objects with static byte/ubyte/char arrays or floats).

--
Best regards,
 Vladimir                          mailto:thecybershadow gmail.com
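[Editor's note: the one-entry page cache described above can be illustrated
with a small simulation. This is a hypothetical Python sketch; the real code
is D, in gcx.mark(), and `lookup_page` here merely stands in for the
findPool/bin-size work the patch skips.]

```python
PAGESIZE = 4096
PAGE_MASK = ~(PAGESIZE - 1)

def mark(pointers, lookup_page):
    """Simulate the mark loop with a one-entry 'last page' cache.

    lookup_page stands in for the expensive findPool()/bin-size lookup;
    we count how many times it actually gets called.
    """
    last_page = None
    lookups = 0
    for p in pointers:
        if last_page is not None and (p & PAGE_MASK) == (last_page & PAGE_MASK):
            continue            # same page as last time: skip the expensive lookup
        lookups += 1
        lookup_page(p)
        last_page = p
    return lookups

# Consecutive slices into one big allocation: pointers 8 bytes apart,
# 1000 of them, spanning just two 4 KB pages.
slices = [0x100000 + 8 * i for i in range(1000)]
print(mark(slices, lambda p: None))  # 2 lookups instead of 1000
```

With slices of one big array, almost every scanned pointer falls on the same
page as its predecessor, so the cache turns ~1000 lookups into a handful,
which is where the near-tenfold speedup comes from.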
Mar 14 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
Thanks, that makes sense.
Mar 14 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Vladimir Panteleev:

Yes, just apply the patch (to src/phobos/internal/gc/gcx.d) and rebuild Phobos
(using the makefiles).<
I'm sure for most of you doing this is really easy, but for me it wasn't an
immediate thing. I have succeeded. Here is Phobos of DMD 1.028 for Win,
compiled with the patch (I'll keep the files online for just a few days):

http://www.fantascienza.net/leonardo/phobos.7z

Or a bigger download (use the smaller one if possible):

http://www.fantascienza.net/leonardo/phobos.zip

I suggest interested people try this patch, to find out whether it improves
or worsens code that uses the GC.
Here are the exact times for me:<
My timings; the deallocation is about 9 times faster!

OLD TIMINGS, with GC:
loading time: 0.41 s
splitting time: 3.76 s
TOTAL time: 11.4 s

OLD TIMINGS, without GC:
loading time: 0.40 s
splitting time: 0.67 s
TOTAL time: 8.25 s

NEW TIMINGS, with GC:
loading time: 0.39 s
splitting time: 0.8 s
TOTAL time: 1.71 s

NEW TIMINGS, without GC:
loading time: 0.39 s
splitting time: 0.67 s
TOTAL time: 1.58 s

The generation of a long list of matches with a regex like the following one
goes from 7.63 s to 3.28 s:

auto txt = cast(string)read("text.txt");
auto words = RegExp("[a-z]+", "g").match(txt);

The "bintrees" benchmark of the Shootout (the version of the code that doesn't
use manual memory management; the code is online on that site, look for my
nick) goes from 42.1 s to 34.1 s with input n = 15.
Good optimizations often require significant and complicated changes to the
code. Luckily this one doesn't :)<
I know, things aren't easy, but:

- Various tests have shown me that the D GC is (was? With this patch it may be
a bit faster in some situations) up to 10-11 times slower than the GC of the
latest Java, so I presume there's room for improvement.

- Comparing the speed of the same programs in D and in other fast languages
(C with GCC, OCaml, Clean, etc.) helps you *know* the spots where performance
is bad. Then you can leave things as they are, or you can try to improve them.
Otherwise you remain ignorant of the problems and can't even start thinking
about solving them. It's like a higher-level form of code (language)
profiling :-)

And by the way, now the D program I was talking about is about 15% faster
than the Python version ;-)

Bye and thank you,
bearophile
Mar 14 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
bearophile:
 My timings, the deallocation is about 9 times faster!
But watch the RAM used too; to me it seems to use more...

Bye,
bearophile
Mar 14 2008
parent "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Sat, 15 Mar 2008 05:00:14 +0200, bearophile <bearophileHUGS lycos.com> wrote:

 bearophile:
 My timings, the deallocation is about 9 times faster!
But watch the RAM used too, to me it seems to use more...
Odd, it shouldn't. The only memory overhead is one DWORD on the stack.

--
Best regards,
 Vladimir                          mailto:thecybershadow gmail.com
Mar 14 2008