www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Quantifying the Performance of Garbage Collection vs. Explicit Memory Management

reply "Craig Black" <craigblack2 cox.net> writes:
Interesting article:
http://lambda-the-ultimate.org/node/2552

-Craig
Mar 16 2008
next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Craig,

 Interesting article:
 http://lambda-the-ultimate.org/node/2552
 -Craig
 

That test sounds bogus to me. The results for manual memory management are effectively unbeatable. I rather suspect that most program lose the last reference to memory about the same time you can tell it won't be used (so it's almost as quick as possible there) and it has no overhead at all including the accounting needed to keep track of the memory. Add back in the post processing time and then tell us the numbers! What I want to see is: rebuild a Java app with smart pointer or something, or how about spec a large function block as detailed as you can without knowing if you will be using GC or not and then implement it both ways (sever times by sever people). Then compare those.
Mar 16 2008
parent reply "Craig Black" <craigblack2 cox.net> writes:
"BCS" <ao pathlink.com> wrote in message 
news:55391cb32a83e8ca55bd841aba20 news.digitalmars.com...
 Reply to Craig,

 Interesting article:
 http://lambda-the-ultimate.org/node/2552
 -Craig

That test sounds bogus to me. The results for manual memory management are effectively unbeatable. I rather suspect that most program lose the last reference to memory about the same time you can tell it won't be used (so it's almost as quick as possible there) and it has no overhead at all including the accounting needed to keep track of the memory. Add back in the post processing time and then tell us the numbers! What I want to see is: rebuild a Java app with smart pointer or something, or how about spec a large function block as detailed as you can without knowing if you will be using GC or not and then implement it both ways (sever times by sever people). Then compare those.

If I understand the article correctly, I think you are missing the main point of it. It is not merely saying ,"We ran conclusive test: GC loses to explicit memory management!" They went much further than that, and identified the primary bottleneck in modern GC implementations. Namely, that GC doesn't cooperate well with virtual memory. GC typically scans way more memory than the rest of the application, and consequently causes the most cache misses/page faults. This article is not bashing GC. I think it is indicating to us how we can best improve it. Perhaps some GC algorithm could be developed that is optimized specifically to produce the fewer page faults and cache misses. Thoughts? -Craig
Mar 16 2008
next sibling parent reply renoX <renosky free.fr> writes:
Craig Black a écrit :
 "BCS" <ao pathlink.com> wrote in message 
 news:55391cb32a83e8ca55bd841aba20 news.digitalmars.com...
 Reply to Craig,

 Interesting article:
 http://lambda-the-ultimate.org/node/2552
 -Craig

That test sounds bogus to me. The results for manual memory management are effectively unbeatable. I rather suspect that most program lose the last reference to memory about the same time you can tell it won't be used (so it's almost as quick as possible there) and it has no overhead at all including the accounting needed to keep track of the memory. Add back in the post processing time and then tell us the numbers! What I want to see is: rebuild a Java app with smart pointer or something, or how about spec a large function block as detailed as you can without knowing if you will be using GC or not and then implement it both ways (sever times by sever people). Then compare those.

If I understand the article correctly, I think you are missing the main point of it. It is not merely saying ,"We ran conclusive test: GC loses to explicit memory management!" They went much further than that, and identified the primary bottleneck in modern GC implementations. Namely, that GC doesn't cooperate well with virtual memory. GC typically scans way more memory than the rest of the application, and consequently causes the most cache misses/page faults. This article is not bashing GC. I think it is indicating to us how we can best improve it. Perhaps some GC algorithm could be developed that is optimized specifically to produce the fewer page faults and cache misses. Thoughts?

The same authors have made a VM friendly GC, it is linked at the end the article.. http://lambda-the-ultimate.org/node/2391 The are two drawback of their GC: 1) The OS needs to be modified so that the VM and the GC communicate. 2) It is a moving GC, in the original paper they compares their GC with moving and non-moving variant and the moving one win. For Java this isn't an issue, but for D it is: working with C's library wouldn't work if the GC is a moving one.. I don't know how to fix this one.. Regards, renoX
 
 -Craig
 

Mar 16 2008
next sibling parent reply "Craig Black" <cblack ara.com> writes:
 The are two drawback of their GC:
 1) The OS needs to be modified so that the VM and the GC communicate.

Perhaps not. There may be other ways to detect how the VM for virtual memory (not virtual machine) works without specific OS queries. It could be part of the application initialization to determine this. I don't know enough details to know whether this is possible.
 2) It is a moving GC, in the original paper they compares their GC with 
 moving and non-moving variant and the moving one win.

 For Java this isn't an issue, but for D it is: working with C's library 
 wouldn't work if the GC is a moving one..

 I don't know how to fix this one..

Perhaps we need to transition away from reliance on non-GC friendly C library functions? IMO, a moving GC for D should a high priority, and well worth the extra work. -Craig
Mar 17 2008
parent renoX <renosky free.fr> writes:
Craig Black Wrote:

 The are two drawback of their GC:
 1) The OS needs to be modified so that the VM and the GC communicate.

Perhaps not. There may be other ways to detect how the VM for virtual memory (not virtual machine) works without specific OS queries. It could be part of the application initialization to determine this. I don't know enough details to know whether this is possible.

I doubt it very much: how would the application know what is the memory pressure without querying the OS in some way or another? Their paper is quite detailed about how their GC works.. Regards, renoX
Mar 18 2008
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from renoX (renosky free.fr)'s article
 The same authors have made a VM friendly GC, it is linked at the end the
 article..
 http://lambda-the-ultimate.org/node/2391
 The are two drawback of their GC:
 1) The OS needs to be modified so that the VM and the GC communicate.
 2) It is a moving GC, in the original paper they compares their GC with
 moving and non-moving variant and the moving one win.
 For Java this isn't an issue, but for D it is: working with C's library
 wouldn't work if the GC is a moving one..
 I don't know how to fix this one..

There is published research regarding VM-aware garbage collectors, and creating one for D wouldn't be terribly difficult. Portability is more of an issue, as you've mentioned. I believe plugging one into Windows isn't terribly difficult, but for *nix it can require a kernel re-compile to get access to the proper hooks. Because of this, I don't think that VM-aware GCs are the best general purpose solution, even though their performance can be quite good. Sean
Mar 17 2008
next sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Sean Kelly (sean invisibleduck.org)'s article
 There is published research regarding VM-aware garbage collectors, and
creating one for D wouldn't be
 terribly difficult.

Oops. That blog links to one of the papers I was thinking of, so I guess you're already aware of it :-) Sean
Mar 17 2008
prev sibling next sibling parent reply "Craig Black" <cblack ara.com> writes:
"Sean Kelly" <sean invisibleduck.org> wrote in message 
news:frmhe4$20jp$1 digitalmars.com...
 == Quote from renoX (renosky free.fr)'s article
 The same authors have made a VM friendly GC, it is linked at the end the
 article..
 http://lambda-the-ultimate.org/node/2391
 The are two drawback of their GC:
 1) The OS needs to be modified so that the VM and the GC communicate.
 2) It is a moving GC, in the original paper they compares their GC with
 moving and non-moving variant and the moving one win.
 For Java this isn't an issue, but for D it is: working with C's library
 wouldn't work if the GC is a moving one..
 I don't know how to fix this one..

There is published research regarding VM-aware garbage collectors, and creating one for D wouldn't be terribly difficult. Portability is more of an issue, as you've mentioned. I believe plugging one into Windows isn't terribly difficult, but for *nix it can require a kernel re-compile to get access to the proper hooks. Because of this, I don't think that VM-aware GCs are the best general purpose solution, even though their performance can be quite good. Sean

I don't think the *nix kernel recompile issue should be a show stopper. I would be very happy with a prototype that was Windows only. Once we have it working and can measure the benefits, then tackle the *nix issues. A fast GC would really help D to be competitive. It would be nice if the D community was leading GC innovation (like Tango just did with the XML parser). -Craig
Mar 17 2008
parent reply renoX <renosky free.fr> writes:
Craig Black Wrote:
 "Sean Kelly" <sean invisibleduck.org> wrote in message 
 news:frmhe4$20jp$1 digitalmars.com...
 == Quote from renoX (renosky free.fr)'s article
 The same authors have made a VM friendly GC, it is linked at the end the
 article..
 http://lambda-the-ultimate.org/node/2391
 The are two drawback of their GC:
 1) The OS needs to be modified so that the VM and the GC communicate.
 2) It is a moving GC, in the original paper they compares their GC with
 moving and non-moving variant and the moving one win.
 For Java this isn't an issue, but for D it is: working with C's library
 wouldn't work if the GC is a moving one..
 I don't know how to fix this one..

There is published research regarding VM-aware garbage collectors, and creating one for D wouldn't be terribly difficult. Portability is more of an issue, as you've mentioned. I believe plugging one into Windows isn't terribly difficult, but for *nix it can require a kernel re-compile to get access to the proper hooks. Because of this, I don't think that VM-aware GCs are the best general purpose solution, even though their performance can be quite good. Sean

I don't think the *nix kernel recompile issue should be a show stopper. I would be very happy with a prototype that was Windows only.

Uh? Does Windows VM provides enough information as needed by the GC in the paper or are you thinking about a totally different GC? For the records, the authors of the GC in the paper submitted their VM patch for inclusion in the standard Linux kernel, but they didn't manage to convince kernel maintainers to include it.. At the time, the JVM was proprietary, if their GC goes into the JVM now that it is GPL maybe it would be a stronger argument. Regards, renoX
Mar 18 2008
parent reply "Craig Black" <craigblack2 cox.net> writes:
"renoX" <renosky free.fr> wrote in message 
news:frnvkn$2na5$1 digitalmars.com...
 Craig Black Wrote:
 "Sean Kelly" <sean invisibleduck.org> wrote in message
 news:frmhe4$20jp$1 digitalmars.com...
 == Quote from renoX (renosky free.fr)'s article
 The same authors have made a VM friendly GC, it is linked at the end 
 the
 article..
 http://lambda-the-ultimate.org/node/2391
 The are two drawback of their GC:
 1) The OS needs to be modified so that the VM and the GC communicate.
 2) It is a moving GC, in the original paper they compares their GC 
 with
 moving and non-moving variant and the moving one win.
 For Java this isn't an issue, but for D it is: working with C's 
 library
 wouldn't work if the GC is a moving one..
 I don't know how to fix this one..

There is published research regarding VM-aware garbage collectors, and creating one for D wouldn't be terribly difficult. Portability is more of an issue, as you've mentioned. I believe plugging one into Windows isn't terribly difficult, but for *nix it can require a kernel re-compile to get access to the proper hooks. Because of this, I don't think that VM-aware GCs are the best general purpose solution, even though their performance can be quite good. Sean

I don't think the *nix kernel recompile issue should be a show stopper. I would be very happy with a prototype that was Windows only.

Uh? Does Windows VM provides enough information as needed by the GC in the paper or are you thinking about a totally different GC?

Are you asking me or Sean? Sean made the claim that "Windows isn't terribly difficult". If he's right, then Windows provides some sort of provision for VM querying for stuff like this.
 For the records, the authors of the GC in the paper submitted their VM 
 patch for inclusion in the standard Linux kernel, but they didn't manage 
 to convince kernel maintainers to include it.. At the time, the JVM was 
 proprietary, if their GC goes into the JVM now that it is GPL maybe it 
 would be a stronger argument.

 Regards,
 renoX

It makes sense to me to solve the problem in Windows first (since Sean says it won't be that bad), and then tackle the Unix/Linux kernel recompile issue. -Craig
Mar 18 2008
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from Craig Black (craigblack2 cox.net)'s article
 "renoX" <renosky free.fr> wrote in message
 Uh? Does Windows VM provides enough information as needed by the GC in the
 paper or are you thinking about a totally different GC?

difficult". If he's right, then Windows provides some sort of provision for VM querying for stuff like this.

There are a number of papers on this topic--one being the paper linked in that blog and I believe Hans Boehm wrote another one. From what I remember of them, I believe they said that Windows exposes the necessary hooks already, but a kernel recompile was necessary to do the same thing on Linux. That sounds right as well, since a kernel recompile isn't really possible on Windows anyway, so either the feature would be available or it wouldn't. The real work would be a fundamental rewrite of how the mark phase operates. For a dedicated programmer, I'd estimate it would take about a week. Sean
Mar 19 2008
parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Sean Kelly wrote:
 == Quote from Craig Black (craigblack2 cox.net)'s article
 "renoX" <renosky free.fr> wrote in message
 Uh? Does Windows VM provides enough information as needed by the GC in the
 paper or are you thinking about a totally different GC?

difficult". If he's right, then Windows provides some sort of provision for VM querying for stuff like this.

There are a number of papers on this topic--one being the paper linked in that blog and I believe Hans Boehm wrote another one. From what I remember of them, I believe they said that Windows exposes the necessary hooks already, but a kernel recompile was necessary to do the same thing on Linux. That sounds right as well, since a kernel recompile isn't really possible on Windows anyway, so either the feature would be available or it wouldn't. The real work would be a fundamental rewrite of how the mark phase operates. For a dedicated programmer, I'd estimate it would take about a week.

IIRC (I read the paper a while back) a full implementation of their idea is a moving GC[1] which may take a bit longer to implement. It not only impacts the mark phase, but also the sweep phase[2] and it requires compiler modifications as well; it'll need to supply the GC with exact information about pointers in static data, on the heap and on the stack. Though I might be wrong; anyone care to implement this in GDC (+ gphobos and/or tango)? :) [1]: IIRC: When the OS indicates to their GC that it's about to swap some of their pages out, it starts a collection and attempts to move the live objects from multiple partially-used pages into fewer pages so the now-free pages can be released to the OS to prevent the swapping from taking place (or swap out the now-empty page they'll then avoid using, I'm not sure exactly). The cool thing is that they apparently managed to get that working pretty well while mostly avoiding pages being swapped back in because of collection activity. [2]: The equivalent of the current sweep phase will need to move objects and update all pointers to them, without modifying values that only look like pointers.
Mar 19 2008
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from Frits van Bommel (fvbommel REMwOVExCAPSs.nl)'s article
 [1]: IIRC: When the OS indicates to their GC that it's about to swap
 some of their pages out, it starts a collection and attempts to move the
 live objects from multiple partially-used pages into fewer pages so the
 now-free pages can be released to the OS to prevent the swapping from
 taking place (or swap out the now-empty page they'll then avoid using,
 I'm not sure exactly).
 The cool thing is that they apparently managed to get that working
 pretty well while mostly avoiding pages being swapped back in because of
 collection activity.

Ah nice. What I remember about the papers I read was that they simply scanned through pages being swapped out to bookmark any data being referenced. However, it seems like doing things this way would require every pointer to be evaluated before being followed to avoid hitting an inactive page. Sean
Mar 19 2008
parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Sean Kelly wrote:
 == Quote from Frits van Bommel (fvbommel REMwOVExCAPSs.nl)'s article
 [1]: IIRC: When the OS indicates to their GC that it's about to swap
 some of their pages out, it starts a collection and attempts to move the
 live objects from multiple partially-used pages into fewer pages so the
 now-free pages can be released to the OS to prevent the swapping from
 taking place (or swap out the now-empty page they'll then avoid using,
 I'm not sure exactly).
 The cool thing is that they apparently managed to get that working
 pretty well while mostly avoiding pages being swapped back in because of
 collection activity.

Ah nice. What I remember about the papers I read was that they simply scanned through pages being swapped out to bookmark any data being referenced. However, it seems like doing things this way would require every pointer to be evaluated before being followed to avoid hitting an inactive page.

Yes, they'd need to locate the metadata for the memory block a pointer refers to. I remember it being stored in the first page of the 'superpage' containing the block (a superpage is an aligned block of 2^N pages for some constant N, essentially a pool of memory blocks of a certain size with some metadata[1] in front of it). They set it up so that they never allowed those first pages to be swapped out. Since superpages are aligned a pointer to a GC-ed object can be converted into a superpage pointer by simply masking out the lower bits, but they still need to verify that it is in fact a valid superpage (i.e. that it *is* a GC-ed object) which would require some sort of lookup (hashtable?) on that pointer. IIRC they used 16-page superpages, requiring one lookup table entry per 64 KiB allocated (minus superpage metadata). [1]: IIRC they considered complete metadata (i.e. storing for each objects the number of references from swapped-out pages) to be too costly, so they instead implemented a partial scheme: A single reference count per page (or superpage, I'm not sure) and a bit "was this ever referenced from swap" per object, only clearing those bits when the reference counter for the containing (super?)page reached zero. They then only allowed the GC to move objects when their bit was clear. (But they could be moved just before the bit was set, so they could be put into pages that already had such objects in them in an attempt to limit the number of such pages. This allows more choice for the GC when it comes time to decide which pages to empty when another swap-out is imminent)
Mar 19 2008
prev sibling parent Robert Fraser <fraserofthenight gmail.com> writes:
Sean Kelly wrote:
 == Quote from renoX (renosky free.fr)'s article
 The same authors have made a VM friendly GC, it is linked at the end the
 article..
 http://lambda-the-ultimate.org/node/2391
 The are two drawback of their GC:
 1) The OS needs to be modified so that the VM and the GC communicate.
 2) It is a moving GC, in the original paper they compares their GC with
 moving and non-moving variant and the moving one win.
 For Java this isn't an issue, but for D it is: working with C's library
 wouldn't work if the GC is a moving one..
 I don't know how to fix this one..

There is published research regarding VM-aware garbage collectors, and creating one for D wouldn't be terribly difficult. Portability is more of an issue, as you've mentioned. I believe plugging one into Windows isn't terribly difficult, but for *nix it can require a kernel re-compile to get access to the proper hooks. Because of this, I don't think that VM-aware GCs are the best general purpose solution, even though their performance can be quite good. Sean

What's wrong with having different collectors for different OSes? In fact, with Tango's pluggable GC, I can see it being very easy to get collectors tailor-made to the abuse a particular piece of software plans to put on them. For example, it would be possible to create a moving collector or metronomic collector and compile that in. That would impose a number of restrictions/requirements on the programmer which a general mark-and-sweep would not, but if the user was explicitly compiling that in, he would be aware of that.
Mar 17 2008
prev sibling parent reply BCS <BCS pathlink.com> writes:
Craig Black wrote:
 
 If I understand the article correctly, I think you are missing the main 
 point of it.  It is not merely saying ,"We ran conclusive test: GC loses 
 to explicit memory management!"  They went much further than that, and 
 identified the primary bottleneck in modern GC implementations.  Namely, 
 that GC doesn't cooperate well with virtual memory.  GC typically scans 
 way more memory than the rest of the application, and consequently 
 causes the most cache misses/page faults.  This article is not bashing 
 GC.  I think it is indicating to us how we can best improve it.  Perhaps 
 some GC algorithm could be developed that is optimized specifically to 
 produce the fewer page faults and cache misses.
 

I didn't read the whole article (only the segments that were quoted) so I may have missed the point. The quoting article seem to me to be using this to claim that GC looses. If that's not the main point... Ok, my bad.
 Thoughts?

I built a version of lisp a few months back that needed some sort of libsegv library for it's GC. it sounds like it might be of use. Something like only scanning pages that have changed and caching the rest.
 
 -Craig
 

Mar 17 2008
parent reply Dan <murpsoft hotmail.com> writes:
BCS Wrote:

 Craig Black wrote:
 
 If I understand the article correctly, I think you are missing the main 
 point of it.  It is not merely saying ,"We ran conclusive test: GC loses 
 to explicit memory management!"  They went much further than that, and 
 identified the primary bottleneck in modern GC implementations.  Namely, 
 that GC doesn't cooperate well with virtual memory.  GC typically scans 
 way more memory than the rest of the application, and consequently 
 causes the most cache misses/page faults.  This article is not bashing 
 GC.  I think it is indicating to us how we can best improve it.  Perhaps 
 some GC algorithm could be developed that is optimized specifically to 
 produce the fewer page faults and cache misses.


I typically use GC, but when I don't it gets linked anyways. Bloat. As for preventing cache/page misses... Some blocks will be in cache (a), some pages will be in memory (b), and some will be on swap (c). Set (b) and set (c) can be differentiated by OS accounting. (a) is a subset of (b) does not overlap (c). Set (a) and set (b) cannot be differentiated by any explicit accounting. - Any system which allowed you to not analyze memory stored on swap will definitely improve performance. - Performing a cache miss causes the requested data not to be available for a guestimated 100 cycles. Prefetching batches of memory that you want to analyze, and analyzing it as it actually arrives means you could dramatically reduce cache miss overhead. - Not allowing pointers to be implicitly cast from other types might allow you to store pointers in the GC to all pointers, and store simple patterns for arrays of structs containing pointers and arrays of pointers. This could let you skip any pages that clearly don't have pointer information. This would become sensible if pointers were treated as a typedef'd fully qualified integer type (with shift, add etc working on them), and an alias of '*'. As we found with C, allowing one type to be used to mean two different things is bad, so I would recommend a p32 and a p64. In fact, causing all pointers and array {ptr,length} structs to be colocated where possible in a program would *dramatically* improve GC performance as only a few pages would ever need to be brought into cache during a GC phase - the same that would probably already be in cache during normal program operation. This is probably already true; but the effect would be more dramatic if the GC didn't need to analyze the contents of a pointer because it knew the contents themselves didn't contain another pointer? Best Regards, Dan
Mar 17 2008
parent renoX <renosky free.fr> writes:
Dan Wrote:
 In fact, causing all pointers and array {ptr,length} structs to be colocated
where possible in a program would *dramatically* improve GC performance as only
a few pages would ever need to be brought into cache during a GC phase - the
same that would probably already be in cache during normal program operation.

The question is if the 'where possible' is that big if you have an array of objects which contain pointers, then it isn't possible to colocate them with pointers.. Note that many allocators sort objects by their size so an array {ptr,length} and its data would be usually on different pages already. About the typing issue, yes having strong typing help a lot a GC, but this would imply that interoperability with C would have to use a JNI-like system instead of the easy linking that we have now as C is weakly typed.. Regards, renoX
Mar 18 2008
prev sibling parent "Vladimir Panteleev" <thecybershadow gmail.com> writes:
On Mon, 17 Mar 2008 22:23:53 +0200, Craig Black <cblack ara.com> wrote:

 I don't think the *nix kernel recompile issue should be a show stopper.  I
 would be very happy with a prototype that was Windows only.  Once we have it
 working and can measure the benefits, then tackle the *nix issues.  A fast
 GC would really help D to be competitive.  It would be nice if the D
 community was leading GC innovation (like Tango just did with the XML
 parser).

Actually, the article gave me an interesting idea of a generational GC. I may post a prototype today (Windows-only for now). -- Best regards, Vladimir mailto:thecybershadow gmail.com
Mar 17 2008