
digitalmars.D - Precise GC

reply Walter Bright <newshound2 digitalmars.com> writes:
Of course, many of us have been thinking about this for a looong time, and about what the best way to go about it is. The usual technique is for the compiler to emit some sort of table for each TypeInfo giving the layout of the object, i.e. where the pointers are.

The general problem with these is that the table is non-trivial: it will require things like iterated data blocks, etc. It has to be compressed to save space, and the GC then has to execute a fair amount of code to decode it.

It also requires some significant work on the compiler end, leading of course to complexity, rigidity, development bottlenecks, and the usual bugs.

An alternative Andrei and I have been talking about is to put in the TypeInfo a 
pointer to a function. That function will contain customized code to mark the 
pointers in an instance of that type. That custom code will be generated by a 
template defined by the library. All the compiler has to do is stupidly 
instantiate the template for the type, and insert an address to the generated 
function.

The compiler need know NOTHING about how the marking works.

Even better, as ctRegex has demonstrated, the custom generated code can be very, very fast compared with a runtime table-driven approach. (The slow part will be calling the function indirectly.)

And best of all, the design is pushed out of the compiler into the library, so 
various schemes can be tried out without needing compiler work.

I think this is an exciting idea: it will enable us to get a precise GC by enabling people to work on it in parallel rather than serially waiting for me.
Apr 07 2012
next sibling parent "Froglegs" <lugtug gmail.com> writes:
  That sounds cool, perhaps people can have customizable GC for
specific applications?

Looking forward to D having a precise GC
Apr 07 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/7/12 8:56 PM, Walter Bright wrote:
[snip]
 I think this is an exciting idea, it will enable us to get a precise gc
 by enabling people to work on it in parallel rather than serially
 waiting for me.

I'm also very excited about this design, and will make time to help with the library part of the implementation. Maybe we can get a GSoC project on that. We already have a related proposal (lock-free GC). Andrei
Apr 07 2012
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/7/12 9:49 PM, Andrei Alexandrescu wrote:
 On 4/7/12 8:56 PM, Walter Bright wrote:
 [snip]
 I think this is an exciting idea, it will enable us to get a precise gc
 by enabling people to work on it in parallel rather than serially
 waiting for me.

I'm also very excited about this design, and will make time to help with the library part of the implementation. Maybe we can get a GSoC project on that. We already have a related proposal (lock-free GC).

BTW, one exciting thing about both this and the nascent attributes design is that they integrate wonderfully with the language's generic and generative capabilities. Andrei
Apr 07 2012
prev sibling next sibling parent reply "Antti-Ville Tuunainen" <avtuunainen gmail.com> writes:
On Sunday, 8 April 2012 at 02:49:34 UTC, Andrei Alexandrescu 
wrote:
 Maybe we can get a GSoC project on that. We already have a 
 related proposal (lock-free GC).

That would be me. Just thought I should wave and say hello. Are there other GSoC proposals in the GC area?
Apr 14 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-04-14 19:46, Antti-Ville Tuunainen wrote:
 On Sunday, 8 April 2012 at 02:49:34 UTC, Andrei Alexandrescu wrote:
 Maybe we can get a GSoC project on that. We already have a related
 proposal (lock-free GC).

That would be me. Just thought I should wave and say hello. Are there other GSoC proposals in the GC area?

There is: http://prowiki.org/wiki4d/wiki.cgi?GSOC_2012_Ideas -- /Jacob Carlborg
Apr 14 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-04-15 04:11, Antti-Ville Tuunainen wrote:
 On Saturday, 14 April 2012 at 19:17:02 UTC, Jacob Carlborg wrote:

 There is: http://prowiki.org/wiki4d/wiki.cgi?GSOC_2012_Ideas

That's the ideas list for proposals. I was asking if anyone else applied for GSoC using one of those.

Aha, I see. The chosen projects have not been announced yet. -- /Jacob Carlborg
Apr 15 2012
prev sibling parent "Antti-Ville Tuunainen" <avtuunainen gmail.com> writes:
On Saturday, 14 April 2012 at 19:17:02 UTC, Jacob Carlborg wrote:

 There is: http://prowiki.org/wiki4d/wiki.cgi?GSOC_2012_Ideas

That's the ideas list for proposals. I was asking if anyone else applied for GSoC using one of those.
Apr 14 2012
prev sibling next sibling parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
Hey, that sounds awesome.  I think I geeked out a bit.

Would this make it any easier to reference count types that can be 
statically proven to have no cyclical references?
Apr 07 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/7/2012 7:58 PM, Chad J wrote:
 Hey, that sounds awesome. I think I geeked out a bit.

 Would this make it any easier to reference count types that can be statically
 proven to have no cyclical references?

It has nothing to do with reference counting that I can think of.
Apr 07 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/7/12 9:59 PM, Walter Bright wrote:
 On 4/7/2012 7:58 PM, Chad J wrote:
 Hey, that sounds awesome. I think I geeked out a bit.

 Would this make it any easier to reference count types that can be
 statically
 proven to have no cyclical references?

It has nothing to do with reference counting that I can think of.

Nevertheless good food for thought. This is all very cool. Andrei
Apr 07 2012
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 04/08/2012 03:56 AM, Walter Bright wrote:
 Of course, many of us have been thinking about this for a looong time,
 and what is the best way to go about it. The usual technique is for the
 compiler to emit some sort of table for each TypeInfo giving the layout
 of the object, i.e. where the pointers are.

 The general problem with these is the table is non-trivial, as it will
 require things like iterated data blocks, etc. It has to be compressed
 to save space, and the gc then has to execute a fair amount of code to
 decode it.

 It also requires some significant work on the compiler end, leading of
 course to complexity, rigidity, development bottlenecks, and the usual
 bugs.

 An alternative Andrei and I have been talking about is to put in the
 TypeInfo a pointer to a function. That function will contain customized
 code to mark the pointers in an instance of that type. That custom code
 will be generated by a template defined by the library. All the compiler
 has to do is stupidly instantiate the template for the type, and insert
 an address to the generated function.

 The compiler need know NOTHING about how the marking works.

 Even better, as ctRegex has demonstrated, the custom generated code can
 be very, very fast compared with a runtime table-driven approach. (The
 slow part will be calling the function indirectly.)

 And best of all, the design is pushed out of the compiler into the
 library, so various schemes can be tried out without needing compiler work.

 I think this is an exciting idea, it will enable us to get a precise gc
 by enabling people to work on it in parallel rather than serially
 waiting for me.

That actually sounds like a pretty awesome idea.
Apr 08 2012
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 04/08/2012 10:45 AM, Timon Gehr wrote:
 That actually sounds like a pretty awesome idea.

Make sure that the compiler does not actually rely on the fact that the template generates a function. The design should include the possibility of just generating tables. It all should be completely transparent to the compiler, if that is possible.
Apr 08 2012
parent reply Alex Rønne Petersen <xtzgzorex gmail.com> writes:
On 08-04-2012 11:42, Manu wrote:
 On 8 April 2012 11:56, Timon Gehr <timon.gehr gmx.ch
 <mailto:timon.gehr gmx.ch>> wrote:

     On 04/08/2012 10:45 AM, Timon Gehr wrote:

         That actually sounds like a pretty awesome idea.


     Make sure that the compiler does not actually rely on the fact that
     the template generates a function. The design should include the
     possibility of just generating tables. It all should be completely
     transparent to the compiler, if that is possible.


 This sounds important to me. If it is also possible to do the work with
 generated tables, and not calling thousands of indirect functions in
 someone's implementation, it would be nice to reserve that possibility.
 Indirect function calls in hot loops make me very nervous for non-x86
 machines.

Yes, I agree here. The last thing we need is a huge amount of kinda-sorta-virtual function calls on ARM, MIPS, etc. It may work fine on x86, but anywhere else, it's really not what you want in a GC. -- - Alex
Apr 08 2012
parent reply deadalnix <deadalnix gmail.com> writes:
On 08/04/2012 14:02, Alex Rønne Petersen wrote:
 On 08-04-2012 11:42, Manu wrote:
 On 8 April 2012 11:56, Timon Gehr <timon.gehr gmx.ch
 <mailto:timon.gehr gmx.ch>> wrote:

 On 04/08/2012 10:45 AM, Timon Gehr wrote:

 That actually sounds like a pretty awesome idea.


 Make sure that the compiler does not actually rely on the fact that
 the template generates a function. The design should include the
 possibility of just generating tables. It all should be completely
 transparent to the compiler, if that is possible.


 This sounds important to me. If it is also possible to do the work with
 generated tables, and not calling thousands of indirect functions in
 someone's implementation, it would be nice to reserve that possibility.
 Indirect function calls in hot loops make me very nervous for non-x86
 machines.

Yes, I agree here. The last thing we need is a huge amount of kinda-sorta-virtual function calls on ARM, MIPS, etc. It may work fine on x86, but anywhere else, it's really not what you want in a GC.

Nothing prevents the generated function from itself calling other generated functions when things are predictable. That avoids many indirect calls, purely in the library, which is good (it can be tuned per application/platform).
Apr 09 2012
parent reply deadalnix <deadalnix gmail.com> writes:
On 09/04/2012 20:33, Manu wrote:
 Eh?
 Not sure what you mean. The idea is the template would produce a
 struct/table of data instead of being a pointer to a function, this way
 the GC could work without calling anything. If the GC was written to
 assume GC info in a particular format/structure, it could be written
 without any calls.
 I'm just saying to leave that as a possibility, and not REQUIRE an
 indirect function call for every single allocation in the system. Some
 GC might be able to make better use of that sort of setup.

If you have references to objects, you can't avoid a function call. If you have something you know at compile time, the generated function can directly call the other function that marks the pointed-to data (or can even do it itself, if you don't fear code bloat) without going back to the GC and its indirect call.

So it makes no difference in the number of indirect calls you have, but the struct proposal is a stronger constraint on the GC than the function one.

BTW, starting your answer with « Not sure what you mean. » should have been a red flag.
Apr 09 2012
parent reply deadalnix <deadalnix gmail.com> writes:
On 10/04/2012 00:39, Manu wrote:
 It is, and I still don't follow. I can't imagine there are any indirect
 function calls, except for the ones introduced by this proposal, where
 you may register a function to mark the pointers in complex structs.
 You seem to be suggesting that another one already exists anyway? Where
 is it? Why is it there?

OK, back to basics.

For every type, a function template (let's call it GCscan) will be instantiated to scan it. This function can be ANY code. ANY code includes the possibility of GCscan!A calling GCscan!B directly, without going back to the GC main loop and an indirect call. If inlined, you can forget about the function call altogether (and you can force that using a mixin template, for example, but it is likely to generate massive code bloat).

This can't be done for references/pointers to polymorphic types, but for any other type it is an available option, and it can reduce the number of indirect calls dramatically.
Apr 10 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/10/12 3:03 AM, deadalnix wrote:
 For every type, a function template (let's call it GCscan) will be
 instantiated to scan it. This function can be ANY code. ANY code include
 the possibility for GCscan!A to call GCscan!B directly, without going
 back to GC main loop and indirect call. If inlined, you can forget about
 function call at all (and you can force that using mixin template for
 example, but it is likely to massively generate code bloat).

That is correct. The code bloat is equal to what would be generated if the end user sat down and wrote appropriate collection routines for all types by hand.
 This can't be done for reference/pointer to polymorphic types, but for
 any other it is an available option, and it can reduce dramatically the
 number of indirect calls.

Indeed. For non-final class member variables, the template will fetch their TypeInfo, from that the pointer to the mark function, and will issue the call through the pointer. Andrei
Apr 10 2012
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 04/08/2012 10:45 AM, Timon Gehr wrote:
 That actually sounds like a pretty awesome idea.

I understand that the stack will still have to be scanned conservatively, but how does the scheme deal with closures?
Apr 08 2012
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/8/2012 2:21 AM, Timon Gehr wrote:
 On 04/08/2012 10:45 AM, Timon Gehr wrote:
 That actually sounds like a pretty awesome idea.

I understand that the stack will still have to be scanned conservatively, but how does the scheme deal with closures?

For now, just treat them conservatively.
Apr 08 2012
prev sibling parent reply Rainer Schuetze <r.sagitario gmx.de> writes:
On 4/8/2012 11:21 AM, Timon Gehr wrote:
 On 04/08/2012 10:45 AM, Timon Gehr wrote:
 That actually sounds like a pretty awesome idea.

I understand that the stack will still have to be scanned conservatively, but how does the scheme deal with closures?

I guess the compiler should generate an (anonymous) struct type corresponding to the closure data layout. There probably has to be a template for compiler generated structs or classes anyway. This new type could also be used as the type of the context pointer, so a debugger could display the closure variables.
Apr 08 2012
parent Alex Rønne Petersen <xtzgzorex gmail.com> writes:
On 08-04-2012 12:07, Rainer Schuetze wrote:
 On 4/8/2012 11:21 AM, Timon Gehr wrote:
 On 04/08/2012 10:45 AM, Timon Gehr wrote:
 That actually sounds like a pretty awesome idea.

I understand that the stack will still have to be scanned conservatively, but how does the scheme deal with closures?

I guess the compiler should generate an (anonymous) struct type corresponding to the closure data layout. There probably has to be a template for compiler generated structs or classes anyway. This new type could also be used as the type of the context pointer, so a debugger could display the closure variables.

This sounds sensible to me. No reason closure marking can't be precise if the compiler just emits the relevant type info (pretty much any other compiler with closures does this; see C#, F#, etc). -- - Alex
Apr 08 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 8 April 2012 11:56, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 04/08/2012 10:45 AM, Timon Gehr wrote:

 That actually sounds like a pretty awesome idea.

Make sure that the compiler does not actually rely on the fact that the template generates a function. The design should include the possibility of just generating tables. It all should be completely transparent to the compiler, if that is possible.

This sounds important to me. If it is also possible to do the work with generated tables, and not calling thousands of indirect functions in someone's implementation, it would be nice to reserve that possibility.

Indirect function calls in hot loops make me very nervous for non-x86 machines.
Apr 08 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 9 April 2012 21:20, deadalnix <deadalnix gmail.com> wrote:

 On 08/04/2012 14:02, Alex Rønne Petersen wrote:

  On 08-04-2012 11:42, Manu wrote:
 On 8 April 2012 11:56, Timon Gehr <timon.gehr gmx.ch
 <mailto:timon.gehr gmx.ch>> wrote:

 On 04/08/2012 10:45 AM, Timon Gehr wrote:

 That actually sounds like a pretty awesome idea.


 Make sure that the compiler does not actually rely on the fact that
 the template generates a function. The design should include the
 possibility of just generating tables. It all should be completely
 transparent to the compiler, if that is possible.


 This sounds important to me. If it is also possible to do the work with
 generated tables, and not calling thousands of indirect functions in
 someone's implementation, it would be nice to reserve that possibility.
 Indirect function calls in hot loops make me very nervous for non-x86
 machines.

Yes, I agree here. The last thing we need is a huge amount of kinda-sorta-virtual function calls on ARM, MIPS, etc. It may work fine on x86, but anywhere else, it's really not what you want in a GC.

Nothing prevents the generated function from itself calling other generated functions when things are predictable. That avoids many indirect calls, purely in the library, which is good (it can be tuned per application/platform).

Eh?

Not sure what you mean. The idea is the template would produce a struct/table of data instead of being a pointer to a function, this way the GC could work without calling anything. If the GC was written to assume GC info in a particular format/structure, it could be written without any calls.

I'm just saying to leave that as a possibility, and not REQUIRE an indirect function call for every single allocation in the system. Some GC might be able to make better use of that sort of setup.
Apr 09 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 10 April 2012 00:06, deadalnix <deadalnix gmail.com> wrote:

 On 09/04/2012 20:33, Manu wrote:

  Eh?
 Not sure what you mean. The idea is the template would produce a
 struct/table of data instead of being a pointer to a function, this way
 the GC could work without calling anything. If the GC was written to
 assume GC info in a particular format/structure, it could be written
 without any calls.
 I'm just saying to leave that as a possibility, and not REQUIRE an
 indirect function call for every single allocation in the system. Some
 GC might be able to make better use of that sort of setup.

 If you have references to objects, you can't avoid a function call. If you
 have something you know at compile time, the generated function can directly
 call the other function that marks the pointed-to data (or can even do it
 itself, if you don't fear code bloat) without going back to the GC and its
 indirect call.

 So it makes no difference in the number of indirect calls you have, but the
 struct proposal is a stronger constraint on the GC than the function one.

 BTW, starting your answer with « Not sure what you mean. » should have been
 a red flag.

It is, and I still don't follow. I can't imagine there are any indirect function calls, except for the ones introduced by this proposal, where you may register a function to mark the pointers in complex structs.

You seem to be suggesting that another one already exists anyway? Where is it? Why is it there?
Apr 09 2012
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Sunday, 8 April 2012 at 12:02:10 UTC, Alex Rønne Petersen 
wrote:
 This sounds important to me. If it is also possible to do the 
 work with
 generated tables, and not calling thousands of indirect 
 functions in
 someone's implementation, it would be nice to reserve that 
 possibility.
 Indirect function calls in hot loops make me very nervous for 
 non-x86
 machines.

Yes, I agree here. The last thing we need is a huge amount of kinda-sorta-virtual function calls on ARM, MIPS, etc. It may work fine on x86, but anywhere else, it's really not what you want in a GC.

What's the problem with virtual calls on ARM?
Apr 13 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 13 April 2012 15:53, Kagamin <spam here.lot> wrote:

 On Sunday, 8 April 2012 at 12:02:10 UTC, Alex Rønne Petersen wrote:

 This sounds important to me. If it is also possible to do the work with
 generated tables, and not calling thousands of indirect functions in
 someone's implementation, it would be nice to reserve that possibility.
 Indirect function calls in hot loops make me very nervous for non-x86
 machines.

Yes, I agree here. The last thing we need is a huge amount of kinda-sorta-virtual function calls on ARM, MIPS, etc. It may work fine on x86, but anywhere else, it's really not what you want in a GC.

What's the problem with virtual calls on ARM?

No other processors have branch prediction units anywhere near the sophistication of modern x86. Any call through a function pointer stalls the pipeline, pipelines are getting longer all the time, and PPC has even more associated costs/hazards. Most processors can only perform trivial binary branch prediction around an 'if'.

It also places a burden on the icache (unable to prefetch), and of course the dcache, both of which are much less sophisticated than x86 as well. The compiler can't do anything with code locality (which improves icache usage), since the target is unknown at compile time... there are also pipeline stalls introduced by the sequence of indirect pointer lookups preceding any virtual call.

Virtuals are possibly the worst hazard to modern CPUs, and the hardest to detect/profile: since their cost is evenly spread throughout the entire code base, you can never gauge their true impact on your performance. You also can't easily measure the effect of icache misses on your code; suffice to say, you will have MANY more in virtual-heavy code.

While I'm at it: 'final:' and 'virtual' keywords please ;)
Apr 13 2012
prev sibling next sibling parent James Miller <james aatch.net> writes:
On 2012-04-13 16:54:28 +0300 Manu <turkeyman gmail.com> wrote:
 While I'm at it. 'final:' and 'virtual' keyword please ;)

Hmmm, I thought we decided that was a good idea. Anybody in the know whether this is going to happen or not? -- James Miller
Apr 13 2012
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Friday, 13 April 2012 at 13:54:39 UTC, Manu wrote:
 No other processors have branch prediction units anywhere near
 the sophistication of modern x86. Any call through a function 
 pointer
 stalls the pipeline, pipelines are getting longer all the time, 
 and PPC has
 even more associated costs/hazards.
 Most processors can only perform trivial binary branch 
 prediction around an
 'if'.
 It also places a burden on the icache (unable to prefetch), and of 
 course the dcache, both of which are much less sophisticated than x86 
 as well.

Allocating small aggregate objects usually means allocating several equally small objects of different types in a row, so they sit one after another in the heap, and the GC will visit them in a row, each time calling a different function than the previous time. To an x86 processor this would result in constant misprediction: AFAIK an x86 processor caches only one target address per branch (ARM caches a flag?). And the icache should not suffer in either case: once you prefetched the function, it will remain in the icache and be reused from there the next time.
Apr 13 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:

On 13 April 2012 17:25, Kagamin <spam here.lot> wrote:
 once you prefetched the function, it will remain in the icache and be
 reused from there the next time.

All depends how much you love object orientation. If you follow the C++ book and make loads of classes for everything, you'll thrash the hell out of it. If you only have a couple of different objects, maybe they'll coexist in the icache.

The GC is a bit of a special case though, because it runs in a tight loop. That said, the pipeline hazards still exist regardless of the state of the icache. Conventional virtuals are worse, since during the process of executing regular code there's not usually such a tight loop pattern. (Note: I was answering the prior question about virtual functions in general, not as applied to the specific use case of a GC scan.)

The latest 'turbo' ARM chips (Cortex, etc.) and such may store a branch target table; they are alleged to have improved prediction, but I haven't checked. Prior chips (standard ARM9 and down; note: most non-top-end Androids fall in this category, as do all games consoles with ARMs in them) don't technically have branch prediction at all. ARM has conditional execution bits on instructions, so it can filter opcodes based on the state of the flags register. This is a cool design for binary branching, or for performing 'select' operations, but it can't do anything to help an indirect jump.

Point is, the GC is the most fundamental performance hazard to D, and I just think it's important to make sure the design is such that it is possible to write a GC loop which can do its job with generated data tables if possible, instead of requiring generated marker functions. It would seem to me that carefully crafted tables of data could technically perform the same function as marker functions, but without any function calls... and the performance characteristics may be better/worse for different architectures.
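The table-driven alternative suggested here could be as simple as a per-type bitmap with one bit per pointer-sized word. A minimal sketch, assuming a hypothetical `PointerMap` layout (the name and structure are invented for illustration, not any actual runtime's format):

```d
// Hypothetical table-driven scan: bit i set means word i of the object
// may hold a pointer. One generic, branch-light loop serves every type,
// with no indirect call per object.
struct PointerMap
{
    size_t sizeInWords; // object size in pointer-sized words
    size_t[] bits;      // packed bitmap: bit i set => word i is a pointer slot
}

void scanWithMap(void* p, const ref PointerMap map,
                 scope void delegate(void* candidate) mark)
{
    enum wordBits = 8 * size_t.sizeof;
    auto words = cast(void**) p;
    foreach (i; 0 .. map.sizeInWords)
        if (map.bits[i / wordBits] & (cast(size_t) 1 << (i % wordBits)))
            mark(words[i]); // hand each candidate pointer to the collector
}

// e.g. for struct S { int a; int b; S* next; } on 64-bit:
// a and b pack into word 0, next is word 1, so bits = [0b10].
```

Whether this beats a generated function depends on the architecture, which is exactly the point about keeping both options open.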
Apr 14 2012
prev sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Sat, 14 Apr 2012 05:21:08 -0500, Manu <turkeyman gmail.com> wrote:

 On 13 April 2012 17:25, Kagamin <spam here.lot> wrote:
 once you prefetched the function, it will remain in the icache and be
 reused from there the next time.

All depends how much you love object orientation. If you follow the C++ book and make loads of classes for everything, you'll thrash the hell out of it. If you only have a couple of different objects, maybe they'll coexist in the icache.

The GC is a bit of a special case though, because it runs in a tight loop. That said, the pipeline hazards still exist regardless of the state of the icache. Conventional virtuals are worse, since during the process of executing regular code there's not usually such a tight loop pattern. (Note: I was answering the prior question about virtual functions in general, not as applied to the specific use case of a GC scan.)

The latest 'turbo' ARM chips (Cortex, etc.) and such may store a branch target table; they are alleged to have improved prediction, but I haven't checked. Prior chips (standard ARM9 and down; note: most non-top-end Androids fall in this category, as do all games consoles with ARMs in them) don't technically have branch prediction at all. ARM has conditional execution bits on instructions, so it can filter opcodes based on the state of the flags register. This is a cool design for binary branching, or for performing 'select' operations, but it can't do anything to help an indirect jump.

Point is, the GC is the most fundamental performance hazard to D, and I just think it's important to make sure the design is such that it is possible to write a GC loop which can do its job with generated data tables if possible, instead of requiring generated marker functions. It would seem to me that carefully crafted tables of data could technically perform the same function as marker functions, but without any function calls... and the performance characteristics may be better/worse for different architectures.

Is there any reason to assume that the GC is actually using the mark function for marking? The mark function is only there to provide the GC with the information it needs in order to mark. What it actually is is defined by the runtime. So it may just return a bitmask. Or it might _be_ a bitmask (for objects <= 512 bytes). Or a 64-bit index into some internal table. Etc. In fact, real functions only start making sense with very large arrays and compound structs/objects.

My actual concern with this approach is that a DLL compiled with GC A will have problems interacting with a program using GC B.
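The idea of letting the "mark function" slot hold something other than a function could be encoded with a tag bit. A purely illustrative sketch (this is not druntime's actual layout; all names are made up):

```d
alias MarkFn = void function(void* p);

// Hypothetical encoding: for small objects the TypeInfo slot holds an
// inline pointer bitmap, tagged by its low bit; only large or complex
// types pay for a real generated marker function.
union MarkInfo
{
    size_t inlineBits; // low bit 1 => remaining bits are the pointer bitmap
    MarkFn fn;         // low bit 0 => address of a generated marker function
}

bool isInlineBitmap(MarkInfo m)
{
    // Function addresses are at least 2-byte aligned, so their low bit
    // is always 0; a set low bit can never be a valid code address.
    return (m.inlineBits & 1) != 0;
}
```

The alignment guarantee is what makes the tagging safe: the two representations can share one word without ambiguity.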
Apr 14 2012
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 07 Apr 2012 21:56:09 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 Of course, many of us have been thinking about this for a looong time,  
 and what is the best way to go about it. The usual technique is for the  
 compiler to emit some sort of table for each TypeInfo giving the layout  
 of the object, i.e. where the pointers are.

 The general problem with these is the table is non-trivial, as it will  
 require things like iterated data blocks, etc. It has to be compressed  
 to save space, and the gc then has to execute a fair amount of code to  
 decode it.

 It also requires some significant work on the compiler end, leading of  
 course to complexity, rigidity, development bottlenecks, and the usual  
 bugs.

 An alternative Andrei and I have been talking about is to put in the  
 TypeInfo a pointer to a function. That function will contain customized  
 code to mark the pointers in an instance of that type. That custom code  
 will be generated by a template defined by the library. All the compiler  
 has to do is stupidly instantiate the template for the type, and insert  
 an address to the generated function.

 The compiler need know NOTHING about how the marking works.

 Even better, as ctRegex has demonstrated, the custom generated code can  
 be very, very fast compared with a runtime table-driven approach. (The  
 slow part will be calling the function indirectly.)

 And best of all, the design is pushed out of the compiler into the  
 library, so various schemes can be tried out without needing compiler  
 work.

 I think this is an exciting idea, it will enable us to get a precise gc  
 by enabling people to work on it in parallel rather than serially  
 waiting for me.

I think this is a really good idea. I would like to go further and propose that there be an arbitrary way to add members to the TypeInfo types using templates. Not sure how it would be implemented, but I don't see why this has to be specific to GCs. Some way to signify "hey compiler, please initialize this member with template X given the type being compiled". This could be a huge bridge between compile-time and runtime type information. -Steve
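As a sketch of the scheme described above: the names `gcMark` and `MarkDg` are illustrative, not the actual druntime API, and this assumes the compiler simply stores `&gcMark!T` in each type's TypeInfo.

```d
import std.traits : isPointer;

alias MarkDg = void delegate(void** slot);

// Hypothetical library-side template: the compiler would merely
// instantiate gcMark!T for each type and record its address in the
// TypeInfo; the compiler needs to know nothing about how marking works.
void gcMark(T)(void* p, scope MarkDg mark)
{
    auto obj = cast(T*) p;
    foreach (ref field; (*obj).tupleof) // unrolled at compile time
    {
        alias F = typeof(field);
        static if (isPointer!F || is(F == class))
            mark(cast(void**) &field);  // a slot that may hold a GC pointer
        else static if (is(F == struct))
            gcMark!F(&field, mark);     // recurse into embedded structs
        // plain scalars generate no code at all
    }
}

struct Node { int value; Node* next; double weight; Node* prev; }

unittest
{
    Node n;
    size_t slots;
    gcMark!Node(&n, (void** s) { ++slots; });
    assert(slots == 2); // only next and prev are pointer slots
}
```

Because the field walk happens at compile time, the generated function is straight-line code specialized per type, which is where the ctRegex-style speedup would come from.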
Apr 09 2012
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On 08/04/2012 03:56, Walter Bright wrote:
 Of course, many of us have been thinking about this for a looong time,
 and what is the best way to go about it. The usual technique is for the
 compiler to emit some sort of table for each TypeInfo giving the layout
 of the object, i.e. where the pointers are.

 The general problem with these is the table is non-trivial, as it will
 require things like iterated data blocks, etc. It has to be compressed
 to save space, and the gc then has to execute a fair amount of code to
 decode it.

 It also requires some significant work on the compiler end, leading of
 course to complexity, rigidity, development bottlenecks, and the usual
 bugs.

 An alternative Andrei and I have been talking about is to put in the
 TypeInfo a pointer to a function. That function will contain customized
 code to mark the pointers in an instance of that type. That custom code
 will be generated by a template defined by the library. All the compiler
 has to do is stupidly instantiate the template for the type, and insert
 an address to the generated function.

 The compiler need know NOTHING about how the marking works.

 Even better, as ctRegex has demonstrated, the custom generated code can
 be very, very fast compared with a runtime table-driven approach. (The
 slow part will be calling the function indirectly.)

 And best of all, the design is pushed out of the compiler into the
 library, so various schemes can be tried out without needing compiler work.

 I think this is an exciting idea, it will enable us to get a precise gc
 by enabling people to work on it in parallel rather than serially
 waiting for me.

This is a good idea. However, it doesn't handle type qualifiers, and that is important!

D2's type system is made in such a way that most data is either thread-local or immutable, and only a small amount is shared. Both thread-local storage and immutability are sources of BIG improvements for the GC; doing without them is a huge design error.

For instance, OCaml's GC is known to be more performant than Java's, because in OCaml most data is immutable, and the GC takes advantage of this. Immutability means 100% concurrent garbage collection. On the other hand, TLS can be collected independently and only affects the thread that owns the data.

Both are very powerful improvements, and the design you propose, as it stands, cannot provide any means to handle them. That is a big missed opportunity, and it will be hard to change in the future.
Apr 09 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/9/2012 11:30 AM, deadalnix wrote:
 On the other hand, TLS can be collected independently and only affects the
 thread that owns the data. Both are very powerful improvements, and the design
 you propose, as it stands, cannot provide any means to handle them. That is a
 big missed opportunity, and it will be hard to change in the future.

I think this is an orthogonal issue.
Apr 09 2012
parent deadalnix <deadalnix gmail.com> writes:
On 09/04/2012 23:27, Walter Bright wrote:
 On 4/9/2012 11:30 AM, deadalnix wrote:
 On the other hand, TLS can be collected independently and only
 affects the thread that owns the data. Both are very powerful
 improvements, and the design you propose, as it stands, cannot
 provide any means to handle them. That is a big missed opportunity,
 and it will be hard to change in the future.

I think this is an orthogonal issue.

You mean an allocator/deallocator one? I'm not sure.

For instance, concurrent scanning of shared memory will require some magic on reference changes (it can be hooked into the program using page protection). In such a case, you have constraints on what the scanning function can and can't do. If the function is scanning immutable data, that constraint disappears.

In a similar way, when scanning TLS, you'll want to avoid going into the non-TLS world. This is currently possible only if you go back to the main GC code and trigger the indirect call every single time you encounter a pointer or a reference, which is going to be a performance killer on many architectures.

So this code, one way or another, will need to be aware of the qualifiers. Otherwise it will either require passing every single pointer/reference through an indirect function call, or forgo the optimizations that the type system was designed to allow (in the program in general, not just in the GC).
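Since the marker is a library template instantiated per (qualified) type, qualifier-aware strategies like these could in principle be selected at compile time. A hedged sketch of the dispatch only; the strategy bodies are omitted and all names are invented:

```d
// Illustrative only: branch on the qualifier of T when the template is
// instantiated, so immutable, shared and thread-local data could be
// registered with different pools or scanning strategies.
enum ScanKind { threadLocal, sharedData, immutableData }

template scanKindOf(T)
{
    static if (is(T == immutable))
        enum scanKindOf = ScanKind.immutableData; // never mutates: scan concurrently
    else static if (is(T == shared))
        enum scanKindOf = ScanKind.sharedData;    // may need barriers/page tricks
    else
        enum scanKindOf = ScanKind.threadLocal;   // collectible per thread
}

static assert(scanKindOf!(immutable(int[])) == ScanKind.immutableData);
static assert(scanKindOf!(shared(int)) == ScanKind.sharedData);
static assert(scanKindOf!(int*) == ScanKind.threadLocal);
```

Whether the rest of the runtime can exploit this (write barriers, per-thread pools) is the open design question raised above.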
Apr 10 2012