
digitalmars.D - [xmlp] the recent garbage collector performance improvements

reply Richard Webb <webby beardmouse.org.uk> writes:
Last night I tried loading a ~20 megabyte xml file using xmlp (via the
DocumentBuilder.LoadFile function) and a recent dmd build, and found that it
took ~48 seconds to complete, which is rather poor.
I tried running it through a profiler, and that said that almost all the
runtime was spent inside the garbage collector.

I then tried the same test using the latest Git versions of dmd/druntime (with
pull request 108 merged in), and that took less than 10 seconds. This is a
rather nice improvement, though still somewhat on the slow side.

Some profiler numbers, if anyone is interested:

Old version:
Gcxfullcollect: 31.14 seconds, 69.26% runtime.
Gcxmark: 4.84 seconds, 10.77% runtime.
Gcxfindpool: 2.10 seconds, 4.67% runtime.

New version:
Gcxmark: 11.67 seconds, 50.77% runtime.
Gcxfindpool: 3.58 seconds, 15.55% runtime.
Gcxfullcollect: 1.69 seconds, 7.37% runtime.

(Assuming that Sleepy is giving me accurate numbers. The new version is
definitely faster though).
Feb 01 2012
next sibling parent reply "dsimcha" <dsimcha yahoo.com> writes:
Interesting.  I'm glad my improvements seem to matter in the real 
world, though I'm thoroughly impressed with the amount of 
speedup.  Even the small allocation benchmark that I was 
optimizing only sped up by ~50% from 2.057 to 2.058 overall and 
~2x in collection time.  I'd be very interested if you could make 
a small, self-contained test program to use as a benchmark.

GC performance is one of D's biggest weak spots, so it would 
probably be a good bit of marketing to show that the performance 
is substantially better than it used to be even if it's not great 
yet.  Over the past year I've been working on and off at speeding 
it up.  It's now at least ~2x faster than it was last year at 
this time on every benchmark I've tried and up to several hundred 
times faster in the extreme case of huge allocations.

On Wednesday, 1 February 2012 at 18:33:58 UTC, Richard Webb wrote:
 Last night I tried loading a ~20 megabyte xml file using xmlp 
 (via the
 DocumentBuilder.LoadFile function) and a recent dmd build, and 
 found that it
 took ~48 seconds to complete, which is rather poor.
 I tried running it through a profiler, and that said that 
 almost all the
 runtime was spent inside the garbage collector.

 I then tried the same test using the latest Git versions of 
 dmd/druntime (with
 pull request 108 merged in), and that took less than 10 
 seconds. This is a
 rather nice improvement, though still somewhat on the slow side.

 Some profiler numbers, if anyone is interested:

 Old version:
 Gcxfullcollect: 31.14 seconds, 69.26% runtime.
 Gcxmark: 4.84 seconds, 10.77% runtime.
 Gcxfindpool: 2.10 seconds, 4.67% runtime.

 New version:
 Gcxmark: 11.67 seconds, 50.77% runtime.
 Gcxfindpool: 3.58 seconds, 15.55% runtime.
 Gcxfullcollect: 1.69 seconds, 7.37% runtime.

 (Assuming that Sleepy is giving me accurate numbers. The new 
 version is
 definitely faster though).
Feb 01 2012
next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
XML is a prime example of a shallow tree structure,
so it's the prime target for the recent optimization.
Feb 01 2012
prev sibling parent reply Richard Webb <webby beardmouse.org.uk> writes:
On 01/02/2012 19:35, dsimcha wrote:
 I'd be very
 interested if you could make a small, self-contained test program to use
 as a benchmark.
The 'test' is just

/////////////////
import std.xml2;

void main()
{
    string xmlPath = r"test.xml";
    auto document = DocumentBuilder.LoadFile(xmlPath, false, false);
}
/////////////////

It's xmlp that does all the work (and takes all the time).

I'll see about generating a simple test file, but basically:

50000 top level nodes
each one has 6 child nodes
each node has a single attribute, and the child nodes each have a short text value.

Parsing the file with DMD 2.057 takes ~25 seconds

Parsing the file with DMD 2.058 (Git) takes ~6.1 seconds

Parsing the file with DMD 2.058, with the GC disabled during the LoadFile call, takes ~2.2 seconds.

For comparison, MSXML6 takes 1.6 seconds to load the same file.
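For anyone wanting to reproduce the "GC disabled during the LoadFile call" timing, the pattern is just core.memory's GC.disable/GC.enable around the parse. A minimal self-contained sketch (buildNodes here is a made-up stand-in for the allocation-heavy parse, since xmlp isn't part of Phobos):

```d
import core.memory : GC;

// Stand-in for an allocation-heavy parse such as DocumentBuilder.LoadFile;
// it makes many small GC allocations, much like building a DOM tree does.
int[][] buildNodes(size_t count)
{
    auto nodes = new int[][](count);
    foreach (ref n; nodes)
        n = new int[](4);
    return nodes;
}

void main()
{
    GC.disable();              // suspend collections during the parse
    scope (exit) GC.enable();  // re-enable even if the parse throws

    auto nodes = buildNodes(50_000);
    assert(nodes.length == 50_000);
}
```

With collections suspended the heap can only grow during the call, which is a reasonable trade when the result is long-lived anyway (as a loaded DOM is).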
Feb 01 2012
next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 02.02.2012, 01:41 Uhr, schrieb Richard Webb <webby beardmouse.org.uk>:

 Parsing the file with DMD 2.057 takes ~25 seconds

 Parsing the file with DMD 2.058(Git) takes ~6.1 seconds

 Parsing the file with DMD 2.058, with the GC disabled during the  
 LoadFile call, takes ~2.2 seconds.


 For comparison, MSXML6 takes 1.6 seconds to load the same file.
Speaking of which, why not also compare the memory consumption (peak working set size)?

Memory vs. CPU is the typical trade-off, so it might be interesting from that point, but I also wonder what the overhead for GC-managed memory is - assuming that MSXML6 uses only ref counting. And if it is a whole lot more (like >+50%), what methods could apply that 'waste' less space.

This API function should get the job done: GetProcessMemoryInfo
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683219(v=vs.85).aspx
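That call is usable from D directly. A hedged sketch via druntime's Windows bindings (module paths assume a reasonably recent druntime, and the program must be linked against psapi.lib):

```d
version (Windows)
{
    import core.sys.windows.windows : GetCurrentProcess;
    import core.sys.windows.psapi : GetProcessMemoryInfo, PROCESS_MEMORY_COUNTERS;
    import std.stdio : writefln;

    void reportPeakWorkingSet()
    {
        PROCESS_MEMORY_COUNTERS pmc;
        // Fills pmc with the current process's memory counters,
        // including the peak working set size in bytes.
        if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, pmc.sizeof))
            writefln("peak working set: %s MiB",
                     pmc.PeakWorkingSetSize / (1024 * 1024));
    }
}
```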
Feb 01 2012
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Richard Webb:

 Parsing the file with DMD 2.057 takes ~25 seconds
 
 Parsing the file with DMD 2.058(Git) takes ~6.1 seconds
 
 Parsing the file with DMD 2.058, with the GC disabled during the 
 LoadFile call, takes ~2.2 seconds.
 
 
 For comparison, MSXML6 takes 1.6 seconds to load the same file.
Not long ago the Python devs added a heuristic to the Python GC (which is a reference counter plus a cycle breaker): it "switches off" if it detects that the program is allocating many items in a short time. Is it possible to add something similar to the D GC?

Bye,
bearophile
Feb 01 2012
parent "dsimcha" <dsimcha yahoo.com> writes:
On Thursday, 2 February 2012 at 01:27:44 UTC, bearophile wrote:
 Richard Webb:

 Parsing the file with DMD 2.057 takes ~25 seconds
 
 Parsing the file with DMD 2.058(Git) takes ~6.1 seconds
 
 Parsing the file with DMD 2.058, with the GC disabled during 
 the LoadFile call, takes ~2.2 seconds.
 
 
 For comparison, MSXML6 takes 1.6 seconds to load the same file.
Not long ago the Python devs added a heuristic to the Python GC (which is a reference counter plus a cycle breaker): it "switches off" if it detects that the program is allocating many items in a short time. Is it possible to add something similar to the D GC?

Bye,
bearophile
I actually tried to add something like this a while back, but I couldn't find a heuristic that worked reasonably well. The idea was just to create a timeout so that the GC can't run for x milliseconds after it has just run.
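That heuristic can be sketched as a simple throttle (hypothetical code, not the actual druntime patch; the 50 ms window is an arbitrary placeholder for the "x milliseconds" knob):

```d
import core.time : Duration, MonoTime, msecs;

// Hypothetical throttle: after a collection, refuse to collect again
// for a fixed window, forcing the allocator to grow the heap instead.
struct CollectThrottle
{
    MonoTime lastCollect;
    Duration minInterval = 50.msecs;

    bool mayCollect()
    {
        immutable now = MonoTime.currTime;
        if (lastCollect != MonoTime.init && now - lastCollect < minInterval)
            return false;  // collected too recently: skip this collection
        lastCollect = now;
        return true;
    }
}

void main()
{
    CollectThrottle t;
    assert(t.mayCollect());   // first collection is allowed
    assert(!t.mayCollect());  // an immediate retry is throttled
}
```

The hard part, as noted above, is picking a window that helps allocation-heavy phases without letting the heap balloon in steady-state programs.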
Feb 01 2012
prev sibling parent reply "dsimcha" <dsimcha yahoo.com> writes:
Wait a minute, since when do we even have a std.xml2?  I've never 
heard of it and it's not in the Phobos source tree (I just 
checked).

On Thursday, 2 February 2012 at 00:41:31 UTC, Richard Webb wrote:
 On 01/02/2012 19:35, dsimcha wrote:
 I'd be very
 interested if you could make a small, self-contained test 
 program to use
 as a benchmark.
The 'test' is just

/////////////////
import std.xml2;

void main()
{
    string xmlPath = r"test.xml";
    auto document = DocumentBuilder.LoadFile(xmlPath, false, false);
}
/////////////////

It's xmlp that does all the work (and takes all the time).

I'll see about generating a simple test file, but basically:

50000 top level nodes
each one has 6 child nodes
each node has a single attribute, and the child nodes each have a short text value.

Parsing the file with DMD 2.057 takes ~25 seconds

Parsing the file with DMD 2.058 (Git) takes ~6.1 seconds

Parsing the file with DMD 2.058, with the GC disabled during the LoadFile call, takes ~2.2 seconds.

For comparison, MSXML6 takes 1.6 seconds to load the same file.
Feb 01 2012
next sibling parent reply "a" <a a.com> writes:
It looks like someone wrote it a year ago, but it was never added 
to phobos: 
http://www.digitalmars.com/d/archives/digitalmars/D/announce/std.xml2_candidate_19804.html

On Thursday, 2 February 2012 at 04:39:11 UTC, dsimcha wrote:
 Wait a minute, since when do we even have a std.xml2?  I've 
 never heard of it and it's not in the Phobos source tree (I 
 just checked).

 On Thursday, 2 February 2012 at 00:41:31 UTC, Richard Webb 
 wrote:
 On 01/02/2012 19:35, dsimcha wrote:
 I'd be very
 interested if you could make a small, self-contained test 
 program to use
 as a benchmark.
The 'test' is just

/////////////////
import std.xml2;

void main()
{
    string xmlPath = r"test.xml";
    auto document = DocumentBuilder.LoadFile(xmlPath, false, false);
}
/////////////////

It's xmlp that does all the work (and takes all the time).

I'll see about generating a simple test file, but basically:

50000 top level nodes
each one has 6 child nodes
each node has a single attribute, and the child nodes each have a short text value.

Parsing the file with DMD 2.057 takes ~25 seconds

Parsing the file with DMD 2.058 (Git) takes ~6.1 seconds

Parsing the file with DMD 2.058, with the GC disabled during the LoadFile call, takes ~2.2 seconds.

For comparison, MSXML6 takes 1.6 seconds to load the same file.
Feb 01 2012
parent Richard Webb <richard.webb boldonjames.com> writes:
On 02/02/2012 04:53, a wrote:
 It looks like someone wrote it a year ago, but it was never added to
 phobos:
 http://www.digitalmars.com/d/archives/digitalmars/D/announce/std.xml2_candidate_19804.html
 .
That's the one -> http://www.dsource.org/projects/xmlp/ (the modules are under std.).
Feb 02 2012
prev sibling parent reply "Jesse Phillips" <jessekphillips+D gmail.com> writes:
On Thursday, 2 February 2012 at 04:39:11 UTC, dsimcha wrote:
 Wait a minute, since when do we even have a std.xml2?  I've 
 never heard of it and it's not in the Phobos source tree (I 
 just checked).
As others point out, it is xmlp. I've been using it for the last year or two, and recently saw an improvement in performance (as mentioned).

I don't know how comparable our cases are: Richard seems to see a slowdown from the GC, but for me disabling the GC during load doesn't change the load time (though I'm not using the document loader).

I hope Michael will put some effort into getting it ready for review. But still, people should use it and give him some incentive to do the work.
Feb 02 2012
next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, February 02, 2012 18:11:34 Jesse Phillips wrote:
 On Thursday, 2 February 2012 at 04:39:11 UTC, dsimcha wrote:
 Wait a minute, since when do we even have a std.xml2? I've
 never heard of it and it's not in the Phobos source tree (I
 just checked).
As others point out, it is xmlp. I've been using it for the last year or two, and recently saw an improvement in performance (as mentioned).

I don't know how comparable our cases are: Richard seems to see a slowdown from the GC, but for me disabling the GC during load doesn't change the load time (though I'm not using the document loader).

I hope Michael will put some effort into getting it ready for review. But still, people should use it and give him some incentive to do the work.
In theory, Tomek Sowiński was working on a replacement for std.xml, and it was the likely candidate to replace it, but he seems to have disappeared... His last post, back in June, was about an XML writer. So I don't know what the current situation with that is.

- Jonathan M Davis
Feb 02 2012
prev sibling parent reply Richard Webb <webby beardmouse.org.uk> writes:
On 02/02/2012 17:11, Jesse Phillips wrote:
 for me disabling the GC during load doesn't change load time,
 but I'm not using the document loader.
The GC hit is related to the number of DOM nodes that exist at once, I think -> the visitor approach doesn't allocate the whole DOM tree, so there are far fewer items (and less allocated memory for the GC to scan). For comparison, parsing my test file using XmlVisitor takes less than 3 seconds (over twice as fast as the DOM version).

I looked into it a bit and found this: http://www.dsource.org/projects/xmlp/ticket/10

Seems that it's calling GC.qmalloc 350000+ times, mostly for 16-byte blocks, when normalizing attributes. This doesn't seem hugely clever :-(
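A common way to avoid that per-attribute allocation pattern (a hypothetical sketch, not xmlp's actual code) is to normalize into one reused buffer and allocate only the final, exactly-sized string:

```d
import std.array : Appender, appender;

// Hypothetical sketch: collapse runs of whitespace into a reused buffer
// instead of GC-allocating a fresh small block per attribute.
string normalizeAttr(ref Appender!(char[]) buf, const(char)[] raw)
{
    buf.clear();  // reuse the buffer's existing capacity
    bool lastWasSpace = false;
    foreach (c; raw)
    {
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
        {
            if (!lastWasSpace)
                buf.put(' ');
            lastWasSpace = true;
        }
        else
        {
            buf.put(c);
            lastWasSpace = false;
        }
    }
    return buf.data.idup;  // one allocation, sized exactly to the result
}

void main()
{
    auto buf = appender!(char[])();
    assert(normalizeAttr(buf, "a \t b") == "a b");
}
```

This turns hundreds of thousands of 16-byte GC blocks into one reusable buffer plus one right-sized allocation per attribute value.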
Feb 02 2012
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
Speaking of GC improvements, I googled around a bit and found this
thread from 2 years ago:

	http://www.digitalmars.com/d/archives/digitalmars/D/Is_there_a_modern_GC_for_D_105967.html

Especially interesting is this paper:

	Nonintrusive Cloning Garbage Collector with Stock Operating
	System Support.
	        http://www.cs.purdue.edu/homes/grr/snapshot-gc.ps

Has anything been done along these lines since that time? Seems like
this particular GC algorithm is exactly the kind we need for D. It's a
conservative mark-and-sweep algo with a very low overhead(*), mark phase
concurrent with mutator thread(s), and lazy incremental sweeping at
allocation time.  Synchronization is automatically done by default OS
kernel-space mechanisms (copy on write memory pages).

More to the point, how easy/hard is it to switch between GCs in the
current D implementation(s)? I think it would be helpful if these kinds
of experimental GCs were available in addition to the current default
GC, and people can play around with them and find out which one(s) are
the cream of the crop. Otherwise we're just bottlenecked at a small
number of people who can actually play with GC algos for D -- which
means improvements will be slow.


(*) True on *nix systems anyway, but judging from comments in
that thread, Windoze also was acquiring equivalent functionality -- and
this being 2 years ago, I'm assuming it's available now.


--T
Feb 02 2012
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Thu, 02 Feb 2012 19:31:50 -0600, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:

 Speaking of GC improvements, I googled around a bit and found this
 thread from 2 years ago:

 	http://www.digitalmars.com/d/archives/digitalmars/D/Is_there_a_modern_GC_for_D_105967.html

 Especially interesting is this paper:

 	Nonintrusive Cloning Garbage Collector with Stock Operating
 	System Support.
 	        http://www.cs.purdue.edu/homes/grr/snapshot-gc.ps

 Has anything been done along these lines since that time? Seems like
 this particular GC algorithm is exactly the kind we need for D. It's a
 conservative mark-and-sweep algo with a very low overhead(*), mark phase
 concurrent with mutator thread(s), and lazy incremental sweeping at
 allocation time.  Synchronization is automatically done by default OS
 kernel-space mechanisms (copy on write memory pages).

 More to the point, how easy/hard is it to switch between GCs in the
 current D implementation(s)? I think it would be helpful if these kinds
 of experimental GCs were available in addition to the current default
 GC, and people can play around with them and find out which one(s) are
 the cream of the crop. Otherwise we're just bottlenecked at a small
 number of people who can actually play with GC algos for D -- which
 means improvements will be slow.


 (*) True on *nix systems anyway, but judging from comments in
 that thread, Windoze also was acquiring equivalent functionality -- and
 this being 2 years ago, I'm assuming it's available now.
So to answer your question: yes, someone has made one of these types of GC for D, called CDGC. No, it doesn't look like Windows will support this anytime soon. And cloning GCs don't solve the problems of large heaps or the soft/hard realtime issues of things like game render threads, and they actually make the embarrassingly long pauses (when they happen) longer.

Now, CDGC, as I understand it, is better written than our current GC and would probably improve things, but I don't see it as the final end goal.
Feb 02 2012
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 02, 2012 at 08:48:20PM -0600, Robert Jacques wrote:
[...]
 So to answer your question, yes, someone has made one of these types
 of GC for D called CDGC. No, it doesn't look like Windows will support
 this anytime soon.  And cloning GCs, don't solve the problems of large
 heaps, soft/hard realtime issues like game render threads, and it
 actually makes the embarrassingly long pause (when they happen)
 longer. Now, CDGC, as I understand it, is better written than our
 current GC and would probably improve things, but I don't see it as
 the final end goal.
To me, the final goal should be to make the GC swappable, that is, there should be a way to tell the compiler which GC you want to use.

After I got convinced about the benefits of GC by the article on garbage collection in DPLO, I started reading up a bit about different GC algorithms, and my conclusion is that there is no such thing as a one-size-fits-all GC. Every GC algo out there is sensitive to the kind of allocations the program makes, and to the memory/speed constraints you're running under. So the real solution to GC performance problems is to make the GC swappable.

Today, for instance, I found a book that talks about a hard real-time GC, which is very different from the paper I mentioned earlier. The point is, if you want hard real-time guarantees, it should be possible to tell the compiler to use a GC that gives you that; if you're OK with soft real-time, then you can tell the compiler to use something like CDGC. And if you're writing a batch-processing program, you can use a GC with a long stop-the-world pause but better overall throughput.

Anyway, I found the blog of the guy who implemented CDGC, and it seems that it did quite well performance-wise in the end. Apparently it's in an experimental D2 branch; has anyone tried to use it recently?

T

-- 
Questions are the beginning of intelligence, but the fear of God is the beginning of wisdom.
Feb 02 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-02-03 05:01, H. S. Teoh wrote:
 On Thu, Feb 02, 2012 at 08:48:20PM -0600, Robert Jacques wrote:
 [...]
 So to answer your question, yes, someone has made one of these types
 of GC for D called CDGC. No, it doesn't look like Windows will support
 this anytime soon.  And cloning GCs, don't solve the problems of large
 heaps, soft/hard realtime issues like game render threads, and it
 actually makes the embarrassingly long pause (when they happen)
 longer. Now, CDGC, as I understand it, is better written than our
 current GC and would probably improve things, but I don't see it as
 the final end goal.
 To me, the final goal should be to make the GC swappable, that is, there's a way to tell the compiler which GC you want to use.

 After I got convinced about the benefits of GC by the article on garbage collection in DPLO, I started reading up a bit about different GC algorithms, and my conclusion is that there is no such thing as a one-size-fits-all GC. Every GC algo out there is sensitive to the kind of allocations the program makes, and the memory/speed constraints you're running under. So the real solution to GC performance problems is to make it swappable.

 Today, for instance, I found a book that talks about a hard real-time GC, which is very different from the paper I mentioned earlier. The point is, if you want hard real-time guarantees, it should be possible to tell the compiler to use a GC that gives you that, and if you're OK with soft real-time, then you can tell the compiler to use something like CDGC. And if you're writing a batch-processing program you can use a GC with a long STW pause but better overall throughput.

 Anyway, I found the blog of the guy who implemented CDGC, and it seems that it did quite well performance-wise at the end. Apparently it's in an experimental D2 branch; anyone tried to use it recently?

 T
The GC is already swappable during linking.

-- 
/Jacob Carlborg
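For reference (hedged, as the exact layout has moved around between druntime versions): druntime keeps the collector behind a small set of gc_* entry points, and the source tree carries a stub implementation under src/gcstub that can be compiled in so its entry points take the place of the default collector, roughly like:

```shell
# Link a different GC implementation into the program instead of the
# default collector (paths are illustrative and vary by druntime version).
dmd app.d path/to/druntime/src/gcstub/gc.d
```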
Feb 03 2012
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Feb 03, 2012 at 09:30:36AM +0100, Jacob Carlborg wrote:
[...]
 The GC is already swappable during linking.
[...]

Is there documentation for this?

T

-- 
No! I'm not in denial!
Feb 03 2012
prev sibling parent reply "Jesse Phillips" <jessekphillips+D gmail.com> writes:
On Wednesday, 1 February 2012 at 18:33:58 UTC, Richard Webb wrote:
 Last night I tried loading a ~20 megabyte xml file using xmlp 
 (via the
 DocumentBuilder.LoadFile function) and a recent dmd build, and 
 found that it
 took ~48 seconds to complete, which is rather poor.
 I tried running it through a profiler, and that said that 
 almost all the
 runtime was spent inside the garbage collector.
Interesting. I've been using XmlVisitor on two 30-meg files and one 10-meg file, with a load time of 30 secs for all files; this is actually an improvement over an older version of xmlp, which took 60 secs. Not sure if DocumentBuilder would be much different.
Feb 01 2012
parent reply Richard Webb <webby beardmouse.org.uk> writes:
On 01/02/2012 20:02, Jesse Phillips wrote:
 Interesting, I've been using XmlVisitor on two 30meg and one 10meg, load
 time for all files being 30secs, this is actually an improvement from an
 older version of xmlp which took 60secs. Not sure if DocumentBuilder
 would be much different.
For reference, the file I was testing with has ~50000 root nodes, each of which has several children. The number of nodes seems to have a much larger effect on the speed than the amount of data.

I tried it with an old version of KXML and that seems somewhat faster, although the latest version doesn't seem to build with 2.058, and it's a more basic library.
Feb 01 2012
parent reply "dsimcha" <dsimcha yahoo.com> writes:
On Wednesday, 1 February 2012 at 22:53:11 UTC, Richard Webb wrote:
 For reference, the file i was testing with has ~50000 root 
 nodes, each of which has several children.
 The number of nodes seems to have a much larger effect on the 
 speed than the amount of data.
Sounds about right. For very small allocations sweeping time dominates the total GC time. You can see the breakdown at https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 . The Tree1 benchmark is the very small allocation benchmark. Sweeping takes time linear in the number of memory blocks allocated and, for blocks <1 page, constant time in the size of the blocks.
Feb 01 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 02, 2012 at 12:09:01AM +0100, dsimcha wrote:
 On Wednesday, 1 February 2012 at 22:53:11 UTC, Richard Webb wrote:
For reference, the file i was testing with has ~50000 root nodes,
each of which has several children.
The number of nodes seems to have a much larger effect on the
speed than the amount of data.
Sounds about right. For very small allocations sweeping time dominates the total GC time. You can see the breakdown at https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 . The Tree1 benchmark is the very small allocation benchmark. Sweeping takes time linear in the number of memory blocks allocated and, for blocks <1 page, constant time in the size of the blocks.
Out of curiosity, is there a way to optimize for the "many small allocations" case? E.g., if a function allocates, as temporary storage, a tree with a large number of nodes, which becomes garbage when it returns. Perhaps a way to sweep the entire space used by the tree in one go?

Not sure if such a thing is possible.

T

-- 
Tell me and I forget. Teach me and I remember. Involve me and I understand. -- Benjamin Franklin
Feb 01 2012
next sibling parent reply David Nadlinger <see klickverbot.at> writes:
On 2/2/12 12:44 AM, H. S. Teoh wrote:
 Out of curiosity, is there a way to optimize for the "many small
 allocations" case? E.g., if a function allocates, as temporary storage,
 a tree with a large number of nodes, which becomes garbage when it
 returns. Perhaps a way to sweep the entire space used by the tree in one
 go?
The easiest way would be to use a specialized allocator for this – I'm thinking of David Simcha's proposed RegionAllocator (formerly TempAlloc). David
Feb 01 2012
parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha has done a good work on the GC. After this change I have seen that some
of my GC-heavy programs are just a bit slower, while most of them are
significantly faster, so overall it's a good improvement. In my opinion this is
one of the best things that DMD 2.058 brings (short lambda syntax is another
one of them, I am appreciating it a lot. Sometimes syntax sugar matters).

Still, even with dsimcha improvements, I think the current D GC is almost a toy
compared to some of the more advanced GCs that you are able to find around. So
is it right to work on small (but useful) incremental improvements on the
current GC, instead of working on creating a fully redesigned D GC (this is a
large work)? I don't know. But surely even incremental improvements are
welcome, because there are (or there were) enough low hanging fruits to pick.

I don't know how the current D GC performance scales to heaps of 5 or 30 Gbytes
size, composed of mostly small objects. From what I've seen this is a hard
problem even for almost-state-of-the-art GCs.


David Nadlinger:

 The easiest way would be to use a specialized allocator for this – I'm 
 thinking of David Simcha's proposed RegionAllocator (formerly TempAlloc).
I'd like a simple way to define a hierarchy of such allocations (and when you free a node of the tree, the whole sub-tree gets deallocated). Bye, bearophile
Feb 01 2012
prev sibling parent reply "dsimcha" <dsimcha yahoo.com> writes:
On Wednesday, 1 February 2012 at 23:43:24 UTC, H. S. Teoh wrote:
 Out of curiosity, is there a way to optimize for the "many small
 allocations" case? E.g., if a function allocates, as temporary 
 storage,
 a tree with a large number of nodes, which becomes garbage when 
 it
 returns. Perhaps a way to sweep the entire space used by the 
 tree in one
 go?

 Not sure if such a thing is possible.


 T
My RegionAllocator is probably the best thing for this if the lifetime is deterministic as you describe. I rewrote the Tree1 benchmark using RegionAllocator a while back just for comparison. D Tree1 + RegionAllocator had comparable speed to a Java version of Tree1 run under HotSpot (about 6 seconds on my box vs. in the low 30s for Tree1 with the 2.058 GC).

If all the objects are going to die at the same time but not at a deterministic time, you could just allocate a big block from the GC and place class instances in it using emplace().
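The second suggestion can be sketched like this (a hedged illustration; Node is a made-up class, and the manual slicing assumes the instance size keeps each slot sufficiently aligned):

```d
import std.conv : emplace;

class Node
{
    int value;
    this(int v) { value = v; }
}

void main()
{
    enum count = 1000;
    enum size = __traits(classInstanceSize, Node);

    // One GC allocation backs every instance, so the collector tracks a
    // single block instead of `count` tiny ones.
    auto block = new void[](size * count);

    Node[] nodes;
    nodes.reserve(count);
    foreach (i; 0 .. count)
        nodes ~= emplace!Node(block[i * size .. (i + 1) * size], cast(int) i);

    assert(nodes[42].value == 42);
}
```

All the instances then die when the single block does, which suits the "same lifetime, nondeterministic time" case described above.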
Feb 01 2012
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Wed, 01 Feb 2012 18:40:20 -0600, dsimcha <dsimcha yahoo.com> wrote:
 On Wednesday, 1 February 2012 at 23:43:24 UTC, H. S. Teoh wrote:
 Out of curiosity, is there a way to optimize for the "many small
 allocations" case? E.g., if a function allocates, as temporary
 storage,
 a tree with a large number of nodes, which becomes garbage when
 it
 returns. Perhaps a way to sweep the entire space used by the
 tree in one
 go?

 Not sure if such a thing is possible.


 T
My RegionAllocator is probably the best thing for this if the lifetime is deterministic as you describe. I rewrote the Tree1 benchmark using RegionAllocator a while back just for comparison. D Tree1 + RegionAllocator had comparable speed to a Java version of Tree1 run under HotSpot. (About 6 seconds on my box vs. in the low 30s for Tree1 with the 2.058 GC.) If all the objects are going to die at the same time but not at a deterministic time, you could just allocate a big block from the GC and place class instances in it using emplace().
An XML parser would probably want some kind of stack segment growth schedule, which, IIRC, isn't supported by RegionAllocator.
Feb 01 2012
parent reply "dsimcha" <dsimcha yahoo.com> writes:
On Thursday, 2 February 2012 at 04:38:49 UTC, Robert Jacques 
wrote:
 An XML parser would probably want some kind of stack segment 
 growth schedule, which, IIRC isn't supported by RegionAllocator.
I had considered putting that in RegionAllocator but I was skeptical of the benefit, at least assuming we're targeting PCs and not embedded devices. The default segment size is 4MB. Trying to make the initial size any smaller won't save much memory. Four megabytes is also big enough that new segments would be allocated so infrequently that the cost would be negligible. I concluded that the added complexity wasn't justified.
Feb 02 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 2 February 2012 17:40, dsimcha <dsimcha yahoo.com> wrote:

 On Thursday, 2 February 2012 at 04:38:49 UTC, Robert Jacques wrote:

 An XML parser would probably want some kind of stack segment growth
 schedule, which, IIRC isn't supported by RegionAllocator.
at least assuming we're targeting PCs and not embedded devices.
I don't know about the implications of your decision, but that comment makes me feel uneasy. I don't know how you can possibly make that assumption. Have you looked around at the devices people actually use these days? PCs are an endangered and dying species... I couldn't imagine a worse assumption if it influences the application of D on different systems.
Feb 02 2012
parent reply "dsimcha" <dsimcha yahoo.com> writes:
On Thursday, 2 February 2012 at 18:06:24 UTC, Manu wrote:
 On 2 February 2012 17:40, dsimcha <dsimcha yahoo.com> wrote:

 On Thursday, 2 February 2012 at 04:38:49 UTC, Robert Jacques 
 wrote:

 An XML parser would probably want some kind of stack segment 
 growth
 schedule, which, IIRC isn't supported by RegionAllocator.
at least assuming we're targeting PCs and not embedded devices.
I don't know about the implications of your decision, but comment makes me feel uneasy. I don't know how you can possibly make that assumption? Have you looked around at the devices people actually use these days? PC's are an endangered and dying species... I couldn't imagine a worse assumption if it influences the application of D on different systems.
I'm not saying that embedded isn't important. It's just that for low level stuff like memory management it requires a completely different mindset. RegionAllocator is meant to be fast and simple at the expense of space efficiency. In embedded you'd probably want completely different tradeoffs. Depending on how deeply embedded, space efficiency might be the most important thing. I don't know exactly what tradeoffs you'd want, though, since I don't do embedded development. My guess is that you'd want something completely different, not RegionAllocator plus a few tweaks that would complicate it for PC use. Therefore, I designed RegionAllocator for PCs with no consideration for embedded environments.
Feb 02 2012
parent Manu <turkeyman gmail.com> writes:
On 2 February 2012 21:15, dsimcha <dsimcha yahoo.com> wrote:

 On Thursday, 2 February 2012 at 18:06:24 UTC, Manu wrote:

 On 2 February 2012 17:40, dsimcha <dsimcha yahoo.com> wrote:

  On Thursday, 2 February 2012 at 04:38:49 UTC, Robert Jacques wrote:
  An XML parser would probably want some kind of stack segment growth
 schedule, which, IIRC isn't supported by RegionAllocator.
at least assuming we're targeting PCs and not embedded devices.
I don't know about the implications of your decision, but comment makes me feel uneasy. I don't know how you can possibly make that assumption? Have you looked around at the devices people actually use these days? PC's are an endangered and dying species... I couldn't imagine a worse assumption if it influences the application of D on different systems.
I'm not saying that embedded isn't important. It's just that for low level stuff like memory management it requires a completely different mindset. RegionAllocator is meant to be fast and simple at the expense of space efficiency. In embedded you'd probably want completely different tradeoffs. Depending on how deeply embedded, space efficiency might be the most important thing. I don't know exactly what tradeoffs you'd want, though, since I don't do embedded development. My guess is that you'd want something completely different, not RegionAllocator plus a few tweaks that would complicate it for PC use. Therefore, I designed RegionAllocator for PCs with no consideration for embedded environments.
Okay, this reasoning seems sound to me. I wonder though: is it easy/possible/compatible to plug a new back end in for embedded systems? It should definitely conserve memory at all costs, but above all, it needs to be fast. Short term, embedded platforms will surely have some lenience with memory size, but demand high performance. What sort of overheads are we talking about for these allocators? I wonder what the minimum footprint of a D app would be, i.e. exe size + minimum memory allocation for the runtime/etc?
Feb 02 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, February 02, 2012 20:06:14 Manu wrote:
 On 2 February 2012 17:40, dsimcha <dsimcha yahoo.com> wrote:
 On Thursday, 2 February 2012 at 04:38:49 UTC, Robert Jacques wrote:
 An XML parser would probably want some kind of stack segment growth
 schedule, which, IIRC isn't supported by RegionAllocator.
at least assuming we're targeting PCs and not embedded devices.
I don't know about the implications of your decision, but that comment makes me feel uneasy. I don't know how you can possibly make that assumption? Have you looked around at the devices people actually use these days? PC's are an endangered and dying species... I couldn't imagine a worse assumption if it influences the application of D on different systems.
PCs are not endangered in the least. It's just that we're getting an increase in other devices (particularly smart phones and tablets). PCs are _heavily_ used, and there's no way that smart phones or tablets could replace them. They do different stuff. It _is_ true that applications are increasingly being written for non-PCs, but PCs definitely aren't dying off. Also, how much do you really treat smart phones or tablets like embedded devices rather than PCs? They're certainly more like PCs than the embedded devices of yore. True, they have stricter performance requirements, but they're nowhere near as restrictive as they used to be. - Jonathan M Davis
Feb 02 2012
prev sibling next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/2/12, Manu <turkeyman gmail.com> wrote:
 PC's are an endangered and dying species...
Kind of like when we got rid of cars and trains and ships once we started making jumbo jets. Oh wait, that didn't happen.
Feb 02 2012
parent reply "dsimcha" <dsimcha yahoo.com> writes:
On Thursday, 2 February 2012 at 18:55:02 UTC, Andrej Mitrovic 
wrote:
 On 2/2/12, Manu <turkeyman gmail.com> wrote:
 PC's are an endangered and dying species...
Kind of like when we got rid of cars and trains and ships once we started making jumbo jets. Oh wait, that didn't happen.
Agreed. I just recently got my first smartphone and I love it. I see it as a complement to a PC, though, not as a substitute. It's great for when I'm on the go, but when I'm at home or at work I like a bigger screen, a full keyboard, a faster processor, more memory, etc. Of course smartphones will get more powerful but I doubt any will ever have dual 22 inch monitors.
Feb 02 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 02, 2012 at 08:20:55PM +0100, dsimcha wrote:
 On Thursday, 2 February 2012 at 18:55:02 UTC, Andrej Mitrovic wrote:
On 2/2/12, Manu <turkeyman gmail.com> wrote:
PC's are an endangered and dying species...
Kind of like when we got rid of cars and trains and ships once we started making jumbo jets. Oh wait, that didn't happen.
Agreed. I just recently got my first smartphone and I love it. I see it as a complement to a PC, though, not as a substitute. It's great for when I'm on the go, but when I'm at home or at work I like a bigger screen, a full keyboard, a faster processor, more memory, etc. Of course smartphones will get more powerful but I doubt any will ever have dual 22 inch monitors.
[...] Funny you should mention that: a number of years ago I got to know a guy who worked in a lab that was researching organic polymer-based electronics, which lets you build things like screens and keyboards that can be rolled up or folded and put into your pocket. It was still experimental technology at the time, though, and I'm not expecting it to be publicly available anytime soon. But the day may come when your smartphone *can* literally have a 22" monitor (that folds into a pocket-sized card). T -- Famous last words: I wonder what will happen if I do *this*...
Feb 02 2012
next sibling parent "a" <a a.com> writes:
 But the day may come when your
 smartphone *can* literally have a 22" monitor (that folds into 
 a pocket
 sized card).


 T
This doesn't sound very practical.
Feb 02 2012
prev sibling next sibling parent Richard Webb <webby beardmouse.org.uk> writes:
On 02/02/2012 19:32, H. S. Teoh wrote:
 But the day may come when your
 smartphone *can* literally have a 22" monitor (that folds into a pocket
 sized card).
Nah, what you really need are those holographic displays they use on CSI Miami. No need for physical displays at all ;-)
Feb 02 2012
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02/02/2012 08:32 PM, H. S. Teoh wrote:
 On Thu, Feb 02, 2012 at 08:20:55PM +0100, dsimcha wrote:
 On Thursday, 2 February 2012 at 18:55:02 UTC, Andrej Mitrovic wrote:
 On 2/2/12, Manu<turkeyman gmail.com>  wrote:
 PC's are an endangered and dying species...
Kind of like when we got rid of cars and trains and ships once we started making jumbo jets. Oh wait, that didn't happen.
Agreed. I just recently got my first smartphone and I love it. I see it as a complement to a PC, though, not as a substitute. It's great for when I'm on the go, but when I'm at home or at work I like a bigger screen, a full keyboard, a faster processor, more memory, etc. Of course smartphones will get more powerful but I doubt any will ever have dual 22 inch monitors.
[...] Funny you should mention that, a number of years ago I got to know a guy who worked in a lab that was researching organic polymer-based electronics, which lets you build things like screens and keyboards that can be rolled up or folded and put into your pocket. It was still experimental technology at the time, though, and I'm not expecting it to be publicly available anytime soon. But the day may come when your smartphone *can* literally have a 22" monitor (that folds into a pocket sized card). T
Chances are that region allocator can run on mobile phones just fine at the time that happens. ;)
Feb 02 2012
prev sibling next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Long term I suspect the typical desktop experience will just be a phone or
other mobile device talking wirelessly to input and display peripherals.

On Feb 2, 2012, at 11:20 AM, "dsimcha" <dsimcha yahoo.com> wrote:

 On Thursday, 2 February 2012 at 18:55:02 UTC, Andrej Mitrovic wrote:
 On 2/2/12, Manu <turkeyman gmail.com> wrote:
 PC's are an endangered and dying species...
Kind of like when we got rid of cars and trains and ships once we started making jumbo jets. Oh wait, that didn't happen.
Agreed. I just recently got my first smartphone and I love it. I see it as a complement to a PC, though, not as a substitute. It's great for when I'm on the go, but when I'm at home or at work I like a bigger screen, a full keyboard, a faster processor, more memory, etc. Of course smartphones will get more powerful but I doubt any will ever have dual 22 inch monitors.
Feb 02 2012
parent "F i L" <witte2008 gmail.com> writes:
People and, more importantly, professionals will always want the 
most power they can get to do their work. Whether that power 
remains on the client system or is off-set to the home/office 
server or cloud is the big question.

I think the future will see an increase of local servers which 
intelligently allocate resources to transport-friendly modular 
client devices in both homes and businesses.
Feb 02 2012
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 02, 2012 at 01:22:54PM -0800, Sean Kelly wrote:
 Long term I suspect the typical desktop experience will just be a
 phone or other mobile device talking wirelessly to input and display
 peripherals.
[...] Funny, it comes full-circle to the good ole days of dumb X terminals that only handle input/display, and leave the actual computing to the server. Except now the server is no longer an oversized ceiling-high machine in the server room, but a tablet you can fit into your pocket. :) However, due to the inconvenience of needing to be near a bulky monitor and keyboard, I suspect rather that the mobile device will come with a built-in projector and foldable keyboard. The latter already exists; the former allows the convenience of usability anywhere there's a large flat surface. T -- Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be algorithms.
Feb 02 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 2 February 2012 21:20, dsimcha <dsimcha yahoo.com> wrote:

 On Thursday, 2 February 2012 at 18:55:02 UTC, Andrej Mitrovic wrote:

 On 2/2/12, Manu <turkeyman gmail.com> wrote:

 PC's are an endangered and dying species...
Kind of like when we got rid of cars and trains and ships once we started making jumbo jets. Oh wait, that didn't happen.
Agreed. I just recently got my first smartphone and I love it. I see it as a complement to a PC, though, not as a substitute. It's great for when I'm on the go, but when I'm at home or at work I like a bigger screen, a full keyboard, a faster processor, more memory, etc. Of course smartphones will get more powerful but I doubt any will ever have dual 22 inch monitors.
You're obviously a nerd though, hanging out in a place like this :) There is no small number of people who use their PCs purely for communication, Facebook, email, Skype, etc., and there are well-documented trends showing more and more people are performing those tasks exclusively from their phones/portable devices. They are now, already, ex-PC-users. That trend is accelerating (I wish I could remember where the slides were _<)
My girlfriend, for instance: I bought an iPad to do some dev on... she confiscated it one day after I showed her the eBook app, which she loved, and promptly threw away her whole hard copy library... Funnily though, and completely unconsciously on her part, I haven't seen her open her computer more than once or twice since that time. It has everything on it she ever used her computer for, it's more portable and convenient, fits in her bag (takes it to class), and it even has a recipe app and conveniently sits on the bench while she's baking :)

I have had the same discussion in the office, and I'm not the only one who has seen a similar transition first hand. Most people aren't computer nerds... they don't give 2 shits about computers, it's just a tool, and the second something more practical or convenient comes along for their purposes, they'll use that instead. That time has already passed, the trend is accelerating quickly, the snowball is rolling; even REAL nerds are starting to slip. (I must confess, I'm seriously tempted by a Transformer Prime...)

I think the MOST interesting part about this trend, though, is that many of these people who used their laptop/PC *exclusively* for communication have now discovered these mobile app stores, and they're being exposed to apps that do all sorts of things. Far more than they ever did on their PCs of days past. Even games; people who would say they have no interest in video games at all are picking up simple mobile/tablet games at an alarming rate.

This is a huge and fast-growing software market, almost all the software is written from scratch (read: developers *could* choose D if it were available), and D should be a part of it! The language landscape in this space is very turbulent at the moment, but it will settle, and D needs to be in it before it does, or it will miss the boat.
Feb 02 2012
prev sibling parent Manu <turkeyman gmail.com> writes:
On 2 February 2012 23:22, Sean Kelly <sean invisibleduck.org> wrote:

 Long term I suspect the typical desktop experience will just be a phone or
 other mobile device talking wirelessly to input and display peripherals.
'Long term'? I think you're being a bit pessimistic. It's already here! http://eee.asus.com/eeepad/transformer-prime/features/ (do want!) I give it... maybe 2-3 years before it's standard (a very significant portion of the hardware landscape). I don't think that's very long... certainly not a lot of time for D to get its shit together! ;)
Feb 02 2012
prev sibling parent Manu <turkeyman gmail.com> writes:
On 2 February 2012 20:13, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 On Thursday, February 02, 2012 20:06:14 Manu wrote:
 On 2 February 2012 17:40, dsimcha <dsimcha yahoo.com> wrote:
 On Thursday, 2 February 2012 at 04:38:49 UTC, Robert Jacques wrote:
 An XML parser would probably want some kind of stack segment growth
 schedule, which, IIRC isn't supported by RegionAllocator.
at least assuming we're targeting PCs and not embedded devices.
I don't know about the implications of your decision, but that comment makes
me
 feel uneasy.

 I don't know how you can possibly make that assumption? Have you looked
 around at the devices people actually use these days?
 PC's are an endangered and dying species... I couldn't imagine a worse
 assumption if it influences the application of D on different systems.
PCs are not endangered in the least. It's just that we're getting an increase in other devices (particularly smart phones and tablets). PCs are _heavily_ used, and there's no way that smart phones or tablets could replace them. They do different stuff. It _is_ true that applications are increasingly being written for non-PCs, but PCs definitely aren't dying off. Also, how much do you really treat smart phones or tablets like embedded devices rather than PCs? They're certainly more like PCs than the embedded devices of yore. True, they have stricter performance requirements, but they're nowhere near as restrictive as they used to be.
They're restrictive devices with weaker hardware, yet people generally seem to expect roughly the same experience from them as they get from their PC. A modern PC's software can be written in any language the programmer likes, with the expectation that the result will generally run well. Performance is almost a 'solved' problem on PCs for most applications.

I really just argue that D, a so-called systems language [this is surely D's primary offering/niche, or am I mistaken? It basically competes with C/C++, and on roughly the same terms], needs to constantly consider (and potentially even favour) the performance requirements of these weaker systems. They are the devices with the highest demand for efficiency placed on them.

For productivity applications, PCs will remain dominant, sure, but consider the number of productivity applications vs the number of entertainment products/games/apps/toys. The scope doesn't even compare. Also consider that productivity software is rarely written from scratch. Most productivity software people use is already written, only updated now, and isn't going to restart in D any time soon.

Games/apps/toys, in terms of the number of pieces of software developed and the number of developers writing such software (read: **NEW** software, where the choice of using a new language like D is a *real* possibility), completely blows the PC developer base SOOO far out of the water it's barely quantifiable, probably thousands to 1. That developer base isn't working on x86 platforms; they're targeting games consoles, phones, tablets... TVs will be big very soon. They are a large community, grown very tired of C/C++, who want a language that's not shit and that is exactly as efficient as C/C++ when used correctly. D fits this bill like a glove, and also has the advantage of being compatible with their existing libraries.

Embedded hardware toyed with the notion of using managed languages for a little while, not even offering C toolchains at first (Android, Windows Phone), but that failed, and both revoked that policy. The bar is rising fast on those platforms; everyone producing high-end products required native development to remain competitive. There's really not a lot of modern native-language competition out there. D is positioned well, and with this fact in mind, D's performance must be as comparable as possible to C/C++'s. In fact, D has potential for many performance *advantages* over C/C++, but the GC is a key concern for me personally... and if it's possible to tune it for these systems, it should be tuned. They require it more, and PC honestly won't care that much :)

Apologies for once again derailing an unrelated topic! :) I really just can't handle seeing this sort of presumption constantly popping up all over the newsgroup. This PC-centric mentality has to go if D is to have any commercial success. The commercial software industry moved on years ago; it's not on PC anymore... the D community needs to become intimately aware of that fact, whether they like it or not. (presuming of course that the intent here is for D to be successful :)
Feb 02 2012
prev sibling parent reply Richard Webb <richard.webb boldonjames.com> writes:
Out of interest, I just tried loading the same file with std.xml, and 
the performance there is pretty similar in each version, possibly 
slightly slower in 2.058 (~21-22 seconds in each case).

Disabling the GC during the load gets 9 seconds, though task manager 
reports a peak memory usage of almost 600 megabytes in that case!
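The disable-the-GC-during-the-load experiment above can be sketched as a small D pattern; `fakeParse` below is a hypothetical stand-in for the real xmlp/std.xml load, but the `GC.disable`/`scope (exit)`/`GC.collect` shape is the point:

```d
import core.memory : GC;

// Stand-in for an allocation-heavy parse, e.g. loading a ~20 MB XML
// file (hypothetical; the real test used xmlp and std.xml).
int[][] fakeParse(size_t n)
{
    int[][] nodes;
    nodes.reserve(n);
    foreach (i; 0 .. n)
        nodes ~= new int[8]; // many small GC allocations
    return nodes;
}

int[][] loadWithGCDisabled(size_t n)
{
    GC.disable();             // no collections while parsing...
    scope (exit) GC.enable(); // ...but re-enable even if parsing throws
    return fakeParse(n);
}
```

With collections suspended, memory peaks during the load (almost 600 MB in the test above); a single explicit `GC.collect()` afterwards reclaims the parse temporaries in one pass instead of many mid-parse collections.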

It looks like most of the time here is spent in Gcxmark whereas with 
xmlp it was in Gcxfullcollect (and fullcollect is the one that is faster 
in 2.058).
The profiler makes it look like things are spending more time in Gcxmark 
than they were before. Is that the case?


I'll try to have a go with Tango when I get some more time.
Feb 02 2012
parent Michael Rynn <michaelrynn optusnet.com.au> writes:
On Thu, 02 Feb 2012 15:44:56 +0000, Richard Webb wrote:

 
With the xml package (xmlp) and the linked node DOM, the GC is likely to fail to clean up. I divided the generated test file, with its 2-layer elements, into files of 5, 50, 500 and 5000 elements. I put a mixin on the Node class to keep static counts of Node objects created and deleted. I also set the document to be read multiple times, calling GC.collect between each read, and timed it. The increase in read time, and the increase in GC.collect time, was strong evidence that the GC was unable to clean it up, even though, on my naive code inspection, the previous document instance was no longer reachable from GC "roots", i.e. it should have been directly replaced on the stack.

The linkdom rather too faithfully copies the Java DOM model, with doubly-linked pointers. Elements have an "owner document" (I felt like ditching this). All element children derive from ChildNode (Element, Text, CDATA and Comment), and have a pointer to the parent_ node (could ditch this, maybe for CharData). Anyway, it's a pretty linked-up, conventional model. Java's GC has evidently got around the problems of collecting XML documents. When I looked at the C code for libxml2, for instance, all the Nodes have that sort of interlocked linkage. I tried once to see how far I could get (on Windows) calling libxml2 from D. I did not get very far, and gave up after crashes and a headache.

On the GC cleanup stats: the generated test with 5 nodes was cleanable, but at 50,000, well, the poor GC could not delete a single Node object. I've only done this with dmd2_32 on Windows; 64-bit may be different. It's also the GitHub release of 2.058 - 162, and may not be in a releasable state for GC.

The original std.xml seemed to fare best on GC cleanup, but still ran into trouble at each larger document size. std.xmlp.arraydom was similar. Both of these have no backpointers to parents, and use an array to hold children. I thought of just nulling the backpointers, and wrote a version of an explode method.
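The explode idea — nulling the backpointers so the cycle-laden tree falls apart into small, individually unreferenced pieces the GC can reclaim — might look roughly like this (a hypothetical sketch; the thread does not show xmlp's actual code):

```d
/// Hypothetical doubly-linked DOM node, as in linkdom: children
/// point back at their parent, creating reference cycles that a
/// conservative GC may fail to isolate and collect.
class Node
{
    Node parent;
    Node[] children;

    Node appendChild(Node c)
    {
        c.parent = this;
        children ~= c;
        return c;
    }

    /// "Explode": break every link so the tree falls apart into
    /// non-referencing pieces the GC can manage one by one.
    void explode()
    {
        foreach (c; children)
            c.explode();
        children = null;
        parent = null;
    }
}
```

Calling `root.explode()` when a document is no longer needed leaves no parent or child links behind, so the next collection sees only isolated objects instead of one big interlocked graph.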
I tried this with the node version doing, or not doing, a delete. With a big document, it currently has to do a delete, or the GC will not clean up everything.

I turned the GC stats counters into a mixin template and put them on std.xml1, on std.arraydom, and on any intermediate objects from the parsing that I wanted to confirm were not hanging around. For too many of them, I found it necessary to chase them down and force a delete (I now prefer explode, because delete implies a complete cleanup, and for me, explode means to split into little non-referencing pieces that the GC can manage). This means I use too many pointer-linked structures for the GC. It may be just as difficult for the red-black tree in std.container; I haven't checked.

For me one lesson, at least, is: do not assume a set of interlocked classes with circular references can always be isolated in a single full GC collection and properly deleted. For internal temporaries that will never be seen by external code, it seems a good idea to explode them. For large complex tree structures, it might take less CPU time to explode them than to have this particular version of the GC work it out with less information. Do explode again: after exploding, repeated runs and bigger loads ran optimally.

So I read with interest the other posts on the mechanisms of GC and how to integrate with the compiler to tag classes and stack layout, to know pointers precisely. I particularly liked the suggestion of a mixin on each class for GC layout. What could be done, even ad hoc, is to do GC stats on various kinds of complex self-referencing classes, to examine the effectiveness of GC collection. This could even be a compiler flag, or version. It was useful for me to spot where I could reuse a class rather than construct a new one. A lot of D applications may be run once, and GC doesn't matter so much. But for long-running games and servers, verified GC behaviour will bring more confidence.
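The instance-counting mixin template described above might look roughly like this (names are invented; whether `destroyed` ever catches up with `created` after a GC.collect is exactly what the measurement reveals):

```d
/// Hypothetical instance-stats mixin: mix into any class whose
/// creation/destruction counts you want to track.
mixin template InstanceStats()
{
    static size_t created;
    static size_t destroyed;

    static size_t live() { return created - destroyed; }

    // Call these from the host class's constructor/destructor.
    protected void statCreated()   { ++created; }
    protected void statDestroyed() { ++destroyed; }
}

class CountedNode
{
    mixin InstanceStats;

    this()  { statCreated(); }
    ~this() { statDestroyed(); } // runs only if the GC finalizes it
}
```

If `CountedNode.live()` keeps growing across reads and explicit `GC.collect()` calls even though each previous document is unreachable, the collector is failing to reclaim the graph — the evidence described in the post.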
Aggressive measurement may allow better detection of where and when the GC fails, and better assessment of the effects of GC code changes. GC is a tough problem, and a hard one to base a whole language around.
Feb 25 2012