
digitalmars.D - [xmlp] the recent garbage collector performance improvements

reply Richard Webb <webby beardmouse.org.uk> writes:
Last night I tried loading a ~20 megabyte xml file using xmlp (via the
DocumentBuilder.LoadFile function) and a recent dmd build, and found that it
took ~48 seconds to complete, which is rather poor.
I tried running it through a profiler, and that said that almost all the
runtime was spent inside the garbage collector.

I then tried the same test using the latest Git versions of dmd/druntime (with
pull request 108 merged in), and that took less than 10 seconds. This is a
rather nice improvement, though still somewhat on the slow side.

Some profiler numbers, if anyone is interested:

Old version:
Gcxfullcollect: 31.14 seconds, 69.26% runtime.
Gcxmark: 4.84 seconds, 10.77% runtime.
Gcxfindpool: 2.10 seconds, 4.67% runtime.

New version:
Gcxmark: 11.67 seconds, 50.77% runtime.
Gcxfindpool: 3.58 seconds, 15.55% runtime.
Gcxfullcollect: 1.69 seconds, 7.37% runtime.

(Assuming that Sleepy is giving me accurate numbers. The new version is
definitely faster though).
Feb 01 2012
next sibling parent reply "dsimcha" <dsimcha yahoo.com> writes:
Interesting.  I'm glad my improvements seem to matter in the real 
world, though I'm thoroughly impressed with the amount of 
speedup.  Even the small allocation benchmark that I was 
optimizing only sped up by ~50% from 2.057 to 2.058 overall and 
~2x in collection time.  I'd be very interested if you could make 
a small, self-contained test program to use as a benchmark.

GC performance is one of D's biggest weak spots, so it would 
probably be a good bit of marketing to show that the performance 
is substantially better than it used to be even if it's not great 
yet.  Over the past year I've been working on and off at speeding 
it up.  It's now at least ~2x faster than it was last year at 
this time on every benchmark I've tried and up to several hundred 
times faster in the extreme case of huge allocations.

On Wednesday, 1 February 2012 at 18:33:58 UTC, Richard Webb wrote:
 Last night I tried loading a ~20 megabyte xml file using xmlp 
 (via the
 DocumentBuilder.LoadFile function) and a recent dmd build, and 
 found that it
 took ~48 seconds to complete, which is rather poor.
 I tried running it through a profiler, and that said that 
 almost all the
 runtime was spent inside the garbage collector.

 I then tried the same test using the latest Git versions of 
 dmd/druntime (with
 pull request 108 merged in), and that took less than 10 
 seconds. This is a
 rather nice improvement, though still somewhat on the slow side.

 Some profiler numbers, if anyone is interested:

 Old version:
 Gcxfullcollect: 31.14 seconds, 69.26% runtime.
 Gcxmark: 4.84 seconds, 10.77% runtime.
 Gcxfindpool: 2.10 seconds, 4.67% runtime.

 New version:
 Gcxmark: 11.67 seconds, 50.77% runtime.
 Gcxfindpool: 3.58 seconds, 15.55% runtime.
 Gcxfullcollect: 1.69 seconds, 7.37% runtime.

 (Assuming that Sleepy is giving me accurate numbers. The new 
 version is
 definitely faster though).
Feb 01 2012
next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
XML is a prime example of a shallow tree structure,
so it's the prime target for the recent optimization.
Feb 01 2012
prev sibling parent reply Richard Webb <webby beardmouse.org.uk> writes:
On 01/02/2012 19:35, dsimcha wrote:
 I'd be very
 interested if you could make a small, self-contained test program to use
 as a benchmark.
The 'test' is just

/////////////////
import std.xml2;

void main()
{
    string xmlPath = r"test.xml";
    auto document = DocumentBuilder.LoadFile(xmlPath, false, false);
}
/////////////////

It's xmlp that does all the work (and takes all the time).

I'll see about generating a simple test file, but basically:

50000 top level nodes
each one has 6 child nodes
each node has a single attribute, and the child nodes each have a short text value.

Parsing the file with DMD 2.057 takes ~25 seconds

Parsing the file with DMD 2.058 (Git) takes ~6.1 seconds

Parsing the file with DMD 2.058, with the GC disabled during the LoadFile call, takes ~2.2 seconds.

For comparison, MSXML6 takes 1.6 seconds to load the same file.
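For anyone wanting to reproduce the "GC disabled during the LoadFile call" timing, the pattern is just core.memory's GC.disable/GC.enable around the parse. A minimal self-contained sketch (buildNodes here is a made-up stand-in for the allocation-heavy parse, since xmlp isn't part of Phobos):

```d
import core.memory : GC;

// Stand-in for an allocation-heavy parse such as DocumentBuilder.LoadFile;
// it makes many small GC allocations, much like building a DOM tree does.
int[][] buildNodes(size_t count)
{
    auto nodes = new int[][](count);
    foreach (ref n; nodes)
        n = new int[](4);
    return nodes;
}

void main()
{
    GC.disable();              // suspend collections during the parse
    scope (exit) GC.enable();  // re-enable even if the parse throws

    auto nodes = buildNodes(50_000);
    assert(nodes.length == 50_000);
}
```

With collections suspended the heap can only grow during the call, which is a reasonable trade when the result is long-lived anyway (as a loaded DOM is).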
Feb 01 2012
next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 02.02.2012, 01:41 Uhr, schrieb Richard Webb <webby beardmouse.org.uk>:

 Parsing the file with DMD 2.057 takes ~25 seconds

 Parsing the file with DMD 2.058(Git) takes ~6.1 seconds

 Parsing the file with DMD 2.058, with the GC disabled during the  
 LoadFile call, takes ~2.2 seconds.


 For comparison, MSXML6 takes 1.6 seconds to load the same file.
Speaking of which, why not also compare the memory consumption (peak working set size)?

Memory vs. CPU is the typical trade-off, so it might be interesting from that point, but I also wonder what the overhead for GC-managed memory is - assuming that MSXML6 uses only ref counting. And if it is a whole lot more (like >+50%), what methods could apply that 'waste' less space.

This API function should get the job done: GetProcessMemoryInfo
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683219(v=vs.85).aspx
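That call is usable from D directly. A hedged sketch via druntime's Windows bindings (module paths assume a reasonably recent druntime, and the program must be linked against psapi.lib):

```d
version (Windows)
{
    import core.sys.windows.windows : GetCurrentProcess;
    import core.sys.windows.psapi : GetProcessMemoryInfo, PROCESS_MEMORY_COUNTERS;
    import std.stdio : writefln;

    void reportPeakWorkingSet()
    {
        PROCESS_MEMORY_COUNTERS pmc;
        // Fills pmc with the current process's memory counters,
        // including the peak working set size in bytes.
        if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, pmc.sizeof))
            writefln("peak working set: %s MiB",
                     pmc.PeakWorkingSetSize / (1024 * 1024));
    }
}
```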
Feb 01 2012
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Richard Webb:

 Parsing the file with DMD 2.057 takes ~25 seconds
 
 Parsing the file with DMD 2.058(Git) takes ~6.1 seconds
 
 Parsing the file with DMD 2.058, with the GC disabled during the 
 LoadFile call, takes ~2.2 seconds.
 
 
 For comparison, MSXML6 takes 1.6 seconds to load the same file.
Not long ago the Python devs added a heuristic to the Python GC (which is a reference counter plus a cycle breaker): it "switches off" if it detects that the program is allocating many items in a short time. Is it possible to add something similar to the D GC?

Bye,
bearophile
Feb 01 2012
parent "dsimcha" <dsimcha yahoo.com> writes:
On Thursday, 2 February 2012 at 01:27:44 UTC, bearophile wrote:
 Richard Webb:

 Parsing the file with DMD 2.057 takes ~25 seconds
 
 Parsing the file with DMD 2.058(Git) takes ~6.1 seconds
 
 Parsing the file with DMD 2.058, with the GC disabled during 
 the LoadFile call, takes ~2.2 seconds.
 
 
 For comparison, MSXML6 takes 1.6 seconds to load the same file.
Not long ago the Python devs added a heuristic to the Python GC (which is a reference counter plus a cycle breaker): it "switches off" if it detects that the program is allocating many items in a short time. Is it possible to add something similar to the D GC?

Bye,
bearophile
I actually tried to add something like this a while back, but I couldn't find a heuristic that worked reasonably well. The idea was just to create a timeout so that the GC can't run for x milliseconds after it has just run.
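That heuristic can be sketched as a simple throttle (hypothetical code, not the actual druntime patch; the 50 ms window is an arbitrary placeholder for the "x milliseconds" knob):

```d
import core.time : Duration, MonoTime, msecs;

// Hypothetical throttle: after a collection, refuse to collect again
// for a fixed window, forcing the allocator to grow the heap instead.
struct CollectThrottle
{
    MonoTime lastCollect;
    Duration minInterval = 50.msecs;

    bool mayCollect()
    {
        immutable now = MonoTime.currTime;
        if (lastCollect != MonoTime.init && now - lastCollect < minInterval)
            return false;  // collected too recently: skip this collection
        lastCollect = now;
        return true;
    }
}

void main()
{
    CollectThrottle t;
    assert(t.mayCollect());   // first collection is allowed
    assert(!t.mayCollect());  // an immediate retry is throttled
}
```

The hard part, as noted above, is picking a window that helps allocation-heavy phases without letting the heap balloon in steady-state programs.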
Feb 01 2012
prev sibling parent reply "dsimcha" <dsimcha yahoo.com> writes:
Wait a minute, since when do we even have a std.xml2?  I've never 
heard of it and it's not in the Phobos source tree (I just 
checked).

On Thursday, 2 February 2012 at 00:41:31 UTC, Richard Webb wrote:
 On 01/02/2012 19:35, dsimcha wrote:
 I'd be very
 interested if you could make a small, self-contained test 
 program to use
 as a benchmark.
The 'test' is just

/////////////////
import std.xml2;

void main()
{
    string xmlPath = r"test.xml";
    auto document = DocumentBuilder.LoadFile(xmlPath, false, false);
}
/////////////////

It's xmlp that does all the work (and takes all the time).

I'll see about generating a simple test file, but basically:

50000 top level nodes
each one has 6 child nodes
each node has a single attribute, and the child nodes each have a short text value.

Parsing the file with DMD 2.057 takes ~25 seconds

Parsing the file with DMD 2.058 (Git) takes ~6.1 seconds

Parsing the file with DMD 2.058, with the GC disabled during the LoadFile call, takes ~2.2 seconds.

For comparison, MSXML6 takes 1.6 seconds to load the same file.
Feb 01 2012
next sibling parent reply "a" <a a.com> writes:
It looks like someone wrote it a year ago, but it was never added 
to phobos: 
http://www.digitalmars.com/d/archives/digitalmars/D/announce/std.xml2_candidate_19804.html

On Thursday, 2 February 2012 at 04:39:11 UTC, dsimcha wrote:
 Wait a minute, since when do we even have a std.xml2?  I've 
 never heard of it and it's not in the Phobos source tree (I 
 just checked).

 On Thursday, 2 February 2012 at 00:41:31 UTC, Richard Webb 
 wrote:
 On 01/02/2012 19:35, dsimcha wrote:
 I'd be very
 interested if you could make a small, self-contained test 
 program to use
 as a benchmark.
The 'test' is just

/////////////////
import std.xml2;

void main()
{
    string xmlPath = r"test.xml";
    auto document = DocumentBuilder.LoadFile(xmlPath, false, false);
}
/////////////////

It's xmlp that does all the work (and takes all the time).

I'll see about generating a simple test file, but basically:

50000 top level nodes
each one has 6 child nodes
each node has a single attribute, and the child nodes each have a short text value.

Parsing the file with DMD 2.057 takes ~25 seconds

Parsing the file with DMD 2.058 (Git) takes ~6.1 seconds

Parsing the file with DMD 2.058, with the GC disabled during the LoadFile call, takes ~2.2 seconds.

For comparison, MSXML6 takes 1.6 seconds to load the same file.
Feb 01 2012
parent Richard Webb <richard.webb boldonjames.com> writes:
On 02/02/2012 04:53, a wrote:
 It looks like someone wrote it a year ago, but it was never added to
 phobos:
 http://www.digitalmars.com/d/archives/digitalmars/D/announce/std.xml2_candidate_19804.html
 .
That's the one -> http://www.dsource.org/projects/xmlp/ (the modules are under std.).
Feb 02 2012
prev sibling parent reply "Jesse Phillips" <jessekphillips+D gmail.com> writes:
On Thursday, 2 February 2012 at 04:39:11 UTC, dsimcha wrote:
 Wait a minute, since when do we even have a std.xml2?  I've 
 never heard of it and it's not in the Phobos source tree (I 
 just checked).
As others point out, it is xmlp. I've been using it for the last year or two, and recently saw an improvement in performance (as mentioned).

I don't know how comparable our cases are: Richard seems to see a slowdown from the GC, but for me disabling the GC during load doesn't change the load time (though I'm not using the document loader).

I hope Michael will put some effort into getting it ready for review. But still, people should use it and give him some incentive to do the work.
Feb 02 2012
next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, February 02, 2012 18:11:34 Jesse Phillips wrote:
 On Thursday, 2 February 2012 at 04:39:11 UTC, dsimcha wrote:
 Wait a minute, since when do we even have a std.xml2? I've
 never heard of it and it's not in the Phobos source tree (I
 just checked).
As others point out, it is xmlp. I've been using it for the last year or two, and recently saw an improvement in performance (as mentioned).

I don't know how comparable our cases are: Richard seems to see a slowdown from the GC, but for me disabling the GC during load doesn't change the load time (though I'm not using the document loader).

I hope Michael will put some effort into getting it ready for review. But still, people should use it and give him some incentive to do the work.
In theory, Tomek Sowiński was working on a replacement for std.xml, and it was the likely candidate to replace it, but he seems to have disappeared... His last post, back in June, was about an XML writer. So I don't know what the current situation with that is.

- Jonathan M Davis
Feb 02 2012
prev sibling parent reply Richard Webb <webby beardmouse.org.uk> writes:
On 02/02/2012 17:11, Jesse Phillips wrote:
 for me disabling the GC during load doesn't change load time,
 but I'm not using the document loader.
The GC hit is related to the number of DOM nodes that exist at once, I think -> the visitor approach doesn't allocate the whole DOM tree, so there are far fewer items (and less allocated memory for the GC to scan). For comparison, parsing my test file using XmlVisitor takes less than 3 seconds (over twice as fast as the DOM version).

I looked into it a bit and found this: http://www.dsource.org/projects/xmlp/ticket/10

Seems that it's calling GC.qmalloc 350000+ times, mostly for 16-byte blocks, when normalizing attributes. This doesn't seem hugely clever :-(
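A common way to avoid that per-attribute allocation pattern (a hypothetical sketch, not xmlp's actual code) is to normalize into one reused buffer and allocate only the final, exactly-sized string:

```d
import std.array : Appender, appender;

// Hypothetical sketch: collapse runs of whitespace into a reused buffer
// instead of GC-allocating a fresh small block per attribute.
string normalizeAttr(ref Appender!(char[]) buf, const(char)[] raw)
{
    buf.clear();  // reuse the buffer's existing capacity
    bool lastWasSpace = false;
    foreach (c; raw)
    {
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
        {
            if (!lastWasSpace)
                buf.put(' ');
            lastWasSpace = true;
        }
        else
        {
            buf.put(c);
            lastWasSpace = false;
        }
    }
    return buf.data.idup;  // one allocation, sized exactly to the result
}

void main()
{
    auto buf = appender!(char[])();
    assert(normalizeAttr(buf, "a \t b") == "a b");
}
```

This turns hundreds of thousands of 16-byte GC blocks into one reusable buffer plus one right-sized allocation per attribute value.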
Feb 02 2012
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
Speaking of GC improvements, I googled around a bit and found this
thread from 2 years ago:

	http://www.digitalmars.com/d/archives/digitalmars/D/Is_there_a_modern_GC_for_D_105967.html

Especially interesting is this paper:

	Nonintrusive Cloning Garbage Collector with Stock Operating
	System Support.
	        http://www.cs.purdue.edu/homes/grr/snapshot-gc.ps

Has anything been done along these lines since that time? Seems like
this particular GC algorithm is exactly the kind we need for D. It's a
conservative mark-and-sweep algo with a very low overhead(*), mark phase
concurrent with mutator thread(s), and lazy incremental sweeping at
allocation time.  Synchronization is automatically done by default OS
kernel-space mechanisms (copy on write memory pages).

More to the point, how easy/hard is it to switch between GCs in the
current D implementation(s)? I think it would be helpful if these kinds
of experimental GCs were available in addition to the current default
GC, and people can play around with them and find out which one(s) are
the cream of the crop. Otherwise we're just bottlenecked at a small
number of people who can actually play with GC algos for D -- which
means improvements will be slow.


(*) True on *nix systems anyway, but judging from comments in
that thread, Windoze also was acquiring equivalent functionality -- and
this being 2 years ago, I'm assuming it's available now.


--T
Feb 02 2012
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Thu, 02 Feb 2012 19:31:50 -0600, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:

 Speaking of GC improvements, I googled around a bit and found this
 thread from 2 years ago:

 	http://www.digitalmars.com/d/archives/digitalmars/D/Is_there_a_modern_GC_for_D_105967.html

 Especially interesting is this paper:

 	Nonintrusive Cloning Garbage Collector with Stock Operating
 	System Support.
 	        http://www.cs.purdue.edu/homes/grr/snapshot-gc.ps

 Has anything been done along these lines since that time? Seems like
 this particular GC algorithm is exactly the kind we need for D. It's a
 conservative mark-and-sweep algo with a very low overhead(*), mark phase
 concurrent with mutator thread(s), and lazy incremental sweeping at
 allocation time.  Synchronization is automatically done by default OS
 kernel-space mechanisms (copy on write memory pages).

 More to the point, how easy/hard is it to switch between GCs in the
 current D implementation(s)? I think it would be helpful if these kinds
 of experimental GCs were available in addition to the current default
 GC, and people can play around with them and find out which one(s) are
 the cream of the crop. Otherwise we're just bottlenecked at a small
 number of people who can actually play with GC algos for D -- which
 means improvements will be slow.


 (*) True on *nix systems anyway, but judging from comments in
 that thread, Windoze also was acquiring equivalent functionality -- and
 this being 2 years ago, I'm assuming it's available now.
So to answer your question: yes, someone has made one of these types of GC for D, called CDGC. No, it doesn't look like Windows will support this anytime soon. And cloning GCs don't solve the problems of large heaps or the soft/hard realtime issues of things like game render threads, and they actually make the embarrassingly long pauses (when they happen) longer.

Now, CDGC, as I understand it, is better written than our current GC and would probably improve things, but I don't see it as the final end goal.
Feb 02 2012
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 02, 2012 at 08:48:20PM -0600, Robert Jacques wrote:
[...]
 So to answer your question, yes, someone has made one of these types
 of GC for D called CDGC. No, it doesn't look like Windows will support
 this anytime soon.  And cloning GCs, don't solve the problems of large
 heaps, soft/hard realtime issues like game render threads, and it
 actually makes the embarrassingly long pause (when they happen)
 longer. Now, CDGC, as I understand it, is better written than our
 current GC and would probably improve things, but I don't see it as
 the final end goal.
To me, the final goal should be to make the GC swappable, that is, there should be a way to tell the compiler which GC you want to use.

After I got convinced about the benefits of GC by the article on garbage collection in DPLO, I started reading up a bit about different GC algorithms, and my conclusion is that there is no such thing as a one-size-fits-all GC. Every GC algo out there is sensitive to the kind of allocations the program makes, and to the memory/speed constraints you're running under. So the real solution to GC performance problems is to make the GC swappable.

Today, for instance, I found a book that talks about a hard real-time GC, which is very different from the paper I mentioned earlier. The point is, if you want hard real-time guarantees, it should be possible to tell the compiler to use a GC that gives you that; if you're OK with soft real-time, then you can tell the compiler to use something like CDGC. And if you're writing a batch-processing program, you can use a GC with a long stop-the-world pause but better overall throughput.

Anyway, I found the blog of the guy who implemented CDGC, and it seems that it did quite well performance-wise in the end. Apparently it's in an experimental D2 branch; has anyone tried to use it recently?

T

-- 
Questions are the beginning of intelligence, but the fear of God is the beginning of wisdom.
Feb 02 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-02-03 05:01, H. S. Teoh wrote:
 On Thu, Feb 02, 2012 at 08:48:20PM -0600, Robert Jacques wrote:
 [...]
 So to answer your question, yes, someone has made one of these types
 of GC for D called CDGC. No, it doesn't look like Windows will support
 this anytime soon.  And cloning GCs, don't solve the problems of large
 heaps, soft/hard realtime issues like game render threads, and it
 actually makes the embarrassingly long pause (when they happen)
 longer. Now, CDGC, as I understand it, is better written than our
 current GC and would probably improve things, but I don't see it as
 the final end goal.
 To me, the final goal should be to make the GC swappable, that is, there's a way to tell the compiler which GC you want to use.

 After I got convinced about the benefits of GC by the article on garbage collection in DPLO, I started reading up a bit about different GC algorithms, and my conclusion is that there is no such thing as a one-size-fits-all GC. Every GC algo out there is sensitive to the kind of allocations the program makes, and the memory/speed constraints you're running under. So the real solution to GC performance problems is to make it swappable.

 Today, for instance, I found a book that talks about a hard real-time GC, which is very different from the paper I mentioned earlier. The point is, if you want hard real-time guarantees, it should be possible to tell the compiler to use a GC that gives you that, and if you're OK with soft real-time, then you can tell the compiler to use something like CDGC. And if you're writing a batch-processing program you can use a GC with a long STW pause but better overall throughput.

 Anyway, I found the blog of the guy who implemented CDGC, and it seems that it did quite well performance-wise at the end. Apparently it's in an experimental D2 branch; anyone tried to use it recently?

 T
The GC is already swappable during linking.

-- 
/Jacob Carlborg
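For reference (hedged, as the exact layout has moved around between druntime versions): druntime keeps the collector behind a small set of gc_* entry points, and the source tree carries a stub implementation under src/gcstub that can be compiled in so its entry points take the place of the default collector, roughly like:

```shell
# Link a different GC implementation into the program instead of the
# default collector (paths are illustrative and vary by druntime version).
dmd app.d path/to/druntime/src/gcstub/gc.d
```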
Feb 03 2012
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Feb 03, 2012 at 09:30:36AM +0100, Jacob Carlborg wrote:
[...]
 The GC is already swappable during linking.
[...]

Is there documentation for this?

T

-- 
No! I'm not in denial!
Feb 03 2012
prev sibling parent reply "Jesse Phillips" <jessekphillips+D gmail.com> writes:
On Wednesday, 1 February 2012 at 18:33:58 UTC, Richard Webb wrote:
 Last night I tried loading a ~20 megabyte xml file using xmlp 
 (via the
 DocumentBuilder.LoadFile function) and a recent dmd build, and 
 found that it
 took ~48 seconds to complete, which is rather poor.
 I tried running it through a profiler, and that said that 
 almost all the
 runtime was spent inside the garbage collector.
Interesting. I've been using XmlVisitor on two 30-meg files and one 10-meg file, with a load time of 30 secs for all files; this is actually an improvement over an older version of xmlp, which took 60 secs. Not sure if DocumentBuilder would be much different.
Feb 01 2012
parent reply Richard Webb <webby beardmouse.org.uk> writes:
On 01/02/2012 20:02, Jesse Phillips wrote:
 Interesting, I've been using XmlVisitor on two 30meg and one 10meg, load
 time for all files being 30secs, this is actually an improvement from an
 older version of xmlp which took 60secs. Not sure if DocumentBuilder
 would be much different.
For reference, the file I was testing with has ~50000 root nodes, each of which has several children. The number of nodes seems to have a much larger effect on the speed than the amount of data.

I tried it with an old version of KXML and that seems somewhat faster, although the latest version doesn't seem to build with 2.058, and it's a more basic library.
Feb 01 2012
parent reply "dsimcha" <dsimcha yahoo.com> writes:
On Wednesday, 1 February 2012 at 22:53:11 UTC, Richard Webb wrote:
 For reference, the file i was testing with has ~50000 root 
 nodes, each of which has several children.
 The number of nodes seems to have a much larger effect on the 
 speed than the amount of data.
Sounds about right. For very small allocations sweeping time dominates the total GC time. You can see the breakdown at https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 . The Tree1 benchmark is the very small allocation benchmark. Sweeping takes time linear in the number of memory blocks allocated and, for blocks <1 page, constant time in the size of the blocks.
Feb 01 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 02, 2012 at 12:09:01AM +0100, dsimcha wrote:
 On Wednesday, 1 February 2012 at 22:53:11 UTC, Richard Webb wrote:
For reference, the file i was testing with has ~50000 root nodes,
each of which has several children.
The number of nodes seems to have a much larger effect on the
speed than the amount of data.
Sounds about right. For very small allocations sweeping time dominates the total GC time. You can see the breakdown at https://github.com/dsimcha/druntime/wiki/GC-Optimizations-Round-2 . The Tree1 benchmark is the very small allocation benchmark. Sweeping takes time linear in the number of memory blocks allocated and, for blocks <1 page, constant time in the size of the blocks.
Out of curiosity, is there a way to optimize for the "many small allocations" case? E.g., if a function allocates, as temporary storage, a tree with a large number of nodes, which becomes garbage when it returns. Perhaps a way to sweep the entire space used by the tree in one go?

Not sure if such a thing is possible.

T

-- 
Tell me and I forget. Teach me and I remember. Involve me and I understand. -- Benjamin Franklin
Feb 01 2012
next sibling parent reply David Nadlinger <see klickverbot.at> writes:
On 2/2/12 12:44 AM, H. S. Teoh wrote:
 Out of curiosity, is there a way to optimize for the "many small
 allocations" case? E.g., if a function allocates, as temporary storage,
 a tree with a large number of nodes, which becomes garbage when it
 returns. Perhaps a way to sweep the entire space used by the tree in one
 go?
The easiest way would be to use a specialized allocator for this – I'm thinking of David Simcha's proposed RegionAllocator (formerly TempAlloc). David
Feb 01 2012
parent bearophile <bearophileHUGS lycos.com> writes:
dsimcha has done a good work on the GC. After this change I have seen that some
of my GC-heavy programs are just a bit slower, while most of them are
significantly faster, so overall it's a good improvement. In my opinion this is
one of the best things that DMD 2.058 brings (short lambda syntax is another
one of them, I am appreciating it a lot. Sometimes syntax sugar matters).

Still, even with dsimcha improvements, I think the current D GC is almost a toy
compared to some of the more advanced GCs that you are able to find around. So
is it right to work on small (but useful) incremental improvements on the
current GC, instead of working on creating a fully redesigned D GC (this is a
large work)? I don't know. But surely even incremental improvements are
welcome, because there are (or there were) enough low hanging fruits to pick.

I don't know how the current D GC performance scales to heaps of 5 or 30 Gbytes
size, composed of mostly small objects. From what I've seen this is a hard
problem even for almost-state-of-the-art GCs.


David Nadlinger:

 The easiest way would be to use a specialized allocator for this – I'm 
 thinking of David Simcha's proposed RegionAllocator (formerly TempAlloc).
I'd like a simple way to define a hierarchy of such allocations (and when you free a node of the tree, the whole sub-tree gets deallocated). Bye, bearophile
Feb 01 2012
prev sibling parent reply "dsimcha" <dsimcha yahoo.com> writes:
On Wednesday, 1 February 2012 at 23:43:24 UTC, H. S. Teoh wrote:
 Out of curiosity, is there a way to optimize for the "many small
 allocations" case? E.g., if a function allocates, as temporary 
 storage,
 a tree with a large number of nodes, which becomes garbage when 
 it
 returns. Perhaps a way to sweep the entire space used by the 
 tree in one
 go?

 Not sure if such a thing is possible.


 T
My RegionAllocator is probably the best thing for this if the lifetime is deterministic as you describe. I rewrote the Tree1 benchmark using RegionAllocator a while back just for comparison. D Tree1 + RegionAllocator had comparable speed to a Java version of Tree1 run under HotSpot (about 6 seconds on my box vs. in the low 30s for Tree1 with the 2.058 GC).

If all the objects are going to die at the same time but not at a deterministic time, you could just allocate a big block from the GC and place class instances in it using emplace().
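The second suggestion can be sketched like this (a hedged illustration; Node is a made-up class, and the manual slicing assumes the instance size keeps each slot sufficiently aligned):

```d
import std.conv : emplace;

class Node
{
    int value;
    this(int v) { value = v; }
}

void main()
{
    enum count = 1000;
    enum size = __traits(classInstanceSize, Node);

    // One GC allocation backs every instance, so the collector tracks a
    // single block instead of `count` tiny ones.
    auto block = new void[](size * count);

    Node[] nodes;
    nodes.reserve(count);
    foreach (i; 0 .. count)
        nodes ~= emplace!Node(block[i * size .. (i + 1) * size], cast(int) i);

    assert(nodes[42].value == 42);
}
```

All the instances then die when the single block does, which suits the "same lifetime, nondeterministic time" case described above.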
Feb 01 2012
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Wed, 01 Feb 2012 18:40:20 -0600, dsimcha <dsimcha yahoo.com> wrote:
 On Wednesday, 1 February 2012 at 23:43:24 UTC, H. S. Teoh wrote:
 Out of curiosity, is there a way to optimize for the "many small
 allocations" case? E.g., if a function allocates, as temporary
 storage,
 a tree with a large number of nodes, which becomes garbage when
 it
 returns. Perhaps a way to sweep the entire space used by the
 tree in one
 go?

 Not sure if such a thing is possible.


 T
My RegionAllocator is probably the best thing for this if the lifetime is deterministic as you describe. I rewrote the Tree1 benchmark using RegionAllocator a while back just for comparison. D Tree1 + RegionAllocator had comparable speed to a Java version of Tree1 run under HotSpot. (About 6 seconds on my box vs. in the low 30s for Tree1 with the 2.058 GC.) If all the objects are going to die at the same time but not at a deterministic time, you could just allocate a big block from the GC and place class instances in it using emplace().
An XML parser would probably want some kind of stack segment growth schedule, which, IIRC, isn't supported by RegionAllocator.
Feb 01 2012
parent reply "dsimcha" <dsimcha yahoo.com> writes:
On Thursday, 2 February 2012 at 04:38:49 UTC, Robert Jacques 
wrote:
 An XML parser would probably want some kind of stack segment 
 growth schedule, which, IIRC isn't supported by RegionAllocator.
I had considered putting that in RegionAllocator but I was skeptical of the benefit, at least assuming we're targeting PCs and not embedded devices. The default segment size is 4MB. Trying to make the initial size any smaller won't save much memory. Four megabytes is also big enough that new segments would be allocated so infrequently that the cost would be negligible. I concluded that the added complexity wasn't justified.
Feb 02 2012
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 2 February 2012 17:40, dsimcha <dsimcha yahoo.com> wrote:

 On Thursday, 2 February 2012 at 04:38:49 UTC, Robert Jacques wrote:

 An XML parser would probably want some kind of stack segment growth
 schedule, which, IIRC isn't supported by RegionAllocator.
at least assuming we're targeting PCs and not embedded devices.
I don't know about the implications of your decision, but that comment makes me feel uneasy. I don't know how you can possibly make that assumption. Have you looked around at the devices people actually use these days? PCs are an endangered and dying species... I couldn't imagine a worse assumption if it influences the application of D on different systems.
Feb 02 2012
parent reply "dsimcha" <dsimcha yahoo.com> writes:
On Thursday, 2 February 2012 at 18:06:24 UTC, Manu wrote:
 On 2 February 2012 17:40, dsimcha <dsimcha yahoo.com> wrote:

 On Thursday, 2 February 2012 at 04:38:49 UTC, Robert Jacques 
 wrote:

 An XML parser would probably want some kind of stack segment 
 growth
 schedule, which, IIRC isn't supported by RegionAllocator.
at least assuming we're targeting PCs and not embedded devices.
I don't know about the implications of your decision, but comment makes me feel uneasy. I don't know how you can possibly make that assumption? Have you looked around at the devices people actually use these days? PC's are an endangered and dying species... I couldn't imagine a worse assumption if it influences the application of D on different systems.
I'm not saying that embedded isn't important. It's just that for low level stuff like memory management it requires a completely different mindset. RegionAllocator is meant to be fast and simple at the expense of space efficiency. In embedded you'd probably want completely different tradeoffs. Depending on how deeply embedded, space efficiency might be the most important thing. I don't know exactly what tradeoffs you'd want, though, since I don't do embedded development. My guess is that you'd want something completely different, not RegionAllocator plus a few tweaks that would complicate it for PC use. Therefore, I designed RegionAllocator for PCs with no consideration for embedded environments.
Feb 02 2012
parent Manu <turkeyman gmail.com> writes:
On 2 February 2012 21:15, dsimcha <dsimcha yahoo.com> wrote:

 On Thursday, 2 February 2012 at 18:06:24 UTC, Manu wrote:

 On 2 February 2012 17:40, dsimcha <dsimcha yahoo.com> wrote:

  On Thursday, 2 February 2012 at 04:38:49 UTC, Robert Jacques wrote:
  An XML parser would probably want some kind of stack segment growth
 schedule, which, IIRC isn't supported by RegionAllocator.
at least assuming we're targeting PCs and not embedded devices.
I don't know about the implications of your decision, but comment makes me feel uneasy. I don't know how you can possibly make that assumption? Have you looked around at the devices people actually use these days? PC's are an endangered and dying species... I couldn't imagine a worse assumption if it influences the application of D on different systems.
I'm not saying that embedded isn't important. It's just that for low level stuff like memory management it requires a completely different mindset. RegionAllocator is meant to be fast and simple at the expense of space efficiency. In embedded you'd probably want completely different tradeoffs. Depending on how deeply embedded, space efficiency might be the most important thing. I don't know exactly what tradeoffs you'd want, though, since I don't do embedded development. My guess is that you'd want something completely different, not RegionAllocator plus a few tweaks that would complicate it for PC use. Therefore, I designed RegionAllocator for PCs with no consideration for embedded environments.
Okay, this reasoning seems sound to me. I wonder though: is it easy/possible/compatible to plug a new back end in for embedded systems? It should definitely conserve memory at all costs, but above all, it needs to be fast. Short term, embedded platforms will surely have some lenience with memory size, but demand high performance. What sort of overheads are we talking about for these allocators? I wonder what the minimum footprint of a D app would be, i.e. exe size + minimum memory allocation for the runtime/etc?
Feb 02 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, February 02, 2012 20:06:14 Manu wrote:
 On 2 February 2012 17:40, dsimcha <dsimcha yahoo.com> wrote:
 On Thursday, 2 February 2012 at 04:38:49 UTC, Robert Jacques wrote:
 An XML parser would probably want some kind of stack segment growth
 schedule, which, IIRC isn't supported by RegionAllocator.
at least assuming we're targeting PCs and not embedded devices.
I don't know about the implications of your decision, but that comment makes me feel uneasy. I don't know how you can possibly make that assumption? Have you looked around at the devices people actually use these days? PC's are an endangered and dying species... I couldn't imagine a worse assumption if it influences the application of D on different systems.
PCs are not endangered in the least. It's just that we're getting an increase in other devices (particularly smart phones and tablets). PCs are _heavily_ used, and there's no way that smart phones or tablets could replace them. They do different stuff. It _is_ true that applications are increasingly being written for non-PCs, but PCs definitely aren't dying off. Also, how much do you really treat smart phones or tablets like embedded devices rather than PCs? They're certainly more like PCs than the embedded devices of yore. True, they have stricter performance requirements, but they're nowhere near as restrictive as they used to be. - Jonathan M Davis
Feb 02 2012
prev sibling next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/2/12, Manu <turkeyman gmail.com> wrote:
 PC's are an endangered and dying species...
Kind of like when we got rid of cars and trains and ships once we started making jumbo jets. Oh wait, that didn't happen.
Feb 02 2012
parent reply "dsimcha" <dsimcha yahoo.com> writes:
On Thursday, 2 February 2012 at 18:55:02 UTC, Andrej Mitrovic 
wrote:
 On 2/2/12, Manu <turkeyman gmail.com> wrote:
 PC's are an endangered and dying species...
Kind of like when we got rid of cars and trains and ships once we started making jumbo jets. Oh wait, that didn't happen.
Agreed. I just recently got my first smartphone and I love it. I see it as a complement to a PC, though, not as a substitute. It's great for when I'm on the go, but when I'm at home or at work I like a bigger screen, a full keyboard, a faster processor, more memory, etc. Of course smartphones will get more powerful but I doubt any will ever have dual 22 inch monitors.
Feb 02 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 02, 2012 at 08:20:55PM +0100, dsimcha wrote:
 On Thursday, 2 February 2012 at 18:55:02 UTC, Andrej Mitrovic wrote:
On 2/2/12, Manu <turkeyman gmail.com> wrote:
PC's are an endangered and dying species...
Kind of like when we got rid of cars and trains and ships once we started making jumbo jets. Oh wait, that didn't happen.
Agreed. I just recently got my first smartphone and I love it. I see it as a complement to a PC, though, not as a substitute. It's great for when I'm on the go, but when I'm at home or at work I like a bigger screen, a full keyboard, a faster processor, more memory, etc. Of course smartphones will get more powerful but I doubt any will ever have dual 22 inch monitors.
[...] Funny you should mention that: a number of years ago I got to know a guy who worked in a lab that was researching organic polymer-based electronics, which lets you build things like screens and keyboards that can be rolled up or folded and put into your pocket. It was still experimental technology at the time, though, and I'm not expecting it to be publicly available anytime soon. But the day may come when your smartphone *can* literally have a 22" monitor (that folds into a pocket-sized card). T -- Famous last words: I wonder what will happen if I do *this*...
Feb 02 2012
next sibling parent "a" <a a.com> writes:
 But the day may come when your
 smartphone *can* literally have a 22" monitor (that folds into 
 a pocket
 sized card).


 T
This doesn't sound very practical.
Feb 02 2012
prev sibling next sibling parent Richard Webb <webby beardmouse.org.uk> writes:
On 02/02/2012 19:32, H. S. Teoh wrote:
 But the day may come when your
 smartphone *can* literally have a 22" monitor (that folds into a pocket
 sized card).
Nah, what you really need are those holographic displays they use on CSI Miami. No need for physical displays at all ;-)
Feb 02 2012
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02/02/2012 08:32 PM, H. S. Teoh wrote:
 On Thu, Feb 02, 2012 at 08:20:55PM +0100, dsimcha wrote:
 On Thursday, 2 February 2012 at 18:55:02 UTC, Andrej Mitrovic wrote:
 On 2/2/12, Manu<turkeyman gmail.com>  wrote:
 PC's are an endangered and dying species...
Kind of like when we got rid of cars and trains and ships once we started making jumbo jets. Oh wait, that didn't happen.
Agreed. I just recently got my first smartphone and I love it. I see it as a complement to a PC, though, not as a substitute. It's great for when I'm on the go, but when I'm at home or at work I like a bigger screen, a full keyboard, a faster processor, more memory, etc. Of course smartphones will get more powerful but I doubt any will ever have dual 22 inch monitors.
[...] Funny you should mention that, a number of years ago I got to know a guy who worked in a lab that was researching organic polymer-based electronics, which lets you build things like screens and keyboards that can be rolled up or folded and put into your pocket. It was still experimental technology at the time, though, and I'm not expecting it to be publicly available anytime soon. But the day may come when your smartphone *can* literally have a 22" monitor (that folds into a pocket sized card). T
Chances are that region allocator can run on mobile phones just fine at the time that happens. ;)
Feb 02 2012
prev sibling next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Long term I suspect the typical desktop experience will just be a phone or
other mobile device talking wirelessly to input and display peripherals.

On Feb 2, 2012, at 11:20 AM, "dsimcha" <dsimcha yahoo.com> wrote:

 On Thursday, 2 February 2012 at 18:55:02 UTC, Andrej Mitrovic wrote:
 On 2/2/12, Manu <turkeyman gmail.com> wrote:
 PC's are an endangered and dying species...
Kind of like when we got rid of cars and trains and ships once we started making jumbo jets. Oh wait, that didn't happen.
Agreed. I just recently got my first smartphone and I love it. I see it as a complement to a PC, though, not as a substitute. It's great for when I'm on the go, but when I'm at home or at work I like a bigger screen, a full keyboard, a faster processor, more memory, etc. Of course smartphones will get more powerful but I doubt any will ever have dual 22 inch monitors.
Feb 02 2012
parent "F i L" <witte2008 gmail.com> writes:
People and, more importantly, professionals will always want the 
most power they can get to do their work. Whether that power 
remains on the client system or is off-set to the home/office 
server or cloud is the big question.

I think the future will see an increase of local servers which 
intelligently allocate resources to transport-friendly modular 
client devices in both homes and businesses.
Feb 02 2012
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 02, 2012 at 01:22:54PM -0800, Sean Kelly wrote:
 Long term I suspect the typical desktop experience will just be a
 phone or other mobile device talking wirelessly to input and display
 peripherals.
[...] Funny, it comes full-circle to the good ole days of dumb X terminals that only handle input/display, and leave the actual computing to the server. Except now the server is no longer an oversized ceiling-high machine in the server room, but a tablet you can fit into your pocket. :) However, due to the inconvenience of needing to be near a bulky monitor and keyboard, I suspect rather that the mobile device will come with a built-in projector and foldable keyboard. The latter already exists; the former allows the convenience of usability anywhere there's a large flat surface. T -- Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be algorithms.
Feb 02 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
On 2 February 2012 21:20, dsimcha <dsimcha yahoo.com> wrote:

 On Thursday, 2 February 2012 at 18:55:02 UTC, Andrej Mitrovic wrote:

 On 2/2/12, Manu <turkeyman gmail.com> wrote:

 PC's are an endangered and dying species...
Kind of like when we got rid of cars and trains and ships once we started making jumbo jets. Oh wait, that didn't happen.
Agreed. I just recently got my first smartphone and I love it. I see it as a complement to a PC, though, not as a substitute. It's great for when I'm on the go, but when I'm at home or at work I like a bigger screen, a full keyboard, a faster processor, more memory, etc. Of course smartphones will get more powerful but I doubt any will ever have dual 22 inch monitors.
You're obviously a nerd though, hanging out in a place like this :) There is no small number of people who use their PCs purely for communication, Facebook, email, Skype, etc., and there are well-documented trends showing more and more people are performing those tasks exclusively from their phones/portable devices. They are now, already, ex-PC-users. That trend is accelerating (I wish I could remember where the slides were _<)
My girlfriend, for instance: I bought an iPad to do some dev on... she confiscated it one day after I showed her the eBook app, which she loved, and promptly threw away her whole hard copy library... Funnily though, and completely unconsciously on her part, I haven't seen her open her computer more than once or twice since that time. It has everything on it she ever used her computer for, it's more portable and convenient, fits in her bag (takes it to class), and it even has a recipe app and conveniently sits on the bench while she's baking :)

I have had the same discussion in the office, and I'm not the only one who has seen a similar transition first hand. Most people aren't computer nerds... they don't give 2 shits about computers, it's just a tool, and the second something more practical or convenient comes along for their purposes, they'll use that instead. That time has already passed, the trend is accelerating quickly, the snowball is rolling; even REAL nerds are starting to slip. (I must confess, I'm seriously tempted by a Transformer Prime...)

I think the MOST interesting part about this trend, though, is that many of these people who used their laptop/PC *exclusively* for communication have now discovered these mobile app stores, and they're being exposed to apps that do all sorts of things. Far more than they ever did on their PCs of days past. Even games; people who would say they have no interest in video games at all are picking up simple mobile/tablet games at an alarming rate.

This is a huge and fast-growing software market, almost all the software is written from scratch (read: developers *could* choose D if it were available), and D should be a part of it! The language landscape in this space is very turbulent at the moment, but it will settle, and D needs to be in it before it does, or it will miss the boat.
Feb 02 2012
prev sibling parent Manu <turkeyman gmail.com> writes:
On 2 February 2012 23:22, Sean Kelly <sean invisibleduck.org> wrote:

 Long term I suspect the typical desktop experience will just be a phone or
 other mobile device talking wirelessly to input and display peripherals.
'Long term'? I think you're being a bit pessimistic. It's already here! http://eee.asus.com/eeepad/transformer-prime/features/ (do want!) I give it... maybe 2-3 years before it's standard (a very significant portion of the hardware landscape). I don't think that's very long... certainly not a lot of time for D to get its shit together! ;)
Feb 02 2012
prev sibling parent Manu <turkeyman gmail.com> writes:
On 2 February 2012 20:13, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 On Thursday, February 02, 2012 20:06:14 Manu wrote:
 On 2 February 2012 17:40, dsimcha <dsimcha yahoo.com> wrote:
 On Thursday, 2 February 2012 at 04:38:49 UTC, Robert Jacques wrote:
 An XML parser would probably want some kind of stack segment growth
 schedule, which, IIRC isn't supported by RegionAllocator.
at least assuming we're targeting PCs and not embedded devices.
I don't know about the implications of your decision, but that comment makes
me
 feel uneasy.

 I don't know how you can possibly make that assumption? Have you looked
 around at the devices people actually use these days?
 PC's are an endangered and dying species... I couldn't imagine a worse
 assumption if it influences the application of D on different systems.
PCs are not endangered in the least. It's just that we're getting an increase in other devices (particularly smart phones and tablets). PCs are _heavily_ used, and there's no way that smart phones or tablets could replace them. They do different stuff. It _is_ true that applications are increasingly being written for non-PCs, but PCs definitely aren't dying off. Also, how much do you really treat smart phones or tablets like embedded devices rather than PCs? They're certainly more like PCs than the embedded devices of yore. True, they have stricter performance requirements, but they're nowhere near as restrictive as they used to be.
They're restrictive devices with weaker hardware, yet people generally seem to expect roughly the same experience from them as they get from their PC. A modern PC's software can be written in any language the programmer likes, with the expectation that the result will generally run well. Performance is almost a 'solved' problem on PCs for most applications.

I really just argue that D, a so-called systems language [this is surely D's primary offering/niche, or am I mistaken? It basically competes with C/C++, and on roughly the same terms], needs to constantly consider (and potentially even favour) the performance requirements of these weaker systems. They are the devices with the highest demand for efficiency placed on them.

For productivity applications, PCs will remain dominant, sure, but consider the number of productivity applications vs the number of entertainment products/games/apps/toys. The scope doesn't even compare. Also consider that productivity software is rarely written from scratch. Most productivity software people use is already written, only updated now, and isn't going to restart in D any time soon.

Games/apps/toys, in terms of the number of pieces of software developed and the number of developers writing such software (read: **NEW** software, where the choice of using a new language like D is a *real* possibility), completely blows the PC developer base SOOO far out of the water it's barely quantifiable, probably thousands to 1. That developer base isn't working on x86 platforms; they're targeting games consoles, phones, tablets... TVs will be big very soon. They are a large community, grown very tired of C/C++, who want a language that's not shit and that is exactly as efficient as C/C++ when used correctly. D fits this bill like a glove, and also has the advantage of being compatible with their existing libraries.

Embedded hardware toyed with the notion of using managed languages for a little while, not even offering C toolchains at first (Android, Windows Phone), but that failed, and both revoked that policy. The bar is rising fast on those platforms; everyone producing high-end products required native development to remain competitive. There's really not a lot of modern native-language competition out there. D is positioned well, and with this fact in mind, D's performance must be as comparable as possible to C/C++'s. In fact, D has potential for many performance *advantages* over C/C++, but the GC is a key concern for me personally... and if it's possible to tune it for these systems, it should be tuned. They require it more, and PC honestly won't care that much :)

Apologies for once again derailing an unrelated topic! :) I really just can't handle seeing this sort of presumption constantly popping up all over the newsgroup. This PC-centric mentality has to go if D is to have any commercial success. The commercial software industry moved on years ago; it's not on PC anymore... the D community needs to become intimately aware of that fact, whether they like it or not. (presuming of course that the intent here is for D to be successful :)
Feb 02 2012
prev sibling parent reply Richard Webb <richard.webb boldonjames.com> writes:
Out of interest, I just tried loading the same file with std.xml, and 
the performance there is pretty similar in each version, possibly 
slightly slower in 2.058 (~21-22 seconds in each case).

Disabling the GC during the load gets 9 seconds, though task manager 
reports a peak memory usage of almost 600 megabytes in that case!
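The disable-the-GC-during-the-load experiment above can be sketched as a small D pattern; `fakeParse` below is a hypothetical stand-in for the real xmlp/std.xml load, but the `GC.disable`/`scope (exit)`/`GC.collect` shape is the point:

```d
import core.memory : GC;

// Stand-in for an allocation-heavy parse, e.g. loading a ~20 MB XML
// file (hypothetical; the real test used xmlp and std.xml).
int[][] fakeParse(size_t n)
{
    int[][] nodes;
    nodes.reserve(n);
    foreach (i; 0 .. n)
        nodes ~= new int[8]; // many small GC allocations
    return nodes;
}

int[][] loadWithGCDisabled(size_t n)
{
    GC.disable();             // no collections while parsing...
    scope (exit) GC.enable(); // ...but re-enable even if parsing throws
    return fakeParse(n);
}
```

With collections suspended, memory peaks during the load (almost 600 MB in the test above); a single explicit `GC.collect()` afterwards reclaims the parse temporaries in one pass instead of many mid-parse collections.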

It looks like most of the time here is spent in Gcxmark whereas with 
xmlp it was in Gcxfullcollect (and fullcollect is the one that is faster 
in 2.058).
The profiler makes it look like things are spending more time in Gcxmark 
than they were before. Is that the case?


I'll try to have a go with Tango when I get some more time.
Feb 02 2012
parent Michael Rynn <michaelrynn optusnet.com.au> writes:
On Thu, 02 Feb 2012 15:44:56 +0000, Richard Webb wrote:

 
With the xml package (xmlp) and the linked node DOM, the GC is likely to fail to clean up. I divided the generated test file, with its 2-layer elements, into files of 5, 50, 500 and 5000 elements. I put a mixin on the Node class to keep static counts of Node objects created and deleted. I also set the document to be read multiple times, calling GC.collect between each read, and timed it. The increase in read time, and the increase in GC.collect time, was strong evidence that the GC was unable to clean it up, even though, on my naive code inspection, the previous document instance was no longer reachable from GC "roots", i.e. it should have been directly replaced on the stack.

The linkdom rather too faithfully copies the Java DOM model, with doubly-linked pointers. Elements have an "owner document" (I felt like ditching this). All element children derive from ChildNode (Element, Text, CDATA and Comment), and have a pointer to the parent_ node (could ditch this, maybe for CharData). Anyway, it's a pretty linked-up, conventional model. Java's GC has evidently got around the problems of collecting XML documents. When I looked at the C code for libxml2, for instance, all the Nodes have that sort of interlocked linkage. I tried once to see how far I could get (on Windows) calling libxml2 from D. I did not get very far, and gave up after crashes and a headache.

On the GC cleanup stats: the generated test with 5 nodes was cleanable, but at 50,000, well, the poor GC could not delete a single Node object. I've only done this with dmd2_32 on Windows; 64-bit may be different. It's also the GitHub release of 2.058 - 162, and may not be in a releasable state for GC.

The original std.xml seemed to fare best on GC cleanup, but still ran into trouble at each larger document size. std.xmlp.arraydom was similar. Both of these have no backpointers to parents, and use an array to hold children. I thought of just nulling the backpointers, and wrote a version of an explode method.
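The explode idea — nulling the backpointers so the cycle-laden tree falls apart into small, individually unreferenced pieces the GC can reclaim — might look roughly like this (a hypothetical sketch; the thread does not show xmlp's actual code):

```d
/// Hypothetical doubly-linked DOM node, as in linkdom: children
/// point back at their parent, creating reference cycles that a
/// conservative GC may fail to isolate and collect.
class Node
{
    Node parent;
    Node[] children;

    Node appendChild(Node c)
    {
        c.parent = this;
        children ~= c;
        return c;
    }

    /// "Explode": break every link so the tree falls apart into
    /// non-referencing pieces the GC can manage one by one.
    void explode()
    {
        foreach (c; children)
            c.explode();
        children = null;
        parent = null;
    }
}
```

Calling `root.explode()` when a document is no longer needed leaves no parent or child links behind, so the next collection sees only isolated objects instead of one big interlocked graph.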
I tried this with the node version doing, or not doing, a delete. With a big document, it currently has to do a delete, or the GC will not clean up everything.

I turned the GC stats counters into a mixin template and put them on std.xml1, on std.arraydom, and on any intermediate objects from the parsing that I wanted to confirm were not hanging around. For too many of them, I found it necessary to chase them down and force a delete (I now prefer explode, because delete implies a complete cleanup, and for me, explode means to split into little non-referencing pieces that the GC can manage). This means I use too many pointer-linked structures for the GC. It may be just as difficult for the red-black tree in std.container; I haven't checked.

For me one lesson, at least, is: do not assume a set of interlocked classes with circular references can always be isolated in a single full GC collection and properly deleted. For internal temporaries that will never be seen by external code, it seems a good idea to explode them. For large complex tree structures, it might take less CPU time to explode them than to have this particular version of the GC work it out with less information. Do explode again: after exploding, repeated runs and bigger loads ran optimally.

So I read with interest the other posts on the mechanisms of GC and how to integrate with the compiler to tag classes and stack layout, to know pointers precisely. I particularly liked the suggestion of a mixin on each class for GC layout. What could be done, even ad hoc, is to do GC stats on various kinds of complex self-referencing classes, to examine the effectiveness of GC collection. This could even be a compiler flag, or version. It was useful for me to spot where I could reuse a class rather than construct a new one. A lot of D applications may be run once, and GC doesn't matter so much. But for long-running games and servers, verified GC behaviour will bring more confidence.
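The instance-counting mixin template described above might look roughly like this (names are invented; whether `destroyed` ever catches up with `created` after a GC.collect is exactly what the measurement reveals):

```d
/// Hypothetical instance-stats mixin: mix into any class whose
/// creation/destruction counts you want to track.
mixin template InstanceStats()
{
    static size_t created;
    static size_t destroyed;

    static size_t live() { return created - destroyed; }

    // Call these from the host class's constructor/destructor.
    protected void statCreated()   { ++created; }
    protected void statDestroyed() { ++destroyed; }
}

class CountedNode
{
    mixin InstanceStats;

    this()  { statCreated(); }
    ~this() { statDestroyed(); } // runs only if the GC finalizes it
}
```

If `CountedNode.live()` keeps growing across reads and explicit `GC.collect()` calls even though each previous document is unreachable, the collector is failing to reclaim the graph — the evidence described in the post.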
Aggressive measurement may allow better detection of where and when the GC fails, and better assessment of the effects of GC code changes. GC is a tough problem, and a hard one to base a whole language around.
Feb 25 2012