
digitalmars.D.learn - Class size and performance

reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
For some code I'm writing, I have a parser that creates instances of a 
class.  For a test case I have, it creates 728,050 instances.  That's a 
lot, and takes a lot of time, but this is life.

I've got a few ideas on optimizing it, but while testing some of them 
out I noticed something interesting.  If I comment out a single, unused 
member... I get double the speed.

The number and types of members do not matter, just their size.

I wrote a reduced test case.  To compile it, use either of these two 
commands:

dmd -version=fast -O -inline -run weird.d
dmd -version=slow -O -inline -run weird.d

The same effect happens with gdc.  It also happens with optimization and 
inlining on or off (they make no difference.)

import std.stdio, std.perf;

class TestClass
{
	ubyte[56] member1;

	version (slow)
		ubyte[4] member2;
}

void main()
{
	// This is just how many my real example used.
	const test_instances = 728_050;

	// Let's store them for accuracy to the problem.
	TestClass[] example = null;
	example.length = test_instances;

	auto counter = new PerformanceCounter();

	counter.start();
	for (int i = 0; i < test_instances; i++)
		example[i] = new TestClass();
	counter.stop();

	PerformanceCounter.interval_type usecs = counter.microseconds();

	writefln("%f seconds", cast(double) usecs / 1_000_000.0);
}

Also of note, if I put a custom allocator on to determine the size, it's 
slow until I remove another 4 bytes of members.

Is there some sort of threshold in the gc?  Why is this happening?

Thanks,
-[Unknown]
Jan 20 2008
next sibling parent reply Daniel Lewis <murpsoft hotmail.com> writes:
Sir, I'm not sure about the GC, but I can say that your program probably 
doesn't need to create 728,000 objects to parse an input stream.

I'm writing a parser too at the moment; it's a rather experimental, 
oddly written one though.  It [should] perform tokenization and parsing 
in a single O(1) pass for the ECMAScript language.

Take a look if you're interested:
http://dsource.org/projects/walnut/browser/branches/1.9/source/interpreter.d

Regards,
Dan
Jan 20 2008
next sibling parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Daniel Lewis,

Actually, I made a mistake.  That's running the parsing 50 or 75 times 
(I can't remember now.)  So it was actually only like 14,500 objects per 
parsing.

Yes, it's true that most parsing does not require such a large number of 
objects (and a 200 kilobyte file as I'm talking about here really 
shouldn't need so many.)  However, I'm not parsing a configuration file 
or even a command file.  In my case it does really need that many 
objects (this is the purpose of parsing.)

My parser is actually designed to work on an incoming stream, parse on 
the fly in chunks, and according to my benchmarks against similar 
parsers, is quite fast already.

Even so, that is interesting.  My parser works similarly in some ways to 
yours (switches and parseXYZ functions.)

-[Unknown]


Jan 20 2008
prev sibling parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
 I'm writing a parser too at the moment; it's a rather experimental, oddly
written one though.  It [should] perform tokenization and parsing in a single
O(1) pass for the ECMAScript language.

I call shenanigans. Your output will be O(n), no? Your parser can't possibly generate O(n) output in O(1) time. --bb
Jan 20 2008
parent reply Daniel Lewis <murpsoft hotmail.com> writes:
Bill Baxter Wrote:

 I'm writing a parser too at the moment; it's a rather experimental, oddly
written one though.  It [should] perform tokenization and parsing in a single
O(1) pass for the ECMAScript language.

I call shenanigans. Your output will be O(n), no? Your parser can't possibly generate O(n) output in O(1) time. --bb

I meant for each given token; the same as LL(1) or LALR(k) notation, but 
yes, it's a misuse of big-O.

The idea I was trying to express is this: in most parsers, you run a 
giant all-you-can-eat tokenizer switch, which returns a token, and then 
you run some autogenerated parser switch which generates the AST.  From 
there, D generates an IR tree which is much flatter than the original 
AST, and then interprets that.

Mine runs by performing context-sensitive tokenization switches which 
generate the AST directly from the input stream.  It's not working for a 
lot of different cases; there are 26 BUGS mentioned in the comments; but 
all of them seem solvable from within that paradigm.
Jan 20 2008
parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Yes, this is how mine works and definitely a good solution imho. 
Faster, more efficient, easier to debug, and cleaner.

Personally I am no fan of the auto generated parsers.  But then again, 
it depends on what you're parsing... I'm sure you'll be able to get it 
to parse ECMA cleanly and efficiently.

-[Unknown]


Jan 20 2008
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Unknown W. Brackets wrote:
 class TestClass
 {
     ubyte[56] member1;
 
     version (slow)
         ubyte[4] member2;
 }

Without version(slow), TestClass will occupy 64 bytes: 4 bytes for the 
vtbl ptr, 4 bytes for the monitor ptr, and 56 bytes for the data.  With 
version(slow), the instance needs 68 bytes of storage, and since the GC 
hands out memory in power-of-two bins, it will reserve 128 bytes for the 
same class.

Sean
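You can check the instance size yourself (a sketch against D1 of that 
era; `classinfo.init.length` reports the raw instance size, before the 
GC rounds it up to a bin, and the 64/68 figures assume a 32-bit build):

```d
import std.stdio;

class Small { ubyte[56] member1; }                    // 4 + 4 + 56 = 64
class Big   { ubyte[56] member1; ubyte[4] member2; }  // 4 + 4 + 60 = 68

void main()
{
	// init.length is the length of the initializer blitted into a
	// new instance, i.e. the instance size including the vtbl and
	// monitor pointers, but not the GC's rounding.
	writefln("Small: %d bytes", Small.classinfo.init.length);
	writefln("Big:   %d bytes", Big.classinfo.init.length);
}
```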
Jan 20 2008
parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
Ah, thank you.  That makes a lot of sense.  I thought it was interesting 
that it was near the 64 mark, but didn't realize there were both vtbl 
and monitor pointers.

Sounds like this would probably be resolved by creating a custom 
allocator, which preallocates space for a number of instances in chunks, 
which was already an idea I had to test for optimizing performance (the 
objects themselves will most likely be deleted all at once, or not, so 
this should be a good optimization.)

Thanks.

-[Unknown]


Jan 20 2008
parent reply torhu <no spam.invalid> writes:
Unknown W. Brackets wrote:
 Ah, thank you.  That makes a lot of sense.  I thought it was interesting 
 that it was near the 64 mark, but didn't realize there were both vtbl 
 and monitor pointers.
 
 Sounds like this would probably be resolved by creating a custom 
 allocator, which preallocates space for a number of instances in chunks, 
 which was already an idea I had to test for optimizing performance (the 
 objects themselves will most likely be deleted all at once, or not, so 
 this should be a good optimization.)

One trick Walter uses in DMD is free lists; did you check that out? 
Obviously it won't work if you need all of the objects allocated at the 
same time.

http://www.digitalmars.com/d/memory.html#freelists
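For reference, the pattern from that page looks roughly like this (a 
sketch; `Foo` stands in for whatever class is being recycled):

```d
class Foo
{
	static Foo freelist;  // start of the free list
	Foo next;             // links recycled instances together

	static Foo allocate()
	{
		Foo f;
		if (freelist !is null)
		{
			// Reuse a previously deallocated instance instead
			// of asking the GC for a fresh block.
			f = freelist;
			freelist = f.next;
		}
		else
			f = new Foo();
		return f;
	}

	static void deallocate(Foo f)
	{
		f.next = freelist;
		freelist = f;
	}
}
```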
Jan 20 2008
parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
That's a good trick, but it's probably not helpful in this case.  My 
problem is, I know that I'm going to have 1000 of these objects. 
There's no reason to allocate each one individually.  Rather I should 
create a pool for them, and shove the instances into the pool.

Free lists are great when I'm going to be trashing objects frequently. 
My problem is I have 1000 objects I need to create, and end with.  And, 
if I ever delete any of those, the largest possibility is that I'm 
deleting all of them.

Since my problem (of this thread) appears to be a doubling of the 
allocation size once an instance passes 64 bytes, this should let me 
better manage the memory.  I'm not sure whether I want to pack the 
instances tightly within my buffer or place one every 128 bytes (which 
might give better alignment), but either way, fewer allocations will, 
I'm hoping, improve speed.

I'll have to play with a few different things.

-[Unknown]


Jan 20 2008
parent reply Sean Kelly <sean f4.ca> writes:

In Tango, use GC.sizeOf to tell you the size of the block reserved for 
the pointer in question, and GC.reserve to pre-allocate a chunk of 
memory if you have an idea of the high-water mark of your app. 
GC.reserve is a relatively new addition and can speed up an app 
tremendously in some cases.

Sean
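Usage is roughly this (a sketch assuming Tango's tango.core.Memory 
module; the reserve figure is illustrative, sized as instance count 
times the 128-byte bin from earlier in the thread):

```d
import tango.core.Memory;
import tango.io.Stdout;

class TestClass { ubyte[60] data; }

void main()
{
	// Pre-allocate roughly what the run is expected to need, so
	// the heap isn't grown one collection at a time.
	GC.reserve(728_050 * 128);

	auto obj = new TestClass;
	// Reports the size of the GC block backing obj -- the bin
	// size, not the raw instance size.
	Stdout.formatln("block size: {}", GC.sizeOf(cast(void*) obj));
}
```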
Jan 21 2008
parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message 
news:fn1l3g$2v3n$1 digitalmars.com...

 GC.reserve is a relatively new addition and can speed up an app
 tremendously in some cases.

Can't wait for 0.99.5 then ;) (as you know I'm currently allocating large chunks of memory and then deleting them to fake a reserve operation)
Jan 21 2008