
digitalmars.D - Refining AA's

reply Manfred Nowak <svv1999 hotmail.com> writes:
As walter stated, the current implementation is good for general 
usage.

But what are the parameters for general usage?

I ask because I just implemented a faster version according to the 
parameters suggested by me.

-manfred
Nov 30 2005
next sibling parent Chris <ctlajoie yahoo.com> writes:
On Wed, 30 Nov 2005 08:02:18 +0000 (UTC), Manfred Nowak
<svv1999 hotmail.com> wrote:
As walter stated, the current implementation is good for general 
usage.

But what are the parameters for general usage?

I ask because I just implemented a faster version according to the 
parameters suggested by me.

-manfred

I believe he said his hashtable was good enough for general use, but it also scales up better than most. I don't recall for sure though, and I haven't searched.

Chris
Nov 30 2005
prev sibling next sibling parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 As walter stated, the current implementation is good for general 
 usage.
 
 But what are the parameters for general usage?

I would say it's insertion and lookup of datasets of reasonable size (less than perhaps 50MB as a shot in the dark). For example, I typically use a word occurrence count for AA performance testing, and my "large" dataset is the full text of the bible (as it's the largest non-random text file I could find).

Sean
Nov 30 2005
next sibling parent reply pragma <pragma_member pathlink.com> writes:
In article <dmknqb$11aa$1 digitaldaemon.com>, Sean Kelly says...
Manfred Nowak wrote:
 As walter stated, the current implementation is good for general 
 usage.
 
 But what are the parameters for general usage?

I would say it's insertion and lookup of datasets of reasonable size (less than perhaps 50MB as a shot in the dark). For example, I typically use a word occurrence count for AA performance testing, and my "large" dataset is the full text of the bible (as it's the largest non-random text file I could find).

Sean,

On a whim, I tried looking for something bigger. As it turns out, "War and Peace" weighs in at 3.13MB (uncompressed ASCII) over at Project Gutenberg:

http://www.gutenberg.org/etext/2600

But it loses out to the complete King James Bible (4.24 MB):

http://www.gutenberg.org/etext/10

Alas, there's no way to search based on the size of an ebook.

I also just learned that they support texts in different languages and encodings. They have an "online recoding service" that can change the ASCII into UTF-8 or some random ISO codepage and so forth. Provided their converter is accurate, it could make for some nice test data.

- EricAnderton at yahoo
Nov 30 2005
parent reply Sean Kelly <sean f4.ca> writes:
pragma wrote:
 
 On a whim, I tried looking for something bigger. As it turns out, "War and
 Peace" weighs in at 3.13MB (uncompressed ascii) over at Project Gutenberg:
 
 http://www.gutenberg.org/etext/2600
 
 But it loses out to the complete King James Bible (4.24 MB):
 
 http://www.gutenberg.org/etext/10
 
 Alas, there's no way to search based on the size of an ebook.
 
 I also just learned that they support texts in different languages and
 encodings.  They have an "online recoding service" that can change the ASCII
 into UTF-8 or some random ISO codepage and so forth.  Provided their converter
 is accurate, it could make for some nice test data.

Yeah, the King James Bible from Project Gutenberg is my standard test input :-) Interesting about the UTF conversion though. I hadn't realized they offered that feature. Sean
Nov 30 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 the King James Bible from Project Gutenberg is my standard
 test input

Standard wc-example yields about 100 ms spent for the operations in the AA for the King James Bible. What are your criteria for performance, if this is indeed the largest possible test data? -manfred
Dec 02 2005
parent Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 the King James Bible from Project Gutenberg is my standard
 test input

Standard wc-example yields about 100 ms spent for the operations in the AA for the King James Bible. What are your criteria for performance, if this is indeed the largest possible test data?

I was cleaning up some directories about a week ago and lost my AA test in the process. I'll rewrite something when I find some free time. But the metrics I was comparing were memory usage, cache misses, and test execution time, with execution time obviously being the most important factor. I didn't make any attempt to reduce or eliminate spurious memory allocations, though when I re-do the tests I'm going to use a Slice class I wrote in C++ instead of std::string. Sean
Dec 02 2005
prev sibling next sibling parent "Dave" <Dave_member pathlink.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message 
news:dmknqb$11aa$1 digitaldaemon.com...
 Manfred Nowak wrote:
 As walter stated, the current implementation is good for general usage.

 But what are the parameters for general usage?

I would say it's insertion and lookup of datasets of reasonable size (less than perhaps 50MB as a shot in the dark). For example, I typically use a word occurrence count for AA performance testing, and my "large" dataset is the full text of the bible (as it's the largest non-random text file I could find).

Sean

That's a good one as a 'stress test', IMO. FWIW, the way I generally use them is at most 100 or so insertions followed by lots and lots of lookups and/or value updates. That would be a good "common case" test, maybe, and it also stresses the most important part - lookups (since lookups are obviously used for insertions and updates too) - while leaving the GC out of the results.
Nov 30 2005
prev sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 (as it's the largest non-random text file I could
 find). 

I do not understand why randomness is not suitable for general usage.

-manfred
Dec 02 2005
parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 (as it's the largest non-random text file I could
 find). 

I do not understand why randomness is not suitable for general usage.

I wanted to see results given a real-world example. Using a massive dataset of randomly generated gibberish, i.e. "jkhfdsa jksvldijsd sajasd d asdlk wpejvccx...", seems like it would skew results slightly by reducing the chance of collisions. I suppose a fancier method might use a Markov generator to create arbitrarily large pseudo-English text to process, but I was only testing casually and didn't want to spend the time writing the code.

Sean
Dec 02 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 a fancier method might use a Markov generator to create
 arbitrarily large pseudo-English text to process

Word counts do not capture anything about a language except the word frequencies and the corresponding word lengths. You can get these from the known language institutes and build a test of any size you wish, by randomizing over those frequencies and ...

But I still do not understand why word counting any example from a natural language could be a measure for the performance of an AA in the general case.

-manfred
Dec 02 2005
parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 a fancier method might use a Markov generator to create
 arbitrarily large pseudo-English text to process

Word counts do not capture anything about a language except the word frequencies and the corresponding word lengths. You can get these from the known language institutes and build a test of any size you wish, by randomizing over those frequencies and ... But I still do not understand why word counting any example from a natural language could be a measure for the performance of an AA in the general case.

Only because it's an example of typical use, so it seems a reasonable metric to use. Sean
Dec 02 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 why word counting any example
 from a natural language could be a measure for the performance
 of an AA in the general case.

Only because it's an example of typical use, so it seems a reasonable metric to use.

What other examples of typical usage do you know, and how should the measures taken for those examples be weighted?

-manfred
Dec 03 2005
parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 why word counting any example
 from a natural language could be a measure for the performance
 of an AA in the general case.

reasonable metric to use.

What other examples of typical usage do you know, and how should the measures taken for those examples be weighted?

I suppose I should have qualified that. It's an example of how *I* typically use hashtables. For an informal test, I'd rather have one that was meaningful to me. Sean
Dec 03 2005
parent Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 It's an example of how *I* typically use hashtables.

Thx. You suggest a word count for a text taken from a natural language. Dave suggests some examples from the shootout. I suggest an abstract sizeable problem. Walter surely has some more examples.

Then how should I build up a scenario consisting of examples individuals personally prefer, and weight them to get performance differences most agree upon?

-manfred
Dec 03 2005
prev sibling parent reply "Dave" <Dave_member pathlink.com> writes:
"Manfred Nowak" <svv1999 hotmail.com> wrote in message 
news:Xns971E5BF112D92svv1999hotmailcom 63.105.9.61...
 As walter stated, the current implementation is good for general
 usage.

 But what are the parameters for general usage?

 I ask because I just implemented a faster version according to the
 parameters suggested by me.

 -manfred

Attached are the "Shootout Benchmark" tests involving AA's, both past and present. They've been written to minimize GC thrashing, so they are a good test of just the AA implementation for common AA's, and they generally emphasize different parts of the ABI so as to give a good indication of overall performance, I think. Plus, by looking at how D currently does compared to other languages on the Shootout site, we could extrapolate how your new implementation looks against the other languages too.

Look over at: http://shootout.alioth.debian.org/ to find out how they are currently run and for input files.

Please post your results too. (P.S.: v138 and v139 of phobos were apparently compiled without optimizations and/or with debug symbols, so you need to use v140 for a fair comparison).

Thanks,

- Dave
Nov 30 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Dave wrote:

[...]
 Please post your results too.

Preliminary results after some tuning of the parameters of my scalable implementation:

size      ins      acc     t.W    t.me     gain
   4       16  9999984    1338    1137  -15,02%
   7      130  9999870    1345    1136  -15,54%
  10     1000  9999000    1395    1147  -17,78%
  14    16000  9984000    1554    1140  -26,64%
  17   130000  9870000    2092    1256  -39,96%
  20   970000  9030000    3354    1763  -47,44%
  22  2400000  7600000    7367    2183  -70,37%
  24  4900000  5100000   21227    3282  -84,54%
  26  8000000  2000000   58396   15691  -73,13%
  27  8900000  1100000  108875   51929  -52,30%
  28  9400000   600000  112160  193036   72,11%*

legend:
size := binary logarithm of the size of the abstract problem
ins  := number of insertions into the AA
acc  := number of accesses to elements of the AA
t.W  := total time in ms needed by Walter's implementation
t.me := total time in ms needed by my implementation
gain := relative time gain after subtracting 770 ms, which is the time
        needed with no accesses to AA's
*    := data insufficient

comments:

1. My implementation uses more space than Walter's. Therefore the time gain starts to stall at about 5M elements inserted, and it collapses when about 9M elements are inserted, based on the 2GB RAM of my machine.

2. One try with an extreme setting of the parameters of the implementation gave the hint that a time reduction to about 15% of Walter's implementation up to 5M elements might be possible, but dmd needed nearly an hour to produce the executable for that one try :-(

3. According to the answers in the thread, my implementation currently is not suitable for general use, because the abstract problem I used was restricted to fixed-length keys. I am experimenting with an appropriate extension.

-manfred
Dec 02 2005
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 
 3. According to the answers in the thread my implementation 
 currently is not suitable for general use, because the abstract 
 problem I used was restricted to fixed length keys. I am 
 experimenting with an appropriate extension.

My unordered_set implementation doesn't support rehashing yet--I wrote it for use in a compiler and didn't have the time to bother with it. If you have a succinct test example I could run it on I'll give it a shot, but it'll have a bit of an unfair advantage until I get around to finishing it. Sean
Dec 02 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 If you have a succinct test example I could run
 it on I'll give it a shot, but it'll have a bit of an unfair
 advantage until I get around to finishing it.

I don't understand you fully. If you mean to give your implementation a try on my framework, here it is:

-manfred

/+
  The abstract problem asks for the time needed for n value changes for
  keys somehow randomly chosen out of the range [0 .. s-1].

  The abstract problem has two parameters, n and s.

  n is the total number of operations on the AA. Initially I set it so
  that the resulting total time was within a reasonable limit, i.e. 60 s.
  But it turned out to be also 10% more than the number of elements that
  fit into the RAM of the target machine. Therefore the latter
  approximation should be chosen.

  s is the size of the space from which the keys are drawn. It is
  reasonable to choose s to be a power of 2.

  The "somehow" in the description above reduces the randomness of the
  draw according to the well known 80:20 rule, i.e. 80% of the
  operations are executed on 20% of the key space.

  Compilation parameters:
  -version=a             run will give the time needed without any
                         accesses to AA
  -version=aaa           run will give the time needed for the current
                         implementation of the AA
  -version=aa            run will give the time needed for the candidate
  -version=aa -debug=asrt  will test the candidate's correctness
                         against the current implementation
+/
void main(){
    alias HighPerformanceCounter Timer;
    Timer time;
    time= new Timer;
    const uint ttotal= 64*1024*1024,   // Problem size s
               tsmall= ttotal/ 5;      // i.e. the target space
    const uint ntotal= 10_000_000,     // total number of operations n
               nsmall= ntotal/4,
               nlarge= ntotal-nsmall;
    time.start();
    version( aaa){
        uint[ uint] aaa;
        uint r= void;
        for( int i= 1; i<= nlarge; i++){
            r= rand();
            aaa[ (r)%tsmall]= i;
        }
        for( int i= 1; i<= nsmall; i++){
            r= rand();
            aaa[ (r)%ttotal]= i;
        }
        writef( aaa.length, " ");
    } else {
        version( aa){
            AA aa= new AA;
            debug(asrt) uint[ uint] a;
            uint r= void;
            for( int i= 1; i<= nlarge; i++){
                r= rand();
                aa.insert( (r)%tsmall, i);
                debug(asrt) a[ (r)%tsmall]= i;
                debug assert( aa.retrieveRval( (r)%tsmall) == i);
                debug {
                    writef( a.length, " ", aa.length, " ", (r)%tsmall);
                    for( int d= aa.depth; d>0; d--)
                        writef( " ", aa.position( (r)%tsmall, d));
                    writefln( "");
                }
                debug assert( a.length == aa.length);
                debug foreach( uint k, uint v; a){
                    assert( aa.retrieveRval( k) == v);
                }
            }
            for( int i= 1; i<= nsmall; i++){
                r= rand();
                aa.insert( (r)%ttotal, i);
                debug(asrt) a[ (r)%ttotal]= i;
                debug assert( aa.retrieveRval( (r)%ttotal) == i);
                debug assert( a.length == aa.length());
                debug foreach( uint k, uint v; a){
                    assert( aa.retrieveRval( k) == v);
                }
            }
            debug(asrt) assert( a.length == aa.length());
            debug(asrt) foreach( uint k, uint v; a){
                assert( aa.retrieveRval( k) == v);
            }
            writef( aa.length(), " ");
        } else {
            uint r= void;
            for( int i= 1; i<= nlarge; i++){
                r= rand();
                ( (r)%tsmall, i);
            }
            for( int i= 1; i<= nsmall; i++){
                r= rand();
                ( (r)%ttotal, i);
            }
        }
    }
    time.stop();
    writefln( time.milliseconds());
}
Dec 02 2005
parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 If you have a succinct test example I could run
 it on I'll give it a shot, but it'll have a bit of an unfair
 advantage until I get around to finishing it.

I don't understand you fully. If you mean to give your implementation a try on my framework, here it is:

That's what I was asking for :-) I'll try to produce a C++ equivalent this weekend.

And for what it's worth, I mocked up a quick word-count test for std::map vs. my AA implementation. Here are the results of a run on a 5+ meg text file (the KJ Bible). Assume my unordered_map implementation is fixed at using 4097 hash buckets:

Testing std::map...
  Execution Time: 1.752000
  Unique Words: 61918
  Page Faults: 607
  Memory Used: 2490368

Testing unordered_map...
  Execution Time: 0.621000
  Unique Words: 61918
  Min Bucket Size: 3
  Max Bucket Size: 34
  Average Bucket Size: 15
  Page Faults: 338
  Memory Used: 1384448

Not bad at almost 1/3 the time of std::map, and it would be better if I added more buckets. For what it's worth, the main test program is available here:

http://home.f4.ca/sean/d/aa_test.cpp

Sean
Dec 03 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 Not bad at almost 1/3 the time of std::map

The test looks biased, because the burden of allocating memory to the process might fall on the call for std::map.

-manfred
Dec 03 2005
parent Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 Not bad at almost 1/3 the time of std::map

The test looks biased, because the burden of allocating memory to the process might fall on the call for std::map.

I just tried reversing the tests so the std::map test is performed second, and the results were the same. But as I said earlier, my AA implementation has a somewhat unfair advantage in that it doesn't rehash automatically. Sean
Dec 03 2005
prev sibling next sibling parent reply Dave <Dave_member pathlink.com> writes:
In article <Xns9720E101439BFsvv1999hotmailcom 63.105.9.61>, Manfred Nowak
says...
Dave wrote:

[...]
 Please post your results too.

Preliminary results after some tuning of the parameters of my scalable implementation:

size      ins      acc     t.W    t.me     gain
   4       16  9999984    1338    1137  -15,02%
   7      130  9999870    1345    1136  -15,54%
  10     1000  9999000    1395    1147  -17,78%
  14    16000  9984000    1554    1140  -26,64%
  17   130000  9870000    2092    1256  -39,96%
  20   970000  9030000    3354    1763  -47,44%
  22  2400000  7600000    7367    2183  -70,37%
  24  4900000  5100000   21227    3282  -84,54%
  26  8000000  2000000   58396   15691  -73,13%
  27  8900000  1100000  108875   51929  -52,30%
  28  9400000   600000  112160  193036   72,11%*

legend:
size := binary logarithm of the size of the abstract problem
ins  := number of insertions into the AA
acc  := number of accesses to elements of the AA
t.W  := total time in ms needed by Walter's implementation
t.me := total time in ms needed by my implementation
gain := relative time gain after subtracting 770 ms, which is the time
        needed with no accesses to AA's
*    := data insufficient

comments:

1. My implementation uses more space than Walter's. Therefore the time gain starts to stall at about 5M elements inserted, and it collapses when about 9M elements are inserted, based on the 2GB RAM of my machine.

2. One try with an extreme setting of the parameters of the implementation gave the hint that a time reduction to about 15% of Walter's implementation up to 5M elements might be possible, but dmd needed nearly an hour to produce the executable for that one try :-(

3. According to the answers in the thread, my implementation currently is not suitable for general use, because the abstract problem I used was restricted to fixed-length keys. I am experimenting with an appropriate extension.

-manfred

What D implementation (DMD or GDC) and version? Thanks, - Dave
Dec 02 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Dave wrote:
[...]
 What D implementation (DMD or GDC) and version?

Using DMD 0.140 under WinXP32.

Times reported are the lowest of three consecutive runs.
Times reported are the real times reported by the time command of cygwin.

Remark:
Although WinXP32 is limited to less than 4GB of RAM, the memory management seems to have difficulties delivering memory chunks when a single thread needs more than 1 GB, which is well below the absolute limit.

-manfred
Dec 02 2005
parent Dave <Dave_member pathlink.com> writes:
In article <Xns972166DCE52Asvv1999hotmailcom 63.105.9.61>, Manfred Nowak says...
Dave wrote:
[...]
 What D implementation (DMD or GDC) and version?

Using DMD 0.140 under WinXP32.

Cool - is this something you've actually plugged into internal/aaA.d already to test? If so post the code if you want, and I'll give it a shot with the benchmarks I sent earlier and post the results (I already have the input files handy, etc.). - Dave
Times reported are lowest of three consecutive runs.
Times reported are the real times reported by the time command of 
cygwin.

Remark:
Although WinXP32 is limited to less than 4GB of RAM, the memory 
management seem to have difficulties to deliver memory chunks, when a 
single thread needs more than 1 GB, which is far beyond the absolute 
limit.

-manfred

Dec 02 2005
prev sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Manfred Nowak wrote:

[...]
 Preliminary results after some tuning of the parameters of my 
 scalable implementation:

Some more results for using variable sized keys:

P.size      Ins      Acc     t.W    t.me   Cal     Gain
     4       16  9999984    4873    4773  3495   -7,26%
     5       32             4902    5032  3646   10,35%
     6       64             5098    5174  3775    5,74%
     7      130  9999870    5237    5410  4000   13,99%
    10     1000  9999000    5897    6065  4477   11,83%
    11     2000             6271    6398  4733    8,26%
    12     4100             6429    6502  4827    4,56%
    13     8200             6774    6821  5051    2,73%
    14    16000  9984000    7225    7139  5279   -4,42%
    17   130000  9870000   13056    8606  5844  -61,70%
    20   970000  9030000   17572   10993  6300  -58,37%
    22  2400000  7600000   44757   13709  6517  -81,19%
    24  4900000  5100000  138197   30409  7011  -82,16%
    26  8000000  2000000  245284  165341  7275  -33,59%

As one can see, for up to 10,000 elements this implementation is up to 15% slower, and it collapses at about 8,200,000 elements. In the range from 20,000 to 8,000,000 elements it is significantly faster.

How should this outcome be interpreted if fine tuning does not bring any advance in the range up to 10,000 elements?

-manfred
Dec 04 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Manfred Nowak wrote:
 Preliminary results after some tuning of the parameters of my 
 scalable implementation:

After fixing some bugs the gain table looks like this:

size     Gain
   4  -31,57%
   5  -28,42%
   6  -29,10%
   7  -14,63%
  10  -14,15%
  11  -22,89%
  12  -24,09%
  13  -22,29%
  14  -24,36%
  17  -68,07%
  20  -61,68%
  22  -82,17%
  24  -82,36%
  26  -33,69%

An additional problem is left for discussion:

Does it suffice to rely on the swapping algorithm of the underlying OS, or is it the role of the language to provide a solution when memory gets tight?

Usually out-of-memory errors and the like are thrown, but this forces every implementor not to trust the built-in language features and to build a custom implementation.

-manfred
Dec 04 2005
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 
 An additional problem is left for discussion:
 
 Does it suffice to rely on the swapping algorithm of the underlaying 
 OS or is it the role of the language to provide a solution when 
 memory gets tight?

Language-level functionality should provide the basic or weak exception guarantees, depending on context. Frankly, so should library code.

I ran into this problem with my unordered_map implementation--the rehash function is required to provide the weak guarantee (that the contents of the table will be unchanged if an exception is thrown), and since I was relying on a std::list to hold all the table elements, this meant that I would have to temporarily double memory usage by creating copies of all stored values, to ensure I could fall back on the original list if an insertion failed. The alternative would be to use a custom linked list where I could manipulate node pointers directly and therefore avoid ctor calls for creating/copying stored objects.
 Usually out-of-memory errors and the like are thrown, but this forces 
 every implementator to not trust the built-in language features and 
 build a custom implementation. 

See above. I consider this to be a documentation issue, as there should be explicit guarantees about data structure stability. However, since D objects are accessed via handles, the only issue should be OutOfMemory exceptions, and it's generally not too hard to preallocate needed memory before modifying data. Sean
Dec 04 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 However, since D objects are accessed via handles,
 the only issue should be OutOfMemory exceptions, and it's
 generally not too hard to preallocate needed memory 
 before modifying data.

I fear that your answer passes by the real problem without touching it.

I reported that the runtime collapsed under WinXP when the problem size approached real memory limits. WinXP reported 2.3 GB needed in total whereas only 2.0 GB real memory were available. That led to very long times, 20 seconds, during which WinXP was accessing the swapfile, shoveling about 600 MB around each time, interspersed with bursts of CPU activity.

No OutOfMemory exception was thrown. Maybe because the thread's memory consumption did not reach the 2.0 GB real memory limit. But maybe also because WinXP will report OutOfMemory only when the virtual address space of 1 TB is reached?

In fact I do not remember ever having had an OutOfMemory exception under WinXP, and I filed a bug report on that issue also.

-manfred
Dec 04 2005
parent Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 
 No OutOfMemory exception was thrown. Maybe because the threads memory 
 consumption did not reach the 2.0 GB real memory limit. But maybe 
 also, that winXP will report OutOfMemory only, when the virtual 
 address space of 1 TB is reached?
 
 In fact I do not remember having had any OutOfMemory exception under 
 winXP and filed a bug report on that issue also.

I've noticed this behavior also. In fact, I've gone close to the limits of a fixed-sized swapfile and the OS told me it was going to try and create more virtual memory. Though I didn't track things exactly, so I'm not entirely certain whether I ever actually exceeded my virtual memory limits or just passed the 1 GB barrier (my laptop has 512MB RAM and a 768MB swapfile IIRC).

Sean
Dec 04 2005
prev sibling parent reply Dave <Dave_member pathlink.com> writes:
In article <Xns9722ABBE9EA2Fsvv1999hotmailcom 63.105.9.61>, Manfred Nowak
says...
Manfred Nowak wrote:
 Preliminary results after some tuning of the parameters of my 
 scalable implementation:

After fixing some bugs the gain table looks like this:

Fixed key length like the first results, or variable like the second? If variable, what does the set look like? Is it a 'word frequency' test similar to Sean's test?

How about some numbers on the shootout benchmarks - is it too time consuming to run those? No need to instrument them to isolate just the AA operations, because it's important and meaningful to see how your implementation runs in the context of the other runtime overhead as well (you mentioned that it consumes quite a bit more memory than the current AA impl).

The code was posted publicly earlier, so everyone (or is it just you, me and Sean interested in this?) in this group could look at the numbers *and* the code and get a better idea of exactly what the benefits of your implementation are for different sets.
size	Gain
4	-31,57%
5	-28,42%
6	-29,10%
7	-14,63%
10	-14,15%
11	-22,89%
12	-24,09%
13	-22,29%
14	-24,36%
17	-68,07%
20	-61,68%
22	-82,17%
24	-82,36%
26	-33,69%

An additional problem is left for discussion:

Does it suffice to rely on the swapping algorithm of the underlaying 
OS or is it the role of the language to provide a solution when 
memory gets tight?

Usually out-of-memory errors and the like are thrown, but this forces 
every implementator to not trust the built-in language features and 
build a custom implementation. 

Does your impl. rely on the GC now? If so, I'd say you can leave these concerns to the language and GC implementors, which would be consistent with the rest of the D RT lib.
-manfred

Dec 04 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Dave wrote:

[...]
 Fixed key length like the 1st results, or variable like the
 second? If variable, what does the set look like? Is it a 'word
 frequency' test similiar to Sean's test?

Variable length keys. The framework used is posted here: digitalmars.D/30849

For the variable length keys test I replaced the assignments

! r= rand();
! aaa[ (r)%tsmall]= i;

with

! r= rand();
! sprintf( buf, toStringz("%d"), r%tsmall);
! aaa[ toString( buf)]= i;

and then adjusted the other calls and declarations.

The results of the framework show that the underlying abstract problem, which is chosen without any respect to a particular implementation, smooths very well from mostly accesses to mostly insertions.
 How about some numbers on the shootout benchmarks - Is it too
 time consuming to run those? No need to instrument them to
 isolate just the AA operations because it's important and
 meaningful to see how your implementation runs in the context 
 of the other runtime overhead as well (you mentioned that it
 consumes quite a bit more memory than the current AA impl).

I will do shootout benchmarks next.

The test runs show that the average memory consumption increases by about a factor of 7 while the time consumption decreases by up to a divisor of 5. Therefore even under the optimization criterion S*T**2 it seems to be a gain, since 7 * (1/5)**2 = 0.28 < 1 ;-)

[...]
 Does your impl. rely on the GC now? If so, I'd say you can leave
 these concerns to the language and GC implementors, which would
 be consistent with the rest of the D RT lib.

I do not understand this remark. Are you talking about management of the real memory, whilst I was asking about management of the virtual memory?

Hmmmm... AA's are at the core of the management of virtual memory, aren't they?

-manfred
Dec 04 2005
next sibling parent reply Dave <Dave_member pathlink.com> writes:
I do not understand this remark. Are you talking about management of 
the real memory whilst I was asking for the management of the virtual 
memory?

Hmmmm... AA's are the core of management of the virtual memory, 
aren't they?

-manfred 

After reading: digitalmars.D/30905

I'm assuming that you are using the GC for the 'internals' of your implementation, and now I have a better understanding of what you were getting at.

IMHO, it is best to let the GC handle these issues (like the rest of the D runtime library implementation), and let a future 'better' allocator/GC system account for exceptional memory conditions. That way the AA implementation will stay consistent with the rest, along with being simpler to develop, debug and maintain. The current GC might not address the situation you describe very well right now, but that can always be addressed in the future if it becomes a serious problem.
Dec 04 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Dave wrote:

[...]
 The current GC might not address the
 situation you describe very well right now, but that can always
 be addressed in the future if it becomes a serious problem.

After delivering a chunk of memory to the application, the GC isn't aware of that memory anymore, except when allocating more memory. Therefore a part of the time-consuming swapping might stem from the activities of the GC, but that is the minor part.

Every time the application accesses a segment currently swapped out, the OS will need to swap it in again. That is the major reason for swapping and has nothing to do with the GC. The GC does not know anything about the swapping mechanism of the underlying OS and therefore cannot be adapted to it.

But your argument, together with the necessary coupling of the GC with the virtual memory management, looks like every language with a GC must also implement a manager for the virtual memory.

-manfred
Dec 04 2005
next sibling parent Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 
 The GC does not know anything about the swapping mechanism of the 
 underlying OS and therefore cannot be adapted to it.

Not strictly true. Some of the fancier GCs use data from the virtual 
memory manager to avoid paging. Granted, this locks a GC 
implementation to a specific platform, but it's not unheard of.

Sean
Dec 05 2005
prev sibling parent Dave <Dave_member pathlink.com> writes:
In article <Xns97235303CDD71svv1999hotmailcom 63.105.9.61>, Manfred Nowak
says...
Dave wrote:

[...]
 The current GC might not address the
 situation you describe very well right now, but that can always
 be addressed in the future if it becomes a serious problem.

After delivering a chunk of memory to the application the GC isn't 
aware of that memory anymore, except when allocating more memory. 
Therefore a part of the time-consuming swapping might stem from the 
activities of the GC, but that is the minor part.

Every time the application accesses a segment currently swapped out, 
the OS will need to swap it in again. That is the major reason for 
swapping and has nothing to do with the GC. The GC does not know 
anything about the swapping mechanism of the underlying OS and 
therefore cannot be adapted to it.

But your argument, together with the necessary coupling of the GC with 
the virtual memory management, looks like every language with a GC 
must also implement a manager for the virtual memory.

-manfred

Ok - I gotcha now. The earlier term 'error' threw me a little bit, but 
basically you're talking about optimizing with regard to memory 
swapping because of its impact on performance, right? I hope so, 
because no one should have to think this hard right after a cold beer 
<g>

Temporal and spatial locality issues have been a big area of GC 
research for years now from what I gather. Of course, one of the big 
advantages of generational, compacting collectors is that they help to 
minimize swapping both during collection and (generally) give better 
locality for the current working set in the program itself. And even 
if those may not be the most efficient in terms of CPU instructions, 
they seem to be proving to be the best for CPU *cycle* efficiency.

From what I've read (and seen a bit of, actually), a non-generational, 
non-compacting GC like the current one for D would be the biggest 
'offender' for swapping and VM 'abuse' during collection, as it scans 
over what could be the entire memory space depending on the working 
set. An AA implementation that pre-allocates a really large amount of 
memory may actually be defeating the purpose - making the job of the 
GC harder and actually increasing swapping both during AA accesses and 
then when the collector kicks in.

It's also very possible that designing a virtual memory manager tuned 
to something specific like an AA implementation could actually be 
self-defeating in the context of a GC, and increase swapping as the 
VMM and GC compete for locality of the AA memory and live memory for 
other data in the program. All this to say that I don't think a 
separate virtual memory management system tuned for one part of the 
runtime system would really be worth the effort, and it might be 
self-defeating. The GC implementation is really the key, IMHO. The AA 
implementation should really play in the same sandbox as the rest <g>

One idea is to make the memory pre-allocation for your AA 
implementation reverse-logarithmic - as the AA set gets larger, the 
ratio of pre-allocated memory decreases. That should give really good 
performance for the more common, smaller sets and help to decrease 
swapping as the sets get larger. Plus it would help out the GC as it 
deals with other live data in the program.

- Dave
Dec 05 2005
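Dave's reverse-logarithmic pre-allocation idea could be sketched as a 
growth policy like the following (an illustrative Python sketch, not 
the actual AA code; the 64-slot floor and the decay constant 12.0 are 
made-up tuning values):

```python
import math

def prealloc_slots(n):
    # How many slots to reserve for a table currently holding n
    # entries. Small tables over-allocate generously; the ratio
    # decays roughly with log2(n), so large tables do not crowd
    # out the rest of the program's working set.
    if n < 16:
        return 64                       # flat floor for tiny tables
    ratio = 1.0 + 12.0 / math.log2(n)   # 4.0x at n=16, ~1.6x at n=1M
    return int(n * ratio)
```

The point is only the shape of the curve: the pre-allocation *ratio* 
shrinks as the set grows, so small AAs stay fast while huge AAs stop 
hogging physical memory.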
prev sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Manfred Nowak wrote:

[...]
 I will do shootout benchmarks next.

First results on hash.d:

  Size   t.curr    t.me   t.cal     Gain
     4     14,6      16    14,6
     7     14,7      16    14,7
    10     15,3    17,4    15,2
    11     16,7    18,4    15,9  212,50%
    12     18,9    21,5    17,2  152,94%
    14     32,5      40    26,5  125,00%
    15       61      62      40    4,76%
    16      108     106      65   -4,65%
    17      291     214     123  -45,83%
    20     2610    2015     909  -34,98%
    22    25094   17466    1917  -32,91%
    23    96496   69447    3950  -29,23%
    24  >180000  371049

Time for the current implementation is not measurable for parameter 
values less than 1000.

As already shown, my implementation is faster for larger amounts of 
stored data, but because of its increased memory allocation it reaches 
the limits of physical memory earlier and then collapses.

Some constants of the implementation can be adjusted so that losses in 
cases of fewer elements are less, but gains in cases of more elements 
are also lower. How to decide for the general case?

-manfred
Dec 07 2005
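For context, the shootout 'hash' test that hash.d appears to be based 
on has roughly this shape (a Python sketch; the exact workload of 
hash.d is an assumption on my part):

```python
import time

def hash_test(n):
    # Insert n hexadecimal-string keys, then probe the table with
    # decimal-string keys and count the hits. The decimal probes
    # only partly overlap the hex keys, so the test exercises both
    # successful and failing lookups.
    table = {}
    for i in range(1, n + 1):
        table[format(i, 'x')] = i
    hits = 0
    for i in range(n, 0, -1):
        if str(i) in table:
            hits += 1
    return hits

start = time.perf_counter()
hits = hash_test(1 << 15)          # the "Size" column above is a log2 exponent
elapsed = time.perf_counter() - start
```

A probe str(i) hits exactly when i's decimal digits, read as a hex 
number, are at most n, which is what makes the hit/miss mix stable 
across sizes.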
parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
In article <Xns9725F3778AE4Dsvv1999hotmailcom 63.105.9.61>, Manfred Nowak
says...
Manfred Nowak wrote:

[...]
 I will do shootout benchmarks next.

First results on hash.d:

  Size   t.curr    t.me   t.cal     Gain
     4     14,6      16    14,6
     7     14,7      16    14,7
    10     15,3    17,4    15,2
    11     16,7    18,4    15,9  212,50%
    12     18,9    21,5    17,2  152,94%
    14     32,5      40    26,5  125,00%
    15       61      62      40    4,76%
    16      108     106      65   -4,65%
    17      291     214     123  -45,83%
    20     2610    2015     909  -34,98%
    22    25094   17466    1917  -32,91%
    23    96496   69447    3950  -29,23%
    24  >180000  371049

Time for the current implementation is not measurable for parameter 
values less than 1000.

As already shown, my implementation is faster for larger amounts of 
stored data, but because of its increased memory allocation it reaches 
the limits of physical memory earlier and then collapses. How to 
decide for the general case?

Can't you just make the algorithm change at a certain size?

/Oskar
Dec 08 2005
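Oskar's suggestion amounts to a hybrid container: below some cut-off, 
use a compact small-set representation; past it, migrate to the 
heavier hash table. A minimal Python sketch of the idea (illustrative 
only, not the D runtime AA; the THRESHOLD value is an arbitrary 
assumption):

```python
THRESHOLD = 8   # assumed cut-off; its exact value is not critical

class HybridMap:
    # Below THRESHOLD entries: a flat list of (key, value) pairs,
    # compact and cache-friendly, with no hashing at all. Above it:
    # migrate once to a real hash table.
    def __init__(self):
        self.pairs = []      # small representation
        self.table = None    # large representation (after migration)

    def __setitem__(self, key, value):
        if self.table is not None:
            self.table[key] = value
            return
        for i, (k, _) in enumerate(self.pairs):
            if k == key:
                self.pairs[i] = (key, value)   # update in place
                return
        self.pairs.append((key, value))
        if len(self.pairs) > THRESHOLD:        # migrate at the cut-off
            self.table = dict(self.pairs)
            self.pairs = None

    def __getitem__(self, key):
        if self.table is not None:
            return self.table[key]
        for k, v in self.pairs:
            if k == key:
                return v
        raise KeyError(key)

    def __len__(self):
        return len(self.table) if self.table is not None else len(self.pairs)
```

The migration cost is paid once, at the cut-off, which is why (as 
Oskar notes below) the exact threshold tends not to matter much.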
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Oskar Linde wrote:

[...]
 Can't you just make the algorithm change at a certain size?

Maybe, but how to decide that point for the general case? Which 
measure should be taken?

-manfred
Dec 08 2005
parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Hi,

In article <Xns9726A6F56EEE0svv1999hotmailcom 63.105.9.61>, Manfred Nowak
says...
Oskar Linde wrote:

[...]
 Can't you just make the algorithm change at a certain size?

Maybe, but how to decide that point for the general case? Which measure should be taken?

It's extremely difficult to define a general case. I'd guess that in 
most cases, AAs are used for convenience, where neither speed nor size 
needs to be optimized. Then there are the specific uses with specific 
needs. It is impossible to make an optimal choice without knowing the 
case that should be optimized for. The goal for a general 
implementation should probably be to perform decently in many cases 
rather than excellently in a few. Performance tuning parameters are 
easy to change in the future.

I'd say, make your best guess at a general usage case, and tune the 
parameters to this case. Then verify your parameters against a number 
of different test cases (different hash functions: good/poor, 
slow/fast; different usage patterns: ratio of insertions, deletes and 
lookups; different sizes of key/value data; etc.). Make sure that 
there are no places where you are considerably worse than a reference 
implementation.

I think you will find that cut-off points between different algorithms 
are not very sensitive. They would probably be placed in the middle of 
a range where the algorithms perform similarly.

If by measure you mean time vs. space, I'd say that since a hash 
table's primary feature is time complexity, speed should be more 
important than memory usage, but there should also be a weighing 
between the two. A 10% speed increase at the cost of 100% more memory 
usage is probably not worth it for the general case.

(I feel like I've only written vague and obvious things. Apologies for 
the waste of bandwidth :)

Regards,

Oskar
Dec 08 2005
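The test matrix Oskar describes - varying the ratio of insertions, 
lookups and deletes - could be driven by a small harness along these 
lines (a Python sketch against the built-in dict; the operation mixes 
and key distribution are arbitrary assumptions, not anything from the 
thread):

```python
import random
import time

def run_workload(n_ops, mix, seed=0):
    # Drive a dict through a random mix of operations.
    # mix = (p_insert, p_lookup, p_delete), summing to 1.
    # Returns (elapsed seconds, final size).
    rng = random.Random(seed)
    m = {}
    start = time.perf_counter()
    for _ in range(n_ops):
        key = rng.randrange(n_ops // 4 + 1)   # keys collide on purpose
        r = rng.random()
        if r < mix[0]:
            m[key] = key                      # insert/update
        elif r < mix[0] + mix[1]:
            _ = m.get(key)                    # lookup, misses allowed
        else:
            m.pop(key, None)                  # delete if present
    return time.perf_counter() - start, len(m)

# e.g. compare an insert-heavy profile against a lookup-heavy one
t_ins, _ = run_workload(10000, (0.8, 0.15, 0.05))
t_look, _ = run_workload(10000, (0.1, 0.85, 0.05))
```

Running the same profiles against two candidate implementations, and 
sweeping the mix, is one way to check that no common usage pattern is 
"considerably worse than a reference implementation".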