
digitalmars.D - Refining AA's

reply Manfred Nowak <svv1999 hotmail.com> writes:
As walter stated, the current implementation is good for general 
usage.

But what are the parameters for general usage?

I ask because I just implemented a faster version according to the 
parameters suggested by me.

-manfred
Nov 30 2005
next sibling parent Chris <ctlajoie yahoo.com> writes:
On Wed, 30 Nov 2005 08:02:18 +0000 (UTC), Manfred Nowak
<svv1999 hotmail.com> wrote:
As walter stated, the current implementation is good for general 
usage.

But what are the parameters for general usage?

I ask because I just implemented a faster version according to the 
parameters suggested by me.

-manfred

I believe he said his hashtable was good enough for general use, but it also scales up better than most. I don't recall for sure though, and I haven't searched.

Chris
Nov 30 2005
prev sibling next sibling parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 As walter stated, the current implementation is good for general 
 usage.
 
 But what are the parameters for general usage?

I would say it's insertion and lookup of datasets of reasonable size (less than perhaps 50MB as a shot in the dark). For example, I typically use a word occurrence count for AA performance testing, and my "large" dataset is the full text of the bible (as it's the largest non-random text file I could find).

Sean
Nov 30 2005
next sibling parent reply pragma <pragma_member pathlink.com> writes:
In article <dmknqb$11aa$1 digitaldaemon.com>, Sean Kelly says...
Manfred Nowak wrote:
 As walter stated, the current implementation is good for general 
 usage.
 
 But what are the parameters for general usage?

I would say it's insertion and lookup of datasets of reasonable size (less than perhaps 50MB as a shot in the dark). For example, I typically use a word occurrence count for AA performance testing, and my "large" dataset is the full text of the bible (as it's the largest non-random text file I could find).

Sean,

On a whim, I tried looking for something bigger. As it turns out, "War and Peace" weighs in at 3.13MB (uncompressed ASCII) over at Project Gutenberg:

http://www.gutenberg.org/etext/2600

But it loses out to the complete King James Bible (4.24 MB):

http://www.gutenberg.org/etext/10

Alas, there's no way to search based on the size of an ebook.

I also just learned that they support texts in different languages and encodings. They have an "online recoding service" that can change the ASCII into UTF-8 or some random ISO codepage and so forth. Provided their converter is accurate, it could make for some nice test data.

- EricAnderton at yahoo
Nov 30 2005
parent reply Sean Kelly <sean f4.ca> writes:
pragma wrote:
 
 On a whim, I tried looking for something bigger. As it turns out, "War and
 Peace" weighs in at 3.13MB (uncompressed ascii) over at Project Gutenberg:
 
 http://www.gutenberg.org/etext/2600
 
 But it loses out to the complete King James Bible (4.24 MB):
 
 http://www.gutenberg.org/etext/10
 
 Alas, there's no way to search based on the size of an ebook.
 
 I also just learned that they support texts in different languages and
 encodings.  They have an "online recoding service" that can change the ASCII
 into UTF-8 or some random ISO codepage and so forth.  Provided their converter
 is accurate, it could make for some nice test data.

Yeah, the King James Bible from Project Gutenberg is my standard test input :-) Interesting about the UTF conversion though. I hadn't realized they offered that feature. Sean
Nov 30 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 the King James Bible from Project Gutenberg is my standard
 test input

Standard wc-example yields about 100 ms spent for the operations in the AA for the King James Bible. What are your criteria for performance, if this is indeed the largest possible test data? -manfred
Dec 02 2005
parent Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 the King James Bible from Project Gutenberg is my standard
 test input

Standard wc-example yields about 100 ms spent for the operations in the AA for the King James Bible. What are your criteria for performance, if this is indeed the largest possible test data?

I was cleaning up some directories about a week ago and lost my AA test in the process. I'll rewrite something when I find some free time. But the metrics I was comparing were memory usage, cache misses, and test execution time, with execution time obviously being the most important factor. I didn't make any attempt to reduce or eliminate spurious memory allocations, though when I re-do the tests I'm going to use a Slice class I wrote in C++ instead of std::string. Sean
Dec 02 2005
prev sibling next sibling parent "Dave" <Dave_member pathlink.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message 
news:dmknqb$11aa$1 digitaldaemon.com...
 Manfred Nowak wrote:
 As walter stated, the current implementation is good for general usage.

 But what are the parameters for general usage?

I would say it's insertion and lookup of datasets of reasonable size (less than perhaps 50MB as a shot in the dark). For example, I typically use a word occurrence count for AA performance testing, and my "large" dataset is the full text of the bible (as it's the largest non-random text file I could find).

Sean

That's a good one as a 'stress test', IMO. FWIW, the way I generally use them is at most 100 or so insertions followed by lots and lots of lookups and/or value updates. That would be a good "common case" test, maybe, and it also stresses the most important part - lookups (since lookups are obviously used for insertions and updates too) - while leaving the GC out of the results.
Nov 30 2005
prev sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 (as it's the largest non-random text file I could
 find). 

I do not understand why randomness is not suitable for general usage.

-manfred
Dec 02 2005
parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 (as it's the largest non-random text file I could
 find). 

I do not understand why randomness is not suitable for general usage.

I wanted to see results given a real-world example. Using a massive dataset of randomly generated gibberish, i.e. "jkhfdsa jksvldijsd sajasd d asdlk wpejvccx...", seems like it would skew results slightly by reducing the chance of collisions. I suppose a fancier method might use a Markov generator to create arbitrarily large pseudo-English text to process, but I was only testing casually and didn't want to spend the time writing the code.

Sean
Dec 02 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 a fancier method might use a Markov generator to create
 arbitrarily large pseudo-English text to process

Word counts do not capture anything about a language except the word frequencies and the corresponding word lengths. You can get these from the known language institutes and build a test of any size you wish, by randomizing over those frequencies and ...

But I still do not understand why word counting any example from a natural language could be a measure for the performance of an AA in the general case.

-manfred
Dec 02 2005
parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 a fancier method might use a Markov generator to create
 arbitrarily large pseudo-English text to process

Word counts do not capture anything about a language except the word frequencies and the corresponding word lengths. You can get these from the known language institutes and build a test of any size you wish, by randomizing over those frequencies and ... But I still do not understand why word counting any example from a natural language could be a measure for the performance of an AA in the general case.

Only because it's an example of typical use, so it seems a reasonable metric to use. Sean
Dec 02 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 why word counting any example
 from a natural language could be a measure for the performance
 of an AA in the general case.

Only because it's an example of typical use, so it seems a reasonable metric to use.

What other examples of typical usage do you know, and how should the measures taken for those examples be weighted?

-manfred
Dec 03 2005
parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 why word counting any example
 from a natural language could be a measure for the performance
 of an AA in the general case.

reasonable metric to use.

What other examples of typical usage do you know, and how should the measures taken for those examples be weighted?

I suppose I should have qualified that. It's an example of how *I* typically use hashtables. For an informal test, I'd rather have one that was meaningful to me. Sean
Dec 03 2005
parent Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 It's an example of how *I* typically use hashtables.

Thx. You suggest a word count for a text taken from a natural language. Dave suggests some examples from the shootout. I suggest an abstract sizeable problem. Walter surely has some more examples.

Then how should I build up a scenario consisting of examples individuals personally prefer, and weight them to get performance differences most agree upon?

-manfred
Dec 03 2005
prev sibling parent reply "Dave" <Dave_member pathlink.com> writes:
"Manfred Nowak" <svv1999 hotmail.com> wrote in message 
news:Xns971E5BF112D92svv1999hotmailcom 63.105.9.61...
 As walter stated, the current implementation is good for general
 usage.

 But what are the parameters for general usage?

 I ask because I just implemented a faster version according to the
 parameters suggested by me.

 -manfred

Attached are the "Shootout Benchmark" tests involving AA's, both past and present. They've been written to minimize GC thrashing, so they are a good test of just the AA implementation for common AA's, and they generally emphasize different parts of the ABI so as to give a good indication of overall performance, I think. Plus, by looking at how D currently does compared to other languages on the Shootout site, we could extrapolate how your new implementation looks against the other languages too.

Look over at: http://shootout.alioth.debian.org/ to find out how they are currently run and for input files.

Please post your results too. (P.S.: v138 and v139 of phobos were apparently compiled without optimizations and/or with debug symbols, so you need to use v140 for a fair comparison).

Thanks,

- Dave
Nov 30 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Dave wrote:

[...]
 Please post your results too.

Preliminary results after some tuning of the parameters of my scalable implementation:

size      ins      acc     t.W    t.me     gain
   4       16  9999984    1338    1137  -15,02%
   7      130  9999870    1345    1136  -15,54%
  10     1000  9999000    1395    1147  -17,78%
  14    16000  9984000    1554    1140  -26,64%
  17   130000  9870000    2092    1256  -39,96%
  20   970000  9030000    3354    1763  -47,44%
  22  2400000  7600000    7367    2183  -70,37%
  24  4900000  5100000   21227    3282  -84,54%
  26  8000000  2000000   58396   15691  -73,13%
  27  8900000  1100000  108875   51929  -52,30%
  28  9400000   600000  112160  193036   72,11%*

legend:
size := binary logarithm of the size of the abstract problem
ins  := number of insertions into the AA
acc  := number of accesses to elements of the AA
t.W  := total time in ms needed by Walter's implementation
t.me := total time in ms needed by my implementation
gain := relative time gain after subtracting 770 ms, which is the time
        needed with no accesses to AA's
*    := data insufficient

comments:

1. My implementation uses more space than Walter's. Therefore the time gain starts to stall at about 5M elements inserted, and it collapses when about 9M elements are inserted, based on the 2GB RAM of my machine.

2. One try with an extreme setting of the parameters of the implementation gave the hint that a time reduction to about 15% of Walter's implementation up to 5M elements might be possible, but dmd needed nearly an hour to produce the executable for that one try :-(

3. According to the answers in the thread, my implementation currently is not suitable for general use, because the abstract problem I used was restricted to fixed-length keys. I am experimenting with an appropriate extension.

-manfred
Dec 02 2005
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 
 3. According to the answers in the thread my implementation 
 currently is not suitable for general use, because the abstract 
 problem I used was restricted to fixed length keys. I am 
 experimenting with an appropriate extension.

My unordered_set implementation doesn't support rehashing yet--I wrote it for use in a compiler and didn't have the time to bother with it. If you have a succinct test example I could run it on I'll give it a shot, but it'll have a bit of an unfair advantage until I get around to finishing it. Sean
Dec 02 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 If you have a succinct test example I could run
 it on I'll give it a shot, but it'll have a bit of an unfair
 advantage until I get around to finishing it.

I don't understand you fully. If you mean to give your implementation a try on my framework, here it is:

-manfred

/+
  The abstract problem asks for the time needed for n value changes for
  keys somehow randomly chosen out of the range [0 .. s-1].

  The abstract problem has two parameters, n and s.

  n is the total number of operations on the AA. Initially I set it so
  that the resulting total time was within a reasonable limit, i.e. 60 s.
  But it turned out to be also 10% more than the number of elements that
  fit into the RAM of the target machine. Therefore the latter
  approximation should be chosen.

  s is the size of the space from which the keys are drawn. It is
  reasonable to choose s to be a power of 2.

  The "somehow" in the description above reduces the randomness of the
  draw according to the well known 80:20 rule, i.e. 80% of the
  operations are executed on 20% of the key space.

  Compilation parameters:
  -version=a             run will give the time needed without any
                         accesses to AA
  -version=aaa           run will give the time needed for the current
                         implementation of the AA
  -version=aa            run will give the time needed for the candidate
  -version=aa -debug=asrt  will test the candidate's correctness
                         against the current implementation
+/
void main(){
    alias HighPerformanceCounter Timer;
    Timer time;
    time= new Timer;
    const uint ttotal= 64*1024*1024,   // Problem size s
               tsmall= ttotal/ 5;      // i.e. the target space
    const uint ntotal= 10_000_000,     // total number of operations n
               nsmall= ntotal/4,
               nlarge= ntotal-nsmall;
    time.start();
    version( aaa){
        uint[ uint] aaa;
        uint r= void;
        for( int i= 1; i<= nlarge; i++){
            r= rand();
            aaa[ (r)%tsmall]= i;
        }
        for( int i= 1; i<= nsmall; i++){
            r= rand();
            aaa[ (r)%ttotal]= i;
        }
        writef( aaa.length, " ");
    } else {
        version( aa){
            AA aa= new AA;
            debug(asrt) uint[ uint] a;
            uint r= void;
            for( int i= 1; i<= nlarge; i++){
                r= rand();
                aa.insert( (r)%tsmall, i);
                debug(asrt) a[ (r)%tsmall]= i;
                debug assert( aa.retrieveRval( (r)%tsmall) == i);
                debug {
                    writef( a.length, " ", aa.length, " ", (r)%tsmall);
                    for( int d= aa.depth; d>0; d--)
                        writef( " ", aa.position( (r)%tsmall, d));
                    writefln( "");
                }
                debug assert( a.length == aa.length);
                debug foreach( uint k, uint v; a){
                    assert( aa.retrieveRval( k) == v);
                }
            }
            for( int i= 1; i<= nsmall; i++){
                r= rand();
                aa.insert( (r)%ttotal, i);
                debug(asrt) a[ (r)%ttotal]= i;
                debug assert( aa.retrieveRval( (r)%ttotal) == i);
                debug assert( a.length == aa.length());
                debug foreach( uint k, uint v; a){
                    assert( aa.retrieveRval( k) == v);
                }
            }
            debug(asrt) assert( a.length == aa.length());
            debug(asrt) foreach( uint k, uint v; a){
                assert( aa.retrieveRval( k) == v);
            }
            writef( aa.length(), " ");
        } else {
            uint r= void;
            for( int i= 1; i<= nlarge; i++){
                r= rand();
                ( (r)%tsmall, i);
            }
            for( int i= 1; i<= nsmall; i++){
                r= rand();
                ( (r)%ttotal, i);
            }
        }
    }
    time.stop();
    writefln( time.milliseconds());
}
Dec 02 2005
parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 If you have a succinct test example I could run
 it on I'll give it a shot, but it'll have a bit of an unfair
 advantage until I get around to finishing it.

I don't understand you fully. If you mean to give your implementation a try on my framework, here it is:

That's what I was asking for :-) I'll try to produce a C++ equivalent this weekend.

And for what it's worth, I mocked up a quick word-count test for std::map vs. my AA implementation. Here are the results of a run on a 5+ meg text file (the KJ Bible). Assume my unordered_map implementation is fixed at using 4097 hash buckets:

Testing std::map...
  Execution Time: 1.752000
  Unique Words: 61918
  Page Faults: 607
  Memory Used: 2490368

Testing unordered_map...
  Execution Time: 0.621000
  Unique Words: 61918
  Min Bucket Size: 3
  Max Bucket Size: 34
  Average Bucket Size: 15
  Page Faults: 338
  Memory Used: 1384448

Not bad at almost 1/3 the time of std::map, and it would be better if I added more buckets. For what it's worth, the main test program is available here:

http://home.f4.ca/sean/d/aa_test.cpp

Sean
Dec 03 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 Not bad at almost 1/3 the time of std::map

The test looks biased, because the burden of allocating memory to the process might fall on the call for std::map.

-manfred
Dec 03 2005
parent Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 Sean Kelly wrote:
 
 [...]
 Not bad at almost 1/3 the time of std::map

The test looks biased, because the burden of allocating memory to the process might fall on the call for std::map.

I just tried reversing the tests so the std::map test is performed second, and the results were the same. But as I said earlier, my AA implementation has a somewhat unfair advantage in that it doesn't rehash automatically. Sean
Dec 03 2005
prev sibling next sibling parent reply Dave <Dave_member pathlink.com> writes:
In article <Xns9720E101439BFsvv1999hotmailcom 63.105.9.61>, Manfred Nowak
says...
Dave wrote:

[...]
 Please post your results too.

Preliminary results after some tuning of the parameters of my scalable implementation:

size      ins      acc     t.W    t.me     gain
   4       16  9999984    1338    1137  -15,02%
   7      130  9999870    1345    1136  -15,54%
  10     1000  9999000    1395    1147  -17,78%
  14    16000  9984000    1554    1140  -26,64%
  17   130000  9870000    2092    1256  -39,96%
  20   970000  9030000    3354    1763  -47,44%
  22  2400000  7600000    7367    2183  -70,37%
  24  4900000  5100000   21227    3282  -84,54%
  26  8000000  2000000   58396   15691  -73,13%
  27  8900000  1100000  108875   51929  -52,30%
  28  9400000   600000  112160  193036   72,11%*

legend:
size := binary logarithm of the size of the abstract problem
ins  := number of insertions into the AA
acc  := number of accesses to elements of the AA
t.W  := total time in ms needed by Walter's implementation
t.me := total time in ms needed by my implementation
gain := relative time gain after subtracting 770 ms, which is the time
        needed with no accesses to AA's
*    := data insufficient

comments:

1. My implementation uses more space than Walter's. Therefore the time gain starts to stall at about 5M elements inserted, and it collapses when about 9M elements are inserted, based on the 2GB RAM of my machine.

2. One try with an extreme setting of the parameters of the implementation gave the hint that a time reduction to about 15% of Walter's implementation up to 5M elements might be possible, but dmd needed nearly an hour to produce the executable for that one try :-(

3. According to the answers in the thread, my implementation currently is not suitable for general use, because the abstract problem I used was restricted to fixed-length keys. I am experimenting with an appropriate extension.

-manfred

What D implementation (DMD or GDC) and version? Thanks, - Dave
Dec 02 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Dave wrote:
[...]
 What D implementation (DMD or GDC) and version?

Using DMD 0.140 under WinXP32.

Times reported are the lowest of three consecutive runs.
Times reported are the real times reported by the time command of cygwin.

Remark:
Although WinXP32 is limited to less than 4GB of RAM, the memory management seems to have difficulties delivering memory chunks when a single thread needs more than 1 GB, which is well below the absolute limit.

-manfred
Dec 02 2005
parent Dave <Dave_member pathlink.com> writes:
In article <Xns972166DCE52Asvv1999hotmailcom 63.105.9.61>, Manfred Nowak says...
Dave wrote:
[...]
 What D implementation (DMD or GDC) and version?

Using DMD 0.140 under WinXP32.

Cool - is this something you've actually plugged into internal/aaA.d already to test? If so post the code if you want, and I'll give it a shot with the benchmarks I sent earlier and post the results (I already have the input files handy, etc.). - Dave
Times reported are lowest of three consecutive runs.
Times reported are the real times reported by the time command of 
cygwin.

Remark:
Although WinXP32 is limited to less than 4GB of RAM, the memory 
management seem to have difficulties to deliver memory chunks, when a 
single thread needs more than 1 GB, which is far beyond the absolute 
limit.

-manfred

Dec 02 2005
prev sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Manfred Nowak wrote:

[...]
 Preliminary results after some tuning of the parameters of my 
 scalable implementation:

Some more results for using variable sized keys:

P.size      Ins      Acc     t.W    t.me   Cal     Gain
     4       16  9999984    4873    4773  3495   -7,26%
     5       32             4902    5032  3646   10,35%
     6       64             5098    5174  3775    5,74%
     7      130  9999870    5237    5410  4000   13,99%
    10     1000  9999000    5897    6065  4477   11,83%
    11     2000             6271    6398  4733    8,26%
    12     4100             6429    6502  4827    4,56%
    13     8200             6774    6821  5051    2,73%
    14    16000  9984000    7225    7139  5279   -4,42%
    17   130000  9870000   13056    8606  5844  -61,70%
    20   970000  9030000   17572   10993  6300  -58,37%
    22  2400000  7600000   44757   13709  6517  -81,19%
    24  4900000  5100000  138197   30409  7011  -82,16%
    26  8000000  2000000  245284  165341  7275  -33,59%

As one can see, for up to 10,000 elements this implementation is up to 15% slower, and it collapses at about 8,200,000 elements. In the range from 20,000 to 8,000,000 elements it is significantly faster.

How should this outcome be interpreted if fine tuning does not bring any advance in the range up to 10,000 elements?

-manfred
Dec 04 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Manfred Nowak wrote:
 Preliminary results after some tuning of the parameters of my 
 scalable implementation:

After fixing some bugs the gain table looks like this:

size     Gain
   4  -31,57%
   5  -28,42%
   6  -29,10%
   7  -14,63%
  10  -14,15%
  11  -22,89%
  12  -24,09%
  13  -22,29%
  14  -24,36%
  17  -68,07%
  20  -61,68%
  22  -82,17%
  24  -82,36%
  26  -33,69%

An additional problem is left for discussion:

Does it suffice to rely on the swapping algorithm of the underlying OS, or is it the role of the language to provide a solution when memory gets tight?

Usually out-of-memory errors and the like are thrown, but this forces every implementor not to trust the built-in language features and to build a custom implementation.

-manfred
Dec 04 2005
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 
 An additional problem is left for discussion:
 
 Does it suffice to rely on the swapping algorithm of the underlaying 
 OS or is it the role of the language to provide a solution when 
 memory gets tight?

Language-level functionality should provide the basic or weak exception guarantees, depending on context. Frankly, so should library code.

I ran into this problem with my unordered_map implementation--the rehash function is required to provide the weak guarantee (that the contents of the table will be unchanged if an exception is thrown), and since I was relying on a std::list to hold all the table elements, this meant that I would have to temporarily double memory usage by creating copies of all stored values, to ensure I could fall back on the original list if an insertion failed. The alternative would be to use a custom linked list where I could manipulate node pointers directly and therefore avoid ctor calls for creating/copying stored objects.
 Usually out-of-memory errors and the like are thrown, but this forces 
 every implementator to not trust the built-in language features and 
 build a custom implementation. 

See above. I consider this to be a documentation issue, as there should be explicit guarantees about data structure stability. However, since D objects are accessed via handles, the only issue should be OutOfMemory exceptions, and it's generally not too hard to preallocate needed memory before modifying data. Sean
Dec 04 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Sean Kelly wrote:

[...]
 However, since D objects are accessed via handles,
 the only issue should be OutOfMemory exceptions, and it's
 generally not too hard to preallocate needed memory 
 before modifying data.

I fear that your answer passes by the real problem without touching it.

I reported that the runtime collapsed under WinXP when the problem size approached real memory limits. WinXP reported 2.3 GB needed in total whereas only 2.0 GB real memory were available. That led to very long times, 20 seconds, during which WinXP was accessing the swapfile, shoveling about 600 MB around each time, interspersed with bursts of CPU activity.

No OutOfMemory exception was thrown. Maybe because the thread's memory consumption did not reach the 2.0 GB real memory limit. But maybe also because WinXP will report OutOfMemory only when the virtual address space of 1 TB is reached?

In fact I do not remember ever having had an OutOfMemory exception under WinXP, and I filed a bug report on that issue also.

-manfred
Dec 04 2005
parent Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 
 No OutOfMemory exception was thrown. Maybe because the threads memory 
 consumption did not reach the 2.0 GB real memory limit. But maybe 
 also, that winXP will report OutOfMemory only, when the virtual 
 address space of 1 TB is reached?
 
 In fact I do not remember having had any OutOfMemory exception under 
 winXP and filed a bug report on that issue also.

I've noticed this behavior also. In fact, I've gone close to the limits of a fixed-sized swapfile and the OS told me it was going to try and create more virtual memory. Though I didn't track things exactly, so I'm not entirely certain whether I ever actually exceeded my virtual memory limits or just passed the 1 GB barrier (my laptop has 512MB RAM and a 768MB swapfile IIRC).

Sean
Dec 04 2005
prev sibling parent reply Dave <Dave_member pathlink.com> writes:
In article <Xns9722ABBE9EA2Fsvv1999hotmailcom 63.105.9.61>, Manfred Nowak
says...
Manfred Nowak wrote:
 Preliminary results after some tuning of the parameters of my 
 scalable implementation:

After fixing some bugs the gain table looks like this:

Fixed key length like the first results, or variable like the second? If variable, what does the set look like? Is it a 'word frequency' test similar to Sean's test?

How about some numbers on the shootout benchmarks - is it too time consuming to run those? No need to instrument them to isolate just the AA operations, because it's important and meaningful to see how your implementation runs in the context of the other runtime overhead as well (you mentioned that it consumes quite a bit more memory than the current AA impl).

The code was posted publicly earlier, so everyone (or is it just you, me and Sean interested in this?) in this group could look at the numbers *and* the code and get a better idea of exactly what the benefits of your implementation are for different sets.
size	Gain
4	-31,57%
5	-28,42%
6	-29,10%
7	-14,63%
10	-14,15%
11	-22,89%
12	-24,09%
13	-22,29%
14	-24,36%
17	-68,07%
20	-61,68%
22	-82,17%
24	-82,36%
26	-33,69%

An additional problem is left for discussion:

Does it suffice to rely on the swapping algorithm of the underlaying 
OS or is it the role of the language to provide a solution when 
memory gets tight?

Usually out-of-memory errors and the like are thrown, but this forces 
every implementator to not trust the built-in language features and 
build a custom implementation. 

Does your impl. rely on the GC now? If so, I'd say you can leave these concerns to the language and GC implementors, which would be consistent with the rest of the D RT lib.
-manfred

Dec 04 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Dave wrote:

[...]
 Fixed key length like the 1st results, or variable like the
 second? If variable, what does the set look like? Is it a 'word
 frequency' test similiar to Sean's test?

Variable length keys. The framework used is posted here: digitalmars.D/30849

For the variable length keys test I replaced the assignments

! r= rand();
! aaa[ (r)%tsmall]= i;

with

! r= rand();
! sprintf( buf, toStringz("%d"), r%tsmall);
! aaa[ toString( buf)]= i;

and then adjusted the other calls and declarations.

The results of the framework show that the underlying abstract problem, which is chosen without any respect to a particular implementation, smooths very well from mostly accesses to mostly insertions.
 How about some numbers on the shootout benchmarks - Is it too
 time consuming to run those? No need to instrument them to
 isolate just the AA operations because it's important and
 meaningful to see how your implementation runs in the context 
 of the other runtime overhead as well (you mentioned that it
 consumes quite a bit more memory than the current AA impl).

I will do shootout benchmarks next.

The test runs show that the average memory consumption increases by about a factor of 7 while the time consumption decreases by up to a divisor of 5. Therefore even under the optimization criterion S*T**2 it seems to be a gain, since 7 * (1/5)**2 = 0.28 < 1 ;-)

[...]
 Does your impl. rely on the GC now? If so, I'd say you can leave
 these concerns to the language and GC implementors, which would
 be consistent with the rest of the D RT lib.

I do not understand this remark. Are you talking about management of the real memory, whilst I was asking about management of the virtual memory?

Hmmmm... AA's are at the core of the management of virtual memory, aren't they?

-manfred
Dec 04 2005
next sibling parent reply Dave <Dave_member pathlink.com> writes:
I do not understand this remark. Are you talking about management of 
the real memory whilst I was asking for the management of the virtual 
memory?

Hmmmm... AA's are the core of management of the virtual memory, 
aren't they?

-manfred 

After reading: digitalmars.D/30905

I'm assuming that you are using the GC for the 'internals' of your implementation, and now I have a better understanding of what you were getting at.

IMHO, it is best to let the GC handle these issues (like the rest of the D runtime library implementation), and let a future 'better' allocator/GC system account for exceptional memory conditions. That way the AA implementation will stay consistent with the rest, along with being simpler to develop, debug and maintain. The current GC might not address the situation you describe very well right now, but that can always be addressed in the future if it becomes a serious problem.
Dec 04 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Dave wrote:

[...]
 The current GC might not address the
 situation you describe very well right now, but that can always
 be addressed in the future if it becomes a serious problem.

After delivering a chunk of memory to the application, the GC isn't aware of that memory anymore, except when allocating more memory. Therefore a part of the time-consuming swapping might stem from the activities of the GC, but that is the minor part.

Every time the application accesses a segment currently swapped out, the OS will need to swap it in again. That is the major reason for swapping and has nothing to do with the GC. The GC does not know anything about the swapping mechanism of the underlying OS and therefore cannot be adapted to it.

But your argument, together with the necessary coupling of the GC with the virtual memory management, looks like every language with a GC must also implement a manager for the virtual memory.

-manfred
Dec 04 2005
next sibling parent Sean Kelly <sean f4.ca> writes:
Manfred Nowak wrote:
 
 The GC does not know anything about the swapping mechanism of the 
 underlying OS and therefore cannot be adapted to it.

Not strictly true. Some of the fancier GCs use data from the virtual 
memory manager to avoid paging. Granted, this locks a GC 
implementation to a specific platform, but it's not unheard of.

Sean
Dec 05 2005
prev sibling parent Dave <Dave_member pathlink.com> writes:
In article <Xns97235303CDD71svv1999hotmailcom 63.105.9.61>, Manfred Nowak
says...
Dave wrote:

[...]
 The current GC might not address the
 situation you describe very well right now, but that can always
 be addressed in the future if it becomes a serious problem.

After delivering a chunk of memory to the application the GC isn't 
aware of that memory anymore, except when allocating more memory. 
Therefore a part of the time-consuming swapping might stem from the 
activities of the GC, but that is the minor part.

Every time the application accesses a segment currently swapped out, 
the OS will need to swap it in again. That is the major reason for 
swapping and has nothing to do with the GC. The GC does not know 
anything about the swapping mechanism of the underlying OS and 
therefore cannot be adapted to it.

But your argument, together with the necessary coupling of the GC with 
the virtual memory management, looks like every language with a GC 
must also implement a manager for the virtual memory.

-manfred

Ok - I gotcha now. The earlier term 'error' threw me a little bit, but 
basically you're talking about optimizing with regard to memory 
swapping because of its impact on performance, right? I hope so, 
because no one should have to think this hard right after a cold beer 
<g>

Temporal and spatial locality issues have been a big area of GC 
research for years now from what I gather. Of course, one of the big 
advantages of generational, compacting collectors is that they help to 
minimize swapping both during collection and (generally) give better 
locality for the current working set in the program itself. And even 
if those may not be the most efficient in terms of CPU instructions, 
they seem to be proving to be the best for CPU *cycle* efficiency.

From what I've read (and seen a bit of, actually), a non-generational, 
non-compacting GC like the current one for D would be the biggest 
'offender' for swapping and VM 'abuse' during collection, as it scans 
over what could be the entire memory space depending on the working 
set. An AA implementation that pre-allocates a really large amount of 
memory may actually be defeating the purpose - making the job of the 
GC harder and actually increasing swapping both during AA accesses and 
then when the collector kicks in.

It's also very possible that designing a virtual memory manager tuned 
to something specific like an AA implementation could actually be 
self-defeating in the context of a GC, and increase swapping as the 
VMM and GC compete for locality of the AA memory and live memory for 
other data in the program. All this to say that I don't think a 
separate virtual memory management system tuned for one part of the 
runtime system would really be worth the effort, and it might be 
self-defeating. The GC implementation is really the key, IMHO. The AA 
implementation should really play in the same sandbox as the rest <g>

One idea is to make the memory pre-allocation for your AA 
implementation reverse-logarithmic - as the AA set gets larger, the 
ratio of pre-allocated memory decreases. That should give really good 
performance for the more common, smaller sets and help to decrease 
swapping as the sets get larger. Plus it would help out the GC as it 
deals with other live data in the program.

- Dave
Dec 05 2005
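Dave's reverse-logarithmic pre-allocation idea could be sketched as a 
growth policy like the following (an illustrative Python sketch, not 
the actual AA code; the 64-slot floor and the decay constant 12.0 are 
made-up tuning values):

```python
import math

def prealloc_slots(n):
    # How many slots to reserve for a table currently holding n
    # entries. Small tables over-allocate generously; the ratio
    # decays roughly with log2(n), so large tables do not crowd
    # out the rest of the program's working set.
    if n < 16:
        return 64                       # flat floor for tiny tables
    ratio = 1.0 + 12.0 / math.log2(n)   # 4.0x at n=16, ~1.6x at n=1M
    return int(n * ratio)
```

The point is only the shape of the curve: the pre-allocation *ratio* 
shrinks as the set grows, so small AAs stay fast while huge AAs stop 
hogging physical memory.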
prev sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Manfred Nowak wrote:

[...]
 I will do shootout benchmarks next.

First results on hash.d:

  Size   t.curr    t.me   t.cal     Gain
     4     14,6      16    14,6
     7     14,7      16    14,7
    10     15,3    17,4    15,2
    11     16,7    18,4    15,9  212,50%
    12     18,9    21,5    17,2  152,94%
    14     32,5      40    26,5  125,00%
    15       61      62      40    4,76%
    16      108     106      65   -4,65%
    17      291     214     123  -45,83%
    20     2610    2015     909  -34,98%
    22    25094   17466    1917  -32,91%
    23    96496   69447    3950  -29,23%
    24  >180000  371049

Time for the current implementation is not measurable for parameter 
values less than 1000.

As already shown, my implementation is faster for larger amounts of 
stored data, but because of its increased memory allocation it reaches 
the limits of physical memory earlier and then collapses.

Some constants of the implementation can be adjusted so that losses in 
cases of fewer elements are less, but gains in cases of more elements 
are also lower. How to decide for the general case?

-manfred
Dec 07 2005
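For context, the shootout 'hash' test that hash.d appears to be based 
on has roughly this shape (a Python sketch; the exact workload of 
hash.d is an assumption on my part):

```python
import time

def hash_test(n):
    # Insert n hexadecimal-string keys, then probe the table with
    # decimal-string keys and count the hits. The decimal probes
    # only partly overlap the hex keys, so the test exercises both
    # successful and failing lookups.
    table = {}
    for i in range(1, n + 1):
        table[format(i, 'x')] = i
    hits = 0
    for i in range(n, 0, -1):
        if str(i) in table:
            hits += 1
    return hits

start = time.perf_counter()
hits = hash_test(1 << 15)          # the "Size" column above is a log2 exponent
elapsed = time.perf_counter() - start
```

A probe str(i) hits exactly when i's decimal digits, read as a hex 
number, are at most n, which is what makes the hit/miss mix stable 
across sizes.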
parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
In article <Xns9725F3778AE4Dsvv1999hotmailcom 63.105.9.61>, Manfred Nowak
says...
Manfred Nowak wrote:

[...]
 I will do shootout benchmarks next.

First results on hash.d:

  Size   t.curr    t.me   t.cal     Gain
     4     14,6      16    14,6
     7     14,7      16    14,7
    10     15,3    17,4    15,2
    11     16,7    18,4    15,9  212,50%
    12     18,9    21,5    17,2  152,94%
    14     32,5      40    26,5  125,00%
    15       61      62      40    4,76%
    16      108     106      65   -4,65%
    17      291     214     123  -45,83%
    20     2610    2015     909  -34,98%
    22    25094   17466    1917  -32,91%
    23    96496   69447    3950  -29,23%
    24  >180000  371049

Time for the current implementation is not measurable for parameter 
values less than 1000.

As already shown, my implementation is faster for larger amounts of 
stored data, but because of its increased memory allocation it reaches 
the limits of physical memory earlier and then collapses. How to 
decide for the general case?

Can't you just make the algorithm change at a certain size?

/Oskar
Dec 08 2005
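Oskar's suggestion amounts to a hybrid container: below some cut-off, 
use a compact small-set representation; past it, migrate to the 
heavier hash table. A minimal Python sketch of the idea (illustrative 
only, not the D runtime AA; the THRESHOLD value is an arbitrary 
assumption):

```python
THRESHOLD = 8   # assumed cut-off; its exact value is not critical

class HybridMap:
    # Below THRESHOLD entries: a flat list of (key, value) pairs,
    # compact and cache-friendly, with no hashing at all. Above it:
    # migrate once to a real hash table.
    def __init__(self):
        self.pairs = []      # small representation
        self.table = None    # large representation (after migration)

    def __setitem__(self, key, value):
        if self.table is not None:
            self.table[key] = value
            return
        for i, (k, _) in enumerate(self.pairs):
            if k == key:
                self.pairs[i] = (key, value)   # update in place
                return
        self.pairs.append((key, value))
        if len(self.pairs) > THRESHOLD:        # migrate at the cut-off
            self.table = dict(self.pairs)
            self.pairs = None

    def __getitem__(self, key):
        if self.table is not None:
            return self.table[key]
        for k, v in self.pairs:
            if k == key:
                return v
        raise KeyError(key)

    def __len__(self):
        return len(self.table) if self.table is not None else len(self.pairs)
```

The migration cost is paid once, at the cut-off, which is why (as 
Oskar notes below) the exact threshold tends not to matter much.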
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Oskar Linde wrote:

[...]
 Can't you just make the algorithm change at a certain size?

Maybe, but how to decide that point for the general case? Which 
measure should be taken?

-manfred
Dec 08 2005
parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Hi,

In article <Xns9726A6F56EEE0svv1999hotmailcom 63.105.9.61>, Manfred Nowak
says...
Oskar Linde wrote:

[...]
 Can't you just make the algorithm change at a certain size?

Maybe, but how to decide that point for the general case? Which measure should be taken?

It's extremely difficult to define a general case. I'd guess that in 
most cases, AAs are used for convenience, where neither speed nor size 
needs to be optimized. Then there are the specific uses with specific 
needs. It is impossible to make an optimal choice without knowing the 
case that should be optimized for. The goal for a general 
implementation should probably be to perform decently in many cases 
rather than excellently in a few. Performance tuning parameters are 
easy to change in the future.

I'd say, make your best guess at a general usage case, and tune the 
parameters to this case. Then verify your parameters against a number 
of different test cases (different hash functions: good/poor, 
slow/fast; different usage patterns: ratio of insertions, deletes and 
lookups; different sizes of key/value data; etc.). Make sure that 
there are no places where you are considerably worse than a reference 
implementation.

I think you will find that cut-off points between different algorithms 
are not very sensitive. They would probably be placed in the middle of 
a range where the algorithms perform similarly.

If by measure you mean time vs. space, I'd say that since a hash 
table's primary feature is time complexity, speed should be more 
important than memory usage, but there should also be a weighing 
between the two. A 10% speed increase at the cost of 100% more memory 
usage is probably not worth it for the general case.

(I feel like I've only written vague and obvious things. Apologies for 
the waste of bandwidth :)

Regards,

Oskar
Dec 08 2005
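The test matrix Oskar describes - varying the ratio of insertions, 
lookups and deletes - could be driven by a small harness along these 
lines (a Python sketch against the built-in dict; the operation mixes 
and key distribution are arbitrary assumptions, not anything from the 
thread):

```python
import random
import time

def run_workload(n_ops, mix, seed=0):
    # Drive a dict through a random mix of operations.
    # mix = (p_insert, p_lookup, p_delete), summing to 1.
    # Returns (elapsed seconds, final size).
    rng = random.Random(seed)
    m = {}
    start = time.perf_counter()
    for _ in range(n_ops):
        key = rng.randrange(n_ops // 4 + 1)   # keys collide on purpose
        r = rng.random()
        if r < mix[0]:
            m[key] = key                      # insert/update
        elif r < mix[0] + mix[1]:
            _ = m.get(key)                    # lookup, misses allowed
        else:
            m.pop(key, None)                  # delete if present
    return time.perf_counter() - start, len(m)

# e.g. compare an insert-heavy profile against a lookup-heavy one
t_ins, _ = run_workload(10000, (0.8, 0.15, 0.05))
t_look, _ = run_workload(10000, (0.1, 0.85, 0.05))
```

Running the same profiles against two candidate implementations, and 
sweeping the mix, is one way to check that no common usage pattern is 
"considerably worse than a reference implementation".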