
digitalmars.D - AA performance again

reply bearophile <bearophileHUGS lycos.com> writes:
Sometimes the simplest code is enough to show a problem. Other gentle people in
the #D channel say those timings aren't random:
http://leonardo-m.livejournal.com/64284.html

The D version compiled with dmd eats 425 MB of RAM with 10 million longs and
about 757 MB with 20 million longs.

Bye,
bearophile
Jun 09 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
Have you tried dcollections HashMap?

www.dsource.org/projects/dcollections

If you use it with Tango, the memory performance should be better (I have a 
special allocator that requires some of Tango's GC functions).

On my box (don't have python installed), 10m inserts took 15.3s, and the 
memory usage was 290MB (I think :) )

Of course, the Python numbers are much better, but the issue basically comes 
down to this: when you allocate more memory, the GC tries to run and free some 
up too often.  There's not a lot I can do about that.

-Steve 
Jun 09 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:
 Have you tried dcollections HashMap?
 www.dsource.org/projects/dcollections

Not yet, I may try them later.
 On my box (don't have python installed), 10m inserts took 15.3s, and the 
 memory usage was 290MB (I think :) )

If you want, there are "mobile" versions of Python that don't require an installation. May I ask what your CPU/clock is?
 Of course, the python numbers are much better, but the issue basically comes 
 down to when you allocate more memory, the GC tries to run and free some up 
 too often.  There's not a lot I can do about that.

I don't think it's just a matter of the memory allocator (but I have no proof of this); there are many different macro-level and subtle ways to implement a hash map in C.

Two more things:
- In my blog post I have added a third Python version that's better, it creates a true dict (AA):
  dict.fromkeys(xrange(20000000), 0)
- I have used D longs because Python ints are like that (Python 2.5 switches to multiprecision numbers when needed), but Python integers are objects (so they are "boxed"). Inside sets and dicts their memory usage for Python 2.5 is:
  set(int): 28.2 bytes/element
  dict(int:None): 36.2 bytes/element

Thank you for your answer,
bear hugs,
bearophile
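The third version above builds the whole dict in one call. A minimal sketch of it (Python 3 syntax, where `range` replaces `xrange`, and a much smaller N so it runs quickly):

```python
# One-call construction of an int -> 0 dict, as in the blog post's
# third version (the original used xrange and N = 20_000_000).
N = 100_000
aa = dict.fromkeys(range(N), 0)

print(len(aa))  # N entries, every value 0
```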
Jun 09 2008
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"bearophile" wrote
 Steven Schveighoffer:
 On my box (don't have python installed), 10m inserts took 15.3s, and the
 memory usage was 290MB (I think :) )

If you want, there are "mobile" versions of Python that don't require an installation. May I ask what your CPU/clock is?

CPU is dual-core Xeon 1.6GHz, only using one core of course. To give you an idea of the speedup, the builtin AA's take 40.5 seconds to do 10m inserts on my system.
 Of course, the python numbers are much better, but the issue basically 
 comes
 down to when you allocate more memory, the GC tries to run and free some 
 up
 too often.  There's not a lot I can do about that.

I don't think it's just a matter of memory allocator (but I have no proof of this), there are many different macro and subtle ways to implement an hash map in C.

At least in D, that seems to be the bottleneck. This is why I get such a huge speedup using a custom allocator. You might find this interesting: print out the time it takes for each million inserts :)
 Two more things:
 - In my blog post I have added a third Python version that's better, it 
 creates a true dict (AA):
 dict.fromkeys(xrange(20000000), 0)
 - I have used D longs because Python ints are like that (Python 2.5 uses 
 multiprecision numbers later), but Python integers are objects (so they 
 are "boxed"). Inside sets and dicts their memory usage for Python 2.5 is:
 set(int): 28.2 byte/element
 dict(int:None): 36.2 byte/element

You might also want to try setting the value of each element differently. I'm not sure it will make a difference, but perhaps Python is doing something clever if all the values are 0.

The times you quote seem awfully fast, but the complexity of the issues is really beyond my understanding.

-Steve
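Steven's suggestion is easy to check. A sketch of the measurement in modern Python 3 (`fill` is my own helper name; N is kept small so the sketch runs in a moment):

```python
import time

def fill(n, distinct):
    """Insert n integer keys; values are all 0, or equal to the key."""
    aa = {}
    for i in range(n):
        aa[i] = i if distinct else 0
    return aa

N = 200_000
t0 = time.perf_counter()
same = fill(N, distinct=False)   # every value is 0
t1 = time.perf_counter()
diff = fill(N, distinct=True)    # every value differs
t2 = time.perf_counter()
print("constant values: %.3f s" % (t1 - t0))
print("distinct values: %.3f s" % (t2 - t1))
```

In CPython a dict entry just holds a pointer to the value object, so no large difference should be expected; but that is exactly what the measurement would confirm or refute.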
Jun 09 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:
To give you an idea of the speedup, the builtin AA's take 40.5 seconds to do
10m inserts on my system.<

I see, quite different from about 15 s.
print out the time it takes for each million inserts :)

Okay, for 10 million inserts:
0.44 s
0.75 s
1.20 s
1.34 s
2.09 s
2.62 s
2.33 s
1.92 s
4.67 s
5.48 s
Total time: 24.0 s

The code I have used:

import std.conv: toUint;
import d.time: clock; // www.fantascienza.net/leonardo/so/libs_d.zip

void main() {
    size_t[long] aa;
    for (uint j; j < 10; ++j) {
        float t = clock();
        uint min = 1_000_000 * j;
        uint max = 1_000_000 * (j + 1);
        for (uint i = min; i < max; ++i)
            aa[i] = 0;
        printf("%f\n", clock() - t);
    }
}
You might also want to try setting the value of each element differently. I'm
not sure it will make a difference, but perhaps python is doing something
clever if all the values are 0.<

Okay, but I think the results will not change much. The D timings show very little change:

1_000_000    0.48 s
2_000_000    1.26 s
5_000_000    5.97 s
10_000_000  24.14 s

The D code I have used:

import std.conv: toUint;

void main(string[] args) {
    uint n = toUint(args[1]);
    size_t[long] aa;
    for (uint i; i < n; ++i)
        aa[i] = i;
}

The Python timings are exactly the same:

1_000_000    0.25 s
2_000_000    0.41 s
5_000_000    0.80 s
10_000_000   1.48 s
20_000_000   2.88 s

The Python code I have used:

import sys

def main(n):
    aa = {}
    for i in xrange(n):
        aa[i] = i

import psyco; psyco.bind(main)
main(int(sys.argv[1]))

Without Psyco the timings are similar but a little higher. Disabling the garbage collector makes the timings just a bit lower.
The times you quote seem awfully fast, but the complexity of the issues is
really beyond my understanding.<

Generally Python performs one or more dict accesses every time you read or write a variable, so dicts must be fast. In Python the sort (Timsort) and the dicts/sets are optimized to death; they are rather old now, and not even Raymond Hettinger seems able to improve them :-)

You may want to take a look here again, there are some more questions and answers:
http://leonardo-m.livejournal.com/64284.html

Bye,
bearophile
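That name-lookup claim can be made concrete: in CPython, instance and module namespaces really are dicts. A small sketch of my own (`Point` is just an illustrative class name):

```python
class Point(object):
    """Plain class; instances keep their attributes in a dict."""
    pass

p = Point()
p.x = 42                       # an attribute write is a dict store...
assert p.__dict__["x"] == 42   # ...visible in the instance namespace dict
p.__dict__["y"] = 7            # writing the dict directly creates an attribute
assert p.y == 7

# Module-level names live in a dict as well, reachable via globals().
assert isinstance(globals(), dict)
print("namespaces are dicts")
```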
Jun 09 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:
 CPU is dual-core Xeon 1.6GHz, only using one core of course.

But my new CPU has two cores, so I hope to see data structures able to use both cores at the same time, where possible... (I think in C# 3+ this is possible, sometimes.)

Bye,
bearophile
Jun 09 2008
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"bearophile" wrote
 Steven Schveighoffer:
 CPU is dual-core Xeon 1.6GHz, only using one core of course.

But my new CPU has two cores, so I hope to see data structures able to use both cores at the same time, where possible... (I think in C# 3+ this is possible, sometimes).

I don't see how hash insertions could be split into multiple-core operations without locking, and I think you would then introduce a huge slowdown. At the very least, memory allocation needs to be synchronized, and since that is where the current bottleneck is, we would end up at the same place.

-Steve
Jun 09 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:
 I don't see how hash insertions could be split into multiple-core operations 
 without locking, and I think you would then introduce a huge slowdown.

Okay, sorry for my comment then. (But some time ago I found a difficult book about the implementation of purely functional data structures, and among them there are hashes too. I don't know if they can be an improvement on a dual core.)

Bye,
bearophile
Jun 09 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Sometimes the simplest code is enough to show a problem. Other gentle
 people in the #D channel say those timings aren't random: 
 http://leonardo-m.livejournal.com/64284.html
 
 The D versions eat 425 MB with 10 million longs and about 757 MB RAM
 with dmd with 20 million longs.

I've investigated a similar issue before, and discovered it has nothing to do with AAs. What is consuming the time is that, as the memory usage grows, the garbage collector repeatedly kicks in to look for things to free. You can see this by wrapping the loop in:

std.gc.disable();
... loop ...
std.gc.enable();
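The same experiment has a direct analogue in Python (a sketch of my own; note that CPython's `gc` module only controls the cyclic collector, while reference counting keeps running, so the effect is much smaller than in D):

```python
import gc
import time

def fill(n):
    """Insert n keys into a dict, mirroring the D benchmark loop."""
    aa = {}
    for i in range(n):
        aa[i] = 0
    return aa

N = 500_000
gc.disable()                     # analogue of std.gc.disable()
t0 = time.perf_counter()
aa = fill(N)
elapsed = time.perf_counter() - t0
gc.enable()                      # analogue of std.gc.enable()
print("%d inserts: %.3f s" % (N, elapsed))
```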
Jun 09 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
What is consuming the time is, as the memory usage grows, the garbage collector
repeatedly kicks in to look for things to free. You can see this by wrapping
the loop in:

std.gc.disable();
... loop ...
std.gc.enable();<

The new timings disabling the GC are interesting:

1_000_000    0.26 s
2_000_000    0.57 s
5_000_000    2.11 s
10_000_000   8.79 s
20_000_000  29.78 s

The modified D code I have used is:

import std.conv: toUint;
import std.gc: disable, enable;

void main(string[] args) {
    uint n = toUint(args[1]);
    size_t[long] aa;
    disable;
    for (uint i; i < n; ++i)
        aa[i] = 0;
    enable;
}

------------

The original timings for D were:

100_000      0.05 s
500_000      0.20 s
1_000_000    0.47 s
2_000_000    1.25 s
5_000_000    5.93 s
10_000_000  24.34 s
20_000_000 129.8 s

The original timings for Python (version 3, the most fair) were, best of 3:

10_000_000: 1.56 s
20_000_000: 3.04 s

So even without GC, with N = 20 million, Python is about 9.8 times faster, despite using boxed (64-bit, for this range) integers.

Thank you for your answer,
bearophile
Jun 09 2008
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:g2k51g$2j0j$1 digitalmars.com...
 Walter Bright:
What is consuming the time is, as the memory usage grows, the garbage 
collector repeatedly kicks in to look for things to free. You can see this 
by wrapping the loop in:

std.gc.disable();
... loop ...
std.gc.enable();

[snip]

For further reference, disabling the GC yields 6.5 seconds to do 10m inserts on my system using dcollections' HashMap. I think it would be faster on your system, since your system's AAs do much better than mine. Here is my code:

import dcollections.HashMap;
import tango.core.Memory;

void main() {
    const uint N = 10_000_000;
    auto aa = new HashMap!(long, size_t);
    tango.core.Memory.GC.disable();
    for (uint i; i < N; ++i)
        aa[i] = 0;
    tango.core.Memory.GC.enable();
}

-Steve
Jun 09 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
     size_t[long] aa;

Python uses 32 bit integers, so the equivalent D AA would be:

    size_t[int] aa;

D longs are 64 bits.
Jun 09 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 Python uses 32 bit integers, so the equivalent D AA would be:
 	size_t[int] aa;
 D longs are 64 bits.

As usual you are right and I am wrong :-) My purpose is to have a good language to use, and you can see I like D :-) (I conflate AAs into the language because they are built in.)

Python 2.5 uses 32-bit signed numbers by default, but they are objects, and they are converted to multiprecision numbers whenever necessary:
>>> a = 2147483647
>>> type(a)
<type 'int'>
>>> b = a + 1
>>> b
2147483648L
>>> type(b)
<type 'long'>
In Python 3 all numbers will be longs (at the moment they are about 25-30% slower, I think, but Python 3 is still in beta).

------------------

I have tested with uints then (no GC), the new D code:

import std.conv: toUint;
import std.gc: disable, enable;

void main(string[] args) {
    uint n = toUint(args[1]);
    size_t[uint] aa;
    disable;
    for (uint i; i < n; ++i)
        aa[i] = 0;
    enable;
}

The timings seem the same:

1_000_000    0.27 s
2_000_000    0.59 s
5_000_000    2.14 s
10_000_000   8.87 s
20_000_000  29.9 s

If someone can confirm such timings...

Bye and thank you,
bearophile
Jun 09 2008
parent reply davidl <davidl 126.com> writes:
On Tue, 10 Jun 2008 06:02:00 +0800, bearophile <bearophileHUGS lycos.com>  
wrote:

 I have tested with uints then (no GC) [...] The timings seem the same:
 1_000_000 0.27 s
 2_000_000 0.59 s
 5_000_000 2.14 s
 10_000_000 8.87 s
 20_000_000 29.9 s
 If someone can confirm such timings...

[snip]
I think the most important issue is that D's AA (Phobos) is not balanced; Tango should perform worse in this case. But this case is almost a useless showcase for testing: AA keys are seldom organized in this way, and this pattern is actually the worst case for D's AA. You should first generate some random strings to a file, and read this file to test both Python's and D's AAs. It's ridiculous that Python can be any bit faster than D. The crucial bottleneck I can see is the bucket tree not being balanced in D. Once this is fixed in Phobos, you should get the best-performing AA; other AAs won't be exponentially faster than it, and if one is, that's your fault.

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
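The Python side of the benchmark davidl proposes could be sketched like this (a hypothetical harness of my own; it generates the keys with a fixed seed instead of reading a file, for brevity, but the same key set could be written out and fed to the D program):

```python
import random
import string

def random_keys(n, length=12, seed=1234):
    """Generate n pseudo-random lowercase keys (duplicates possible)."""
    rng = random.Random(seed)  # fixed seed so both programs could share keys
    letters = string.ascii_lowercase
    return ["".join(rng.choice(letters) for _ in range(length))
            for _ in range(n)]

keys = random_keys(50_000)

# Insert with unordered string keys, unlike the 0..N-1 integer pattern
# that davidl considers a worst case for the Phobos AA.
aa = {}
for k in keys:
    aa[k] = 0

print(len(aa), "distinct keys out of", len(keys))
```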
Jun 13 2008
parent Lars Ivar Igesund <larsivar igesund.net> writes:
davidl wrote:

 [snip]

I think the most important issue is D's AA(Phobos) is not balanaced. Tango should perform worse in this case.

Curious about your argumentation here (or test results showing that this is true), if you intended to write what you did.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Jun 13 2008
prev sibling parent Lutger <lutger.blijdestin gmail.com> writes:
bearophile wrote:
...

I have some different timings:

dmd without GC:
real 7.51
user 5.25
sys 0.79

dmd with GC:
real 206.53
user 167.50
sys 2.14

python version 1 not using psyco:
real 8.39
user 5.81
sys 1.06

python version 2:
real 4.86
user 3.15
sys 0.82

python version 3:
real 5.74
user 3.58
sys 1.10

I used python version 2.5.2 and dmd with the Tango runtime.

Here's the D version:

import tango.core.Memory;

void main() {

    GC.disable();
    const uint N = 20_000_000;
    size_t[long] aa;
    for (uint i; i < N; ++i)
        aa[i] = 0;
}
Jun 10 2008
prev sibling parent reply Robert Fraser <fraserofthenight gmail.com> writes:
bearophile wrote:
 Sometimes the simplest code is enough to show a problem. Other gentle people
in the #D channel say those timings aren't random:
 http://leonardo-m.livejournal.com/64284.html
 
 The D versions eat 425 MB with 10 million longs and about 757 MB RAM with dmd
with 20 million longs.
 
 Bye,
 bearophile

If the bottleneck is the GC, then it might be interesting to see how a dictionary type that uses less memory (i.e. Judy or another tree-based implementation) compares instead of using a hash (obviously with a lot of pre-allocation).

If the bottleneck is in aaA.d, then maybe that just needs to be optimized, and I'm sure at least the Tango folks have the knowledge to do that. I'm guessing they've done all they can already, though.
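To give an idea of the tree-shaped alternative, here is a minimal sketch of a 10-way digit trie over integer keys (my own illustration in Python; real Judy arrays are far more compact and cache-friendly than this):

```python
class DigitTrie(object):
    """Maps non-negative ints to values via a 10-way trie over decimal digits."""

    def __init__(self):
        self.root = {}

    def insert(self, key, value):
        node = self.root
        for ch in str(key):              # walk one level per decimal digit
            node = node.setdefault(ch, {})
        node["value"] = value            # non-digit key marks the leaf

    def get(self, key, default=None):
        node = self.root
        for ch in str(key):
            if ch not in node:
                return default
            node = node[ch]
        return node.get("value", default)

t = DigitTrie()
for i in (0, 7, 42, 420, 1_000_000):
    t.insert(i, i * 2)
print(t.get(42), t.get(421))  # 84 None
```

Keys sharing a prefix share nodes, so a dense key range like 0..N-1 reuses most of the structure, which is exactly where a benchmark over sequential keys would flatter a trie.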
Jun 10 2008
parent Robert Fraser <fraserofthenight gmail.com> writes:
Robert Fraser wrote:
 i.e. Judy or another tree-based 

*trie, although tree-based would be something to look into, too.
Jun 10 2008