
digitalmars.D - AA performance again

reply bearophile <bearophileHUGS lycos.com> writes:
Sometimes the simplest code is enough to show a problem. Other gentle people in
the #D channel say those timings aren't random:
http://leonardo-m.livejournal.com/64284.html

The D version compiled with dmd eats 425 MB of RAM with 10 million longs and
about 757 MB with 20 million longs.

Bye,
bearophile
Jun 09 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
Have you tried dcollections HashMap?

www.dsource.org/projects/dcollections

If you use it with Tango, the memory performance should be better (I have a 
special allocator that requires some of Tango's GC functions).

On my box (don't have python installed), 10m inserts took 15.3s, and the 
memory usage was 290MB (I think :) )

Of course, the Python numbers are much better, but the issue basically comes 
down to this: when you allocate more memory, the GC tries to run and free some 
up too often.  There's not a lot I can do about that.

-Steve 
Jun 09 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:
 Have you tried dcollections HashMap?
 www.dsource.org/projects/dcollections

Not yet, I may try them later.
 On my box (don't have python installed), 10m inserts took 15.3s, and the 
 memory usage was 290MB (I think :) )

If you want, there are "mobile" versions of Python that don't require an installation. May I ask what your CPU/clock is?
 Of course, the python numbers are much better, but the issue basically comes 
 down to when you allocate more memory, the GC tries to run and free some up 
 too often.  There's not a lot I can do about that.

I don't think it's just a matter of the memory allocator (but I have no proof of this); there are many different macro-level and subtle ways to implement a hash map in C.

Two more things:
- In my blog post I have added a third Python version that's better, it creates a true dict (AA):
  dict.fromkeys(xrange(20000000), 0)
- I have used D longs because Python ints are like that (Python 2.5 switches to multiprecision numbers when needed), but Python integers are objects (so they are "boxed"). Inside sets and dicts their memory usage for Python 2.5 is:
  set(int): 28.2 bytes/element
  dict(int:None): 36.2 bytes/element

Thank you for your answer,
bear hugs,
bearophile
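The third version above builds the whole dict in one call. A minimal sketch of it (Python 3 syntax, where `range` replaces `xrange`, and a much smaller N so it runs quickly):

```python
# One-call construction of an int -> 0 dict, as in the blog post's
# third version (the original used xrange and N = 20_000_000).
N = 100_000
aa = dict.fromkeys(range(N), 0)

print(len(aa))  # N entries, every value 0
```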
Jun 09 2008
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"bearophile" wrote
 Steven Schveighoffer:
 On my box (don't have python installed), 10m inserts took 15.3s, and the
 memory usage was 290MB (I think :) )

If you want, there are "mobile" versions of Python that don't require an installation. May I ask what your CPU/clock is?

CPU is dual-core Xeon 1.6GHz, only using one core of course. To give you an idea of the speedup, the builtin AA's take 40.5 seconds to do 10m inserts on my system.
 Of course, the python numbers are much better, but the issue basically 
 comes
 down to when you allocate more memory, the GC tries to run and free some 
 up
 too often.  There's not a lot I can do about that.

I don't think it's just a matter of memory allocator (but I have no proof of this), there are many different macro and subtle ways to implement an hash map in C.

At least in D, that seems to be the bottleneck. This is why I get such a huge speedup using a custom allocator. You might find this interesting: print out the time it takes for each million inserts :)
 Two more things:
 - In my blog post I have added a third Python version that's better, it 
 creates a true dict (AA):
 dict.fromkeys(xrange(20000000), 0)
 - I have used D longs because Python ints are like that (Python 2.5 uses 
 multiprecision numbers later), but Python integers are objects (so they 
 are "boxed"). Inside sets and dicts their memory usage for Python 2.5 is:
 set(int): 28.2 byte/element
 dict(int:None): 36.2 byte/element

You might also want to try setting the value of each element differently. I'm not sure it will make a difference, but perhaps Python is doing something clever if all the values are 0.

The times you quote seem awfully fast, but the complexity of the issues is really beyond my understanding.

-Steve
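Steven's suggestion is easy to check. A sketch of the measurement in modern Python 3 (`fill` is my own helper name; N is kept small so the sketch runs in a moment):

```python
import time

def fill(n, distinct):
    """Insert n integer keys; values are all 0, or equal to the key."""
    aa = {}
    for i in range(n):
        aa[i] = i if distinct else 0
    return aa

N = 200_000
t0 = time.perf_counter()
same = fill(N, distinct=False)   # every value is 0
t1 = time.perf_counter()
diff = fill(N, distinct=True)    # every value differs
t2 = time.perf_counter()
print("constant values: %.3f s" % (t1 - t0))
print("distinct values: %.3f s" % (t2 - t1))
```

In CPython a dict entry just holds a pointer to the value object, so no large difference should be expected; but that is exactly what the measurement would confirm or refute.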
Jun 09 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:
To give you an idea of the speedup, the builtin AA's take 40.5 seconds to do
10m inserts on my system.<

I see, quite different from about 15 s.
print out the time it takes for each million inserts :)

Okay, for 10 million inserts:
0.44 s
0.75 s
1.20 s
1.34 s
2.09 s
2.62 s
2.33 s
1.92 s
4.67 s
5.48 s
Total time: 24.0 s

The code I have used:

import std.conv: toUint;
import d.time: clock; // www.fantascienza.net/leonardo/so/libs_d.zip

void main() {
    size_t[long] aa;
    for (uint j; j < 10; ++j) {
        float t = clock();
        uint min = 1_000_000 * j;
        uint max = 1_000_000 * (j + 1);
        for (uint i = min; i < max; ++i)
            aa[i] = 0;
        printf("%f\n", clock() - t);
    }
}
You might also want to try setting the value of each element differently. I'm
not sure it will make a difference, but perhaps python is doing something
clever if all the values are 0.<

Okay, but I think the results will not change much. The D timings show very little change:

1_000_000    0.48 s
2_000_000    1.26 s
5_000_000    5.97 s
10_000_000  24.14 s

The D code I have used:

import std.conv: toUint;

void main(string[] args) {
    uint n = toUint(args[1]);
    size_t[long] aa;
    for (uint i; i < n; ++i)
        aa[i] = i;
}

The Python timings are exactly the same:

1_000_000    0.25 s
2_000_000    0.41 s
5_000_000    0.80 s
10_000_000   1.48 s
20_000_000   2.88 s

The Python code I have used:

import sys

def main(n):
    aa = {}
    for i in xrange(n):
        aa[i] = i

import psyco; psyco.bind(main)
main(int(sys.argv[1]))

Without Psyco the timings are similar but a little higher. Disabling the garbage collector makes the timings just a bit lower.
The times you quote seem awfully fast, but the complexity of the issues is
really beyond my understanding.<

Generally Python performs one or more dict accesses every time you read or write a variable, so dicts must be fast. In Python the sort (Timsort) and the dicts/sets are optimized to death; they are rather old now, and not even Raymond Hettinger seems able to improve them :-)

You may want to take a look here again, there are some more questions and answers:
http://leonardo-m.livejournal.com/64284.html

Bye,
bearophile
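That name-lookup claim can be made concrete: in CPython, instance and module namespaces really are dicts. A small sketch of my own (`Point` is just an illustrative class name):

```python
class Point(object):
    """Plain class; instances keep their attributes in a dict."""
    pass

p = Point()
p.x = 42                       # an attribute write is a dict store...
assert p.__dict__["x"] == 42   # ...visible in the instance namespace dict
p.__dict__["y"] = 7            # writing the dict directly creates an attribute
assert p.y == 7

# Module-level names live in a dict as well, reachable via globals().
assert isinstance(globals(), dict)
print("namespaces are dicts")
```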
Jun 09 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:
 CPU is dual-core Xeon 1.6GHz, only using one core of course.

But my new CPU has two cores, so I hope to see data structures able to use both cores at the same time, where possible... (I think in C# 3+ this is possible, sometimes.)

Bye,
bearophile
Jun 09 2008
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"bearophile" wrote
 Steven Schveighoffer:
 CPU is dual-core Xeon 1.6GHz, only using one core of course.

But my new CPU has two cores, so I hope to see data structures able to use both cores at the same time, where possible... (I think in C# 3+ this is possible, sometimes).

I don't see how hash insertions could be split into multiple-core operations without locking, and I think you would then introduce a huge slowdown. At the very least, memory allocation needs to be synchronized, and since that is where the current bottleneck is, we would end up at the same place.

-Steve
Jun 09 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:
 I don't see how hash insertions could be split into multiple-core operations 
 without locking, and I think you would then introduce a huge slowdown.

Okay, sorry for my comment then. (But some time ago I found a difficult book about the implementation of purely functional data structures, and among them there are hashes too. I don't know if they can be an improvement on a dual core.)

Bye,
bearophile
Jun 09 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Sometimes the simplest code is enough to show a problem. Other gentle
 people in the #D channel say those timings aren't random: 
 http://leonardo-m.livejournal.com/64284.html
 
 The D versions eat 425 MB with 10 million longs and about 757 MB RAM
 with dmd with 20 million longs.

I've investigated a similar issue before, and discovered it has nothing to do with AAs. What is consuming the time is that, as the memory usage grows, the garbage collector repeatedly kicks in to look for things to free. You can see this by wrapping the loop in:

std.gc.disable();
... loop ...
std.gc.enable();
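The same experiment has a direct analogue in Python (a sketch of my own; note that CPython's `gc` module only controls the cyclic collector, while reference counting keeps running, so the effect is much smaller than in D):

```python
import gc
import time

def fill(n):
    """Insert n keys into a dict, mirroring the D benchmark loop."""
    aa = {}
    for i in range(n):
        aa[i] = 0
    return aa

N = 500_000
gc.disable()                     # analogue of std.gc.disable()
t0 = time.perf_counter()
aa = fill(N)
elapsed = time.perf_counter() - t0
gc.enable()                      # analogue of std.gc.enable()
print("%d inserts: %.3f s" % (N, elapsed))
```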
Jun 09 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
What is consuming the time is, as the memory usage grows, the garbage collector
repeatedly kicks in to look for things to free. You can see this by wrapping
the loop in:

std.gc.disable();
... loop ...
std.gc.enable();<

The new timings disabling the GC are interesting:

1_000_000    0.26 s
2_000_000    0.57 s
5_000_000    2.11 s
10_000_000   8.79 s
20_000_000  29.78 s

The modified D code I have used is:

import std.conv: toUint;
import std.gc: disable, enable;

void main(string[] args) {
    uint n = toUint(args[1]);
    size_t[long] aa;
    disable;
    for (uint i; i < n; ++i)
        aa[i] = 0;
    enable;
}

------------

The original timings for D were:

100_000      0.05 s
500_000      0.20 s
1_000_000    0.47 s
2_000_000    1.25 s
5_000_000    5.93 s
10_000_000  24.34 s
20_000_000 129.8 s

The original timings for Python (version 3, the most fair) were, best of 3:

10_000_000: 1.56 s
20_000_000: 3.04 s

So even without GC, with N = 20 million, Python is about 9.8 times faster, despite using boxed (64-bit, for this range) integers.

Thank you for your answer,
bearophile
Jun 09 2008
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:g2k51g$2j0j$1 digitalmars.com...
 Walter Bright:
What is consuming the time is, as the memory usage grows, the garbage 
collector repeatedly kicks in to look for things to free. You can see this 
by wrapping the loop in:

std.gc.disable();
... loop ...
std.gc.enable();

[snip]

For further reference, disabling the GC yields 6.5 seconds to do 10m inserts on my system using dcollections' HashMap. I think it would be faster on your system, since your system's AAs do much better than mine. Here is my code:

import dcollections.HashMap;
import tango.core.Memory;

void main() {
    const uint N = 10_000_000;
    auto aa = new HashMap!(long, size_t);
    tango.core.Memory.GC.disable();
    for (uint i; i < N; ++i)
        aa[i] = 0;
    tango.core.Memory.GC.enable();
}

-Steve
Jun 09 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
     size_t[long] aa;

Python uses 32 bit integers, so the equivalent D AA would be:

    size_t[int] aa;

D longs are 64 bits.
Jun 09 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 Python uses 32 bit integers, so the equivalent D AA would be:
 	size_t[int] aa;
 D longs are 64 bits.

As usual you are right and I am wrong :-) My purpose is to have a good language to use, and you can see I like D :-) (I conflate AAs into the language because they are built in.)

Python 2.5 uses 32-bit signed numbers by default, but they are objects, and they are converted to multiprecision numbers whenever necessary:
>>> a = 2147483647
>>> type(a)
<type 'int'>
>>> b = a + 1
>>> b
2147483648L
>>> type(b)
<type 'long'>
In Python 3 all numbers will be longs (at the moment they are about 25-30% slower, I think, but Python 3 is still in beta).

------------------

I have tested with uints then (no GC), the new D code:

import std.conv: toUint;
import std.gc: disable, enable;

void main(string[] args) {
    uint n = toUint(args[1]);
    size_t[uint] aa;
    disable;
    for (uint i; i < n; ++i)
        aa[i] = 0;
    enable;
}

The timings seem the same:

1_000_000    0.27 s
2_000_000    0.59 s
5_000_000    2.14 s
10_000_000   8.87 s
20_000_000  29.9 s

If someone can confirm such timings...

Bye and thank you,
bearophile
Jun 09 2008
parent reply davidl <davidl 126.com> writes:
On Tue, 10 Jun 2008 06:02:00 +0800, bearophile <bearophileHUGS lycos.com>  
wrote:

 I have tested with uints then (no GC) [...] The timings seem the same:
 1_000_000 0.27 s
 2_000_000 0.59 s
 5_000_000 2.14 s
 10_000_000 8.87 s
 20_000_000 29.9 s
 If someone can confirm such timings...

[snip]
I think the most important issue is that D's AA (Phobos) is not balanced; Tango should perform worse in this case. But this case is almost a useless showcase for testing: AA keys are seldom organized in this way, and this pattern is actually the worst case for D's AA. You should first generate some random strings to a file, and read this file to test both Python's and D's AAs. It's ridiculous that Python can be any bit faster than D. The crucial bottleneck I can see is the bucket tree not being balanced in D. Once this is fixed in Phobos, you should get the best-performing AA; other AAs won't be exponentially faster than it, and if one is, that's your fault.

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
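The Python side of the benchmark davidl proposes could be sketched like this (a hypothetical harness of my own; it generates the keys with a fixed seed instead of reading a file, for brevity, but the same key set could be written out and fed to the D program):

```python
import random
import string

def random_keys(n, length=12, seed=1234):
    """Generate n pseudo-random lowercase keys (duplicates possible)."""
    rng = random.Random(seed)  # fixed seed so both programs could share keys
    letters = string.ascii_lowercase
    return ["".join(rng.choice(letters) for _ in range(length))
            for _ in range(n)]

keys = random_keys(50_000)

# Insert with unordered string keys, unlike the 0..N-1 integer pattern
# that davidl considers a worst case for the Phobos AA.
aa = {}
for k in keys:
    aa[k] = 0

print(len(aa), "distinct keys out of", len(keys))
```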
Jun 13 2008
parent Lars Ivar Igesund <larsivar igesund.net> writes:
davidl wrote:

 [snip]

I think the most important issue is D's AA(Phobos) is not balanaced. Tango should perform worse in this case.

Curious about your argumentation here (or test results showing that this is true), if you intended to write what you did.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Jun 13 2008
prev sibling parent Lutger <lutger.blijdestin gmail.com> writes:
bearophile wrote:
...

I have some different timings:

dmd without GC:
real 7.51
user 5.25
sys 0.79

dmd with GC:
real 206.53
user 167.50
sys 2.14

python version 1 not using psyco:
real 8.39
user 5.81
sys 1.06

python version 2:
real 4.86
user 3.15
sys 0.82

python version 3:
real 5.74
user 3.58
sys 1.10

I used python version 2.5.2 and dmd with the Tango runtime.

Here's the D version:

import tango.core.Memory;

void main() {

    GC.disable();
    const uint N = 20_000_000;
    size_t[long] aa;
    for (uint i; i < N; ++i)
        aa[i] = 0;
}
Jun 10 2008
prev sibling parent reply Robert Fraser <fraserofthenight gmail.com> writes:
bearophile wrote:
 Sometimes the simplest code is enough to show a problem. Other gentle people
in the #D channel say those timings aren't random:
 http://leonardo-m.livejournal.com/64284.html
 
 The D versions eat 425 MB with 10 million longs and about 757 MB RAM with dmd
with 20 million longs.
 
 Bye,
 bearophile

If the bottleneck is the GC, then it might be interesting to see how a dictionary type that uses less memory (i.e. Judy or another tree-based implementation) compares instead of using a hash (obviously with a lot of pre-allocation).

If the bottleneck is in aaA.d, then maybe that just needs to be optimized, and I'm sure at least the Tango folks have the knowledge to do that. I'm guessing they've done all they can already, though.
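To give an idea of the tree-shaped alternative, here is a minimal sketch of a 10-way digit trie over integer keys (my own illustration in Python; real Judy arrays are far more compact and cache-friendly than this):

```python
class DigitTrie(object):
    """Maps non-negative ints to values via a 10-way trie over decimal digits."""

    def __init__(self):
        self.root = {}

    def insert(self, key, value):
        node = self.root
        for ch in str(key):              # walk one level per decimal digit
            node = node.setdefault(ch, {})
        node["value"] = value            # non-digit key marks the leaf

    def get(self, key, default=None):
        node = self.root
        for ch in str(key):
            if ch not in node:
                return default
            node = node[ch]
        return node.get("value", default)

t = DigitTrie()
for i in (0, 7, 42, 420, 1_000_000):
    t.insert(i, i * 2)
print(t.get(42), t.get(421))  # 84 None
```

Keys sharing a prefix share nodes, so a dense key range like 0..N-1 reuses most of the structure, which is exactly where a benchmark over sequential keys would flatter a trie.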
Jun 10 2008
parent Robert Fraser <fraserofthenight gmail.com> writes:
Robert Fraser wrote:
 i.e. Judy or another tree-based 

*trie, although tree-based would be something to look into, too.
Jun 10 2008