
digitalmars.D.learn - Profiling graph library

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
Hello all,

I thought I'd share some profiling results from the graph/network library that
I've been working on -- see:
https://github.com/WebDrake/Dgraph
http://braingam.es/2013/07/complex-networks-in-d/
http://braingam.es/2013/07/complex-networks-in-d-part-2-building-the-network/

I'd like to get others' thoughts and insight on these -- obviously I have some
ideas of my own about how to interpret them and move forward, but it's always
good to have feedback (and there are several things I'd like to make sure I have
my facts straight on before making any more "public" comment).

The basic data structure, which is taken from the igraph library, consists of a
pair of arrays holding the head and tail vertices of each edge, together with
indices that sort the edges by head or tail vertex, and cumulative sums of the
number of edges whose head/tail ID is less than a given value:
https://github.com/WebDrake/Dgraph/blob/master/dgraph/graph.d#L33-L38

Taken together this offers a very memory-efficient structure that should also be
able to minimize the total number of allocations required to build a graph.
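
In rough outline the structure looks something like this (a sketch only, not the
actual Dgraph definitions; the field names follow the ones that appear in the
code later in this thread):

    class Graph
    {
        size_t[] _head;       // _head[e] = head vertex ID of edge e
        size_t[] _tail;       // _tail[e] = tail vertex ID of edge e
        size_t[] _indexHead;  // edge IDs sorted by head vertex
        size_t[] _indexTail;  // edge IDs sorted by tail vertex
        size_t[] _sumHead;    // _sumHead[v] = no. of edges whose head ID is < v
        size_t[] _sumTail;    // _sumTail[v] = no. of edges whose tail ID is < v
    }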

When I started working on the code the initial goal was to have something that
gave a very simple and obvious representation of the algorithms at work, which
would be easy to understand.  So, for example, I used std.algorithm.map quite
freely to make an "easy" representation of how a vertex's neighbours were
calculated:
https://github.com/WebDrake/Dgraph/blob/master/dgraph/graph.d#L260-L286
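
Schematically, the map-based calculation goes something like this (a simplified
sketch of the linked code, not a verbatim copy; it sits inside the Graph class
sketched above and needs map from std.algorithm plus chain and iota from
std.range):

    // neighbours of v: heads of edges whose tail is v, chained with
    // tails of edges whose head is v
    auto neighbours(immutable size_t v)
    {
        return chain(iota(_sumTail[v], _sumTail[v + 1])
                         .map!(a => _head[_indexTail[a]]),
                     iota(_sumHead[v], _sumHead[v + 1])
                         .map!(a => _tail[_indexHead[a]]));
    }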

It turns out this results in a fairly major speed hit that stems from a variety
of factors.  The following is the result of profiling a simple function that
just iterates over all the neighbours of each vertex in turn, and sums the
neighbour IDs.  The code is attached in neighbours.d.
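
In outline the benchmark does something like this (a sketch of the idea only,
not the attached file; vertexCount stands in for however the graph reports its
number of vertices):

    // sum the IDs of all neighbours of all vertices
    size_t checkNeighbours(Graph)(ref Graph g)
    {
        size_t total = 0;
        foreach (v; 0 .. g.vertexCount)
        {
            foreach (n; g.neighbours(v))
                total += n;
        }
        return total;
    }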

---------------------------------------------------------------------------------
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 19.20      0.24     0.24 50000000     0.00     0.00
_D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMxFymZS3std5range380__T5chainTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ5chainFS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ6Result18__T10__lambda46TmZ10__lambda46MFNbNfmZxm
 18.40      0.47     0.23
_D2gc3gcx2GC12mallocNoSyncMFmkPmZPv
 15.60      0.67     0.20 50000000     0.00     0.00
_D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMxFymZS3std5range380__T5chainTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ5chainFS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ6Result18__T10__lambda48TmZ10__lambda48MFNbNfmZxm
  9.60      0.79     0.12   500000     0.00     0.00
_D10neighbours24__T15checkNeighboursVb0Z15checkNeighboursFKC6dgraph5graph13__T5GraphVb0Z5GraphZm
  8.00      0.89     0.10 25000000     0.00     0.00
_D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMxFymZS3std5range380__T5chainTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ5chainFS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ6Result
  8.00      0.99     0.10                            
_D2gc3gcx2GC6mallocMFmkPmZPv
  4.80      1.05     0.06                            
_D2gc6gcbits6GCBits3setMFmZv
  4.80      1.11     0.06                            
_D2gc3gcx3Gcx11fullcollectMFZm
  4.00      1.16     0.05                            
_D2gc6gcbits6GCBits4testMFmZm
  2.40      1.19     0.03
_D4core4sync5mutex5Mutex6unlockMFNeZv
  1.60      1.21     0.02                            
_D2gc6gcbits6GCBits5allocMFmZv
  1.20      1.22     0.02
_D3std4conv15__T6toImplTiTkZ6toImplFNaNfkZi15__dgliteral2501MFNaNfZC6object9Throwable
  1.20      1.24     0.02                             gc_malloc
  0.40      1.24     0.01
_D4core4sync5mutex5Mutex12MonitorProxy11__xopEqualsFKxS4core4sync5mutex5Mutex12MonitorProxyKxS4core4sync5mutex5Mutex12MonitorProxyZb
  0.40      1.25     0.01
_D4core4sync5mutex5Mutex4lockMFNeZv
  0.40      1.25     0.01                             gc_clrAttr
---------------------------------------------------------------------------------

What's striking about this is that (i) the various range/map-based functions
carry a substantial hit in and of themselves, and (ii) there's also a fairly
nasty hit from the GC.  (If I run this with a directed graph, which avoids the
use of std.range.chain, there's a little bit of saving, but only a very little.)

As an experiment, I wrote some alternative code that provides a cache for all
the neighbours of all the vertices.  This takes an overall memory hit but means
that one can avoid having to recalculate the list of neighbours so long as the
graph itself doesn't change:
https://github.com/WebDrake/Dgraph/blob/cache/dgraph/graph.d#L42-L53
https://github.com/WebDrake/Dgraph/blob/cache/dgraph/graph.d#L358-L379

The speed benefits are immediate:

---------------------------------------------------------------------------------
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 64.29      0.14     0.14 25000000     0.00     0.00
_D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMFNaNbNfymZAm
 33.33      0.21     0.07   500000     0.00     0.00
_D10neighbours24__T15checkNeighboursVb0Z15checkNeighboursFNaNbNfKC6dgraph5graph13__T5GraphVb0Z5GraphZm
  2.38      0.21     0.01
_D6dgraph5graph13__T5GraphVb0Z5Graph13incidentEdgesMFNaNbNfymZAm
---------------------------------------------------------------------------------

In fact there's a double benefit here, because (i) it's more efficient to write
the neighbour values into an array than to use a map, and (ii) because we're
caching them, we only have to calculate them once.  The extra memory required
may be bearable given that the overall constraints on network size are probably
going to come down to something other than RAM.

However, I think it might be worth avoiding this if possible.  So I'd like to be
sure that I understand -- first, are map, chain, etc. likely to become more
efficient in terms of speed and memory use?  I recall that there has been some
concern over unnecessary allocations, and I'm not sure if std.algorithm.map or
std.range.chain are victims here.  Alternatively, there might be some benefit in
replacing the maps, chains, etc. with custom data structures that provide
_exactly_ what is wanted.

Finally, whether using the map approach or the cache, neither seems as fast in
benchmarks as the kind of simple ad-hoc structure I might use for a graph in a
simulation, such as,

    size_t[][] neighbours;

My instinct is that the latter approach will start to choke when you start
trying to build a really big graph because of the many, many reallocations
required to build the structure.  It will probably also become unwieldy if you
start wanting to allow arbitrary collections of edge and vertex properties
(weight, colour, etc.) because it doesn't allow for a very elegant way to store
and retrieve those properties.
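
For illustration, building that structure amounts to something like this (a
sketch; addEdge is a hypothetical name), and it's the repeated appends that
cause the reallocations:

    // assumes neighbours.length has already been set to the number of vertices
    void addEdge(size_t head, size_t tail)
    {
        neighbours[head] ~= tail;   // each append may reallocate that vertex's array
        neighbours[tail] ~= head;
    }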

Suffice it to say that it looks like there's an interesting range of compromises
to make depending on needs, and it's quite likely there's a benefit in providing
different graph data structures depending on what scale of graph (and what kind
of graph properties) you want to work with.

Anyway, although this is a very limited bit of profiling work, I'd be very
grateful for anyone's thoughts or insight on the issues raised here.

Thanks & best wishes,

     -- Joe
Jul 30 2013
"bearophile" <bearophileHUGS lycos.com> writes:
There are many ways to design a graph data structure. All of them 
have tradeoffs; none of them can be very good for everything.

For your benchmarks are you using the ldc2 compiler with correct 
compilation switches?

map() should not allocate memory.

Sometimes lazy code is faster and sometimes it's slower. A good 
compiler (like Haskell GHC) should be able to optimize map well.

Also try to use a bitvector from Phobos instead of bool[].

Sometimes you can keep memory allocation low, performance almost 
acceptable, and code easy to read using a pre-allocated scratch 
space plus using map() and copy() from std.algorithm.

Bye,
bearophile
Jul 30 2013
"John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 30 July 2013 at 23:02:40 UTC, bearophile wrote:
 Sometimes you can keep memory allocation low, performance 
 almost acceptable, and code easy to read using a pre-allocated 
 scratch space plus using map() and copy() from std.algorithm.
Any chance of some nice examples of this?
Jul 30 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
John Colvin:

 Any chance of some nice examples of this?
Allocate a buffer somewhere, on the stack or heap, and then instead of using map()+array() as in that code, use map()+copy() plus a test to be sure the space is sufficient.

But in the end I don't know how well even ldc2 will compile that. You have to take a look at the produced asm with the "-output-s" switch of ldc2.

Bye,
bearophile
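
P.S. In code, something like this, roughly (just a sketch; the buffer size and
the x => x * 2 operation are only for illustration):

    import std.algorithm : copy, map;
    import std.range : iota;

    void main()
    {
        size_t[64] buf;                    // pre-allocated scratch space
        auto seq = iota(0, 10);            // stand-in for the real input sequence
        assert(seq.length <= buf.length);  // test that the space is sufficient

        // copy() returns the unfilled tail of buf, so the used part is:
        auto left = seq.map!(x => x * 2).copy(buf[]);
        auto result = buf[0 .. buf.length - left.length];

        assert(result.length == 10 && result[3] == 6);
    }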
Jul 30 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 07/31/2013 01:02 AM, bearophile wrote:
 There are many ways to design a graph data structure. All of them have
 tradeoffs; none of them can be very good for everything.
Sure, it's just that in this case the reference code (igraph) is _much_ more performant. I'm keen to try and identify the major issues that are holding the D code back. I _could_ just copy the C algorithms from igraph into my own code, but nothing would be learned from that, and it would seem to defeat the purpose of using a more elegant language like D.
 For your benchmarks are you using the ldc2 compiler with correct compilation
 switches?
No, gdmd -O -release -inline -g -profile, and then gprof to generate the profile.
 map() should not allocate memory.
 
 Sometimes lazy code is faster and sometimes it's slower. A good compiler (like
 Haskell GHC) should be able to optimize map well.
That's what I'd assumed, which was why I was so disappointed that in this case performance was _very_ bad. I particularly don't understand the major GC hit. Or rather, I should say, I don't see why that GC hit should be there, unless it's that there is a fault in the implementation of iota, map and/or chain.
 Also try to use a bitvector from Phobos instead of bool[].
I did consider that, but I haven't tried it yet. Obviously it's superior from a storage point of view; I wasn't sure if it would be worse or equivalent performance-wise (I seem to recall reading that it's slightly slower than a bool[]). I'll try it. :-)
 Sometimes you can keep memory allocation low, performance almost acceptable,
and
 code easy to read using a pre-allocated scratch space plus using map() and
 copy() from std.algorithm.
Could you expand a little on that point?

I don't mean to ask you to write my code for me, it's just that (like most non-computer-science researchers) my knowledge of programming is somewhat specialized, and there are blank spots on the map.  So when in your reply to John you say, "Allocate a buffer somewhere, on the stack or heap", I'm still not sure exactly what you mean or how you'd use it.  It'd really help to see a concrete example (it doesn't need to be tailored to the use-case here).

In any case, thanks as ever for all the useful remarks and advice. :-)
Jul 30 2013
"bearophile" <bearophileHUGS lycos.com> writes:
Joseph Rushton Wakeling:

 No, gdmd -O -release -inline -g -profile, and then gprof to 
 generate the profile.
Then I suggest to try ldc2 too.
 That's what I'd assumed, which was why I was so disappointed 
 that in this case
 performance was _very_ bad.  I particularly don't understand 
 the major GC hit.

 Or rather, I should say, I don't see why that GC hit should be 
 there, unless
 it's that there is a fault in the implementation of iota, map 
 and/or chain.
My suggestion is to copy the implementations of those ranges into a test module, and try to understand the causes of the GC activity you see.
 Could you expand a little on that point?

 I don't mean to ask you to write my code for me, it's just that 
 (like most
 non-computer-science researchers) my knowledge of programming 
 is somewhat
 specialized, and there are blank spots on the map.  So when in 
 your reply to
 John you say, "Allocate a buffer somewhere, on the stack or 
 heap", I'm still not
 sure exactly what you mean or how you'd use it.  It'd really 
 help to see a
 concrete example (it doesn't need to be tailored to the 
 use-case here).
If this resulting array is used only for a short period of time:

    return sequence.map!(x => x.func).array;

then perhaps you can perform an array allocation in a constructor, static constructor, or you could allocate a fixed sized array inside the body of the graph, etc. Let's say this array is named "buf".

Then with code like:

    auto left = sequence.map!(x => x.func).copy(buf[]);
    return buf[...];

left tells you how much is left, so you can slice the buf and return it. Somewhere you also have to save the length of the used buf. If the resulting array is used only for a moment, this will not cause troubles in a single thread program. The point is to remove the memory allocation caused by array().

An alternative design comes from Tango: instead of allocating the buffer somewhere, your function takes an optional buffer, and uses it unless it's too small for the purpose, otherwise it allocates with array() or something else.

Bye,
bearophile
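
P.S. Schematically, the Tango-style version is something like this (a sketch
only; the function and the x => x * 2 operation are invented for illustration):

    import std.algorithm : copy, map;
    import std.array : array;

    // use 'buf' if it's big enough, otherwise fall back to allocating
    size_t[] doubled(size_t[] values, size_t[] buf = null)
    {
        auto mapped = values.map!(x => x * 2);
        if (buf.length >= values.length)
        {
            auto left = mapped.copy(buf);
            return buf[0 .. buf.length - left.length];
        }
        return mapped.array;   // buffer absent or too small: allocate
    }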
Jul 31 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 07/31/2013 02:31 PM, bearophile wrote:
 Then I suggest to try ldc2 too.
I haven't profiled with ldc2 yet (I will) but the broad speed tendencies are similar. It's faster but not so much faster that I can assume the speed hit isn't still there.
 My suggestion is to copy the implementations of those ranges into a test
 module, and try to understand the causes of the GC activity you see.
I'll look into that. Thanks for the suggestion. :-)
 If this resulting array is used only for a short period of time:
 
 return sequence.map!(x => x.func).array;
 
 
 then perhaps you can perform an array allocation in a constructor, static
 constructor, or you could allocate a fixed sized array inside the body of the
 graph, etc. Let's say this array is named "buf".
 
 Then with code like:
 
 auto left = sequence.map!(x => x.func).copy(buf[]);
 return buf[...];
 
 left tells you how much is left, so you can slice the buf and return it.
 Somewhere you also have to save the length of the used buf.
 If the resulting array is used only for a moment, this will not cause troubles
 in a single thread program. The point is to remove the memory allocation caused
 by array().
 
 An alternative design comes from Tango: instead of allocating the buffer
 somewhere, your function takes an optional buffer, and uses it unless it's
 too small for the purpose, otherwise it allocates with array() or something
 else.
Oh, I see. Yes, that makes sense and could be useful.

Actually, something like this (but not done as elegantly as your example) was the reason behind the cached version of the graph. I did think of just returning a small buffer with the neighbours in, but I was concerned that if someone wanted to keep using the resulting array over a longer period of time, errors could creep in. So, I came up with the full cache, which is memory-heavy but safer, as any copies you take will remain valid so long as the graph itself is not altered.

But most likely the best compromise solution is as you suggest a buffer, with the instruction ".dup the result if you want it to be persistent".
Jul 31 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
Joseph Rushton Wakeling:

 But most likely the best compromise solution is as you
 suggest a buffer, with the instruction ".dup the result
 if you want it to be persistent".
In my library code I often do the opposite: I add a template boolean switch named "doCopy" that defaults to true. If it's true, the function copies the buffer. So if you don't know what you are doing, or you don't care about performance (and this happens often), it's safer.

When you know what you are doing, or your profiling (or your experience) tells you you want more performance, you call the function with a false doCopy, and it doesn't allocate a new buffer.

This is the design I suggested for File.byLine(), but Andrei preferred a simpler design that doesn't create a new buffer for each line.

Bye,
bearophile
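
P.S. In code the idea is roughly this (a sketch, not the actual library code;
the function and its operation are invented):

    import std.algorithm : copy, map;

    // doCopy = true (the default) returns a fresh copy and is always safe;
    // doCopy = false returns a slice of the shared buffer and skips the allocation
    size_t[] doubled(bool doCopy = true)(size_t[] values, size_t[] buf)
    {
        assert(buf.length >= values.length);
        auto left = values.map!(x => x * 2).copy(buf);
        auto result = buf[0 .. buf.length - left.length];
        static if (doCopy)
            return result.dup;
        else
            return result;
    }

    // doubled(values, buf) allocates a copy; doubled!false(values, buf) doesn't.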
Jul 31 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 07/31/2013 02:31 PM, bearophile wrote:
 If this resulting array is used only for a short period of time:
 
 return sequence.map!(x => x.func).array;
Just to compare, I tried rewriting the caching version of the neighbours() function to use this kind of sequence. Here's the original code:
https://github.com/WebDrake/Dgraph/blob/cache/dgraph/graph.d#L358-L379

... and here's the rewritten version:

    auto neighbours(immutable size_t v)
    {
        if (!_cacheNeighbours[v])
        {
            iota(_sumTail[v], _sumTail[v + 1])
                .map!(a => _head[_indexTail[a]])
                .copy(_neighbours[_sumTail[v] + _sumHead[v] .. _sumHead[v] + _sumTail[v + 1]]);

            iota(_sumHead[v], _sumHead[v + 1])
                .map!(a => _tail[_indexHead[a]])
                .copy(_neighbours[_sumHead[v] + _sumTail[v + 1] .. _sumTail[v + 1] + _sumHead[v + 1]]);

            _cacheNeighbours[v] = true;
        }

        return _neighbours[_sumTail[v] + _sumHead[v] .. _sumTail[v + 1] + _sumHead[v + 1]];
    }

Although it's faster than the non-cached map-based version, there's still a substantial performance hit from a large number of small allocations. So, your critiques of component programming certainly have merit on the performance side.

What's weird is that even though the graph is always passed around by ref, meaning that the neighbours are calculated only once, according to callgrind the number of calls to _d_allocmemory still amounts to no. of vertices * no. of trial runs. And yes, I put in some checks to make sure that the loop was only being called the correct number of times.
Aug 01 2013
"bearophile" <bearophileHUGS lycos.com> writes:
Joseph Rushton Wakeling:

 So, your critiques of component programming certainly
 have merit on the performance side.
I have not critiqued linear flow programming itself; I love it, and perhaps I was the one who introduced it in D. In the main D newsgroup I have just said that in D it still causes some performance loss, even using ldc2.

Bye,
bearophile
Aug 01 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 08/01/2013 01:56 PM, bearophile wrote:
 I have not critiqued linear flow programming itself; I love it, and perhaps I
 was the one who introduced it in D. In the main D newsgroup I have just said
 that in D it still causes some performance loss, even using ldc2.
That's what I meant -- your remarks on the (current) performance costs. Wasn't meaning to imply you'd critiqued the idea per se!
Aug 01 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 07/31/2013 01:02 AM, bearophile wrote:
 Also try to use a bitvector from Phobos instead of bool[].
I take it you mean std.bitmanip.BitArray ... ?

I ran into some interesting problems which are probably down to incomplete implementation: e.g.

    myBitArray.length += n;

... results in a compiler error message: myBitArray.length() is not an lvalue

BitArrays are also not sliceable, it seems. I'm guessing that's simply failure to implement certain interfaces, rather than an unavoidable problem ... ?
Jul 31 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
Joseph Rushton Wakeling:

 I take it you mean std.bitmanip.BitArray ... ?
Right.
 I ran into some interesting problems which are probably down to 
 incomplete
 implementation: e.g.

     myBitArray.length += n;

 ... results in a compiler error message: myBitArray.length() is 
 not an lvalue
This is just a missing feature. Someone should implement it. std.bitmanip.BitArray lacks several small features.
 BitArrays are also not sliceable, it seems.
This is a bit more complex to implement in BitArray. One way to implement it is to copy all the sliced data, with no alignment, and create a new BitArray. But this is semantically different from regular array slicing (which doesn't copy data).

If you don't want to copy data then you have to add to BitArray a field (one ubyte or one uint; if you use a ubyte it needs a static if for static safety) that represents the 0..(size_t.sizeof * 8) offset. But this could slow down single-bit access a little.

Bye,
bearophile
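
P.S. Just to make the non-copying idea concrete (a sketch only, not Phobos
code; the names are invented):

    import std.bitmanip : BitArray;

    // a non-copying "slice": a reference to the parent plus a bit offset,
    // at the cost of an extra addition on every single-bit access
    struct BitSlice
    {
        BitArray* parent;
        size_t offset;
        size_t length;

        bool opIndex(size_t i)
        {
            assert(i < length);
            return (*parent)[offset + i];
        }
    }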
Jul 31 2013