
digitalmars.D.announce - Implementing and optimizing a simple graph metric

reply "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:
Hello all,

I thought I'd do a writeup of the process of implementing and 
optimizing one of the graph metrics in Dgraph, starting from a 
fairly straight copy of pseudo-code in a research paper all 
through the various incremental tweaks that improve performance.
http://braingam.es/2013/09/betweenness-centrality-in-dgraph/

I guess the optimizations made in this process are trivial for 
anyone experienced in D, but I hope the detailed description of 
what changes produce useful speedups might be useful for people 
new to the language.

{Enj,Destr}oy :-)

Best wishes,

     -- Joe
Sep 18 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Joseph Rushton Wakeling:
 http://braingam.es/2013/09/betweenness-centrality-in-dgraph/
Some small improvements:

T[] betweenness(Graph, T = double)(ref Graph g, bool[] ignore = null)
    if (isGraph!Graph && isFloatingPoint!T)
{
    enum T T0 = 0;
    enum T T1 = 1;

    auto centrality = minimallyInitializedArray!(typeof(return))(g.vertexCount);
    centrality[] = T0;
    auto stack = new size_t[g.vertexCount];
    auto sigma = minimallyInitializedArray!T(g.vertexCount);
    sigma[] = T0;
    auto delta = minimallyInitializedArray!T(g.vertexCount);
    delta[] = T0;
    auto d = minimallyInitializedArray!long(g.vertexCount);
    d[] = -1;
    auto q = VertexQueue(g.vertexCount);
    auto p = new Appender!(size_t[])[g.vertexCount];

    foreach (immutable s; 0 .. g.vertexCount)
    {
        if (ignore && ignore[s])
        {
            continue;
        }

        size_t stackLength = 0;
        assert(q.empty);
        sigma[s] = T1;
        d[s] = 0;
        q.push(s);

        while (!q.empty)
        {
            immutable size_t v = q.front;
            q.pop;
            stack[stackLength] = v;
            ++stackLength;
            foreach (immutable w; g.neighboursOut(v))
            {
                if (ignore && ignore[w])
                {
                    continue;
                }

                if (d[w] < 0)
                {
                    q.push(w);
                    d[w] = d[v] + 1;
                    assert(sigma[w] == T0);
                    sigma[w] = sigma[v];
                    p[w].clear;
                    p[w].put(v);
                }
                else if (d[w] > (d[v] + 1))
                {
                    /* w has already been encountered, but we've
                       found a shorter path to the source.  This
                       is probably only relevant to the weighted
                       case, but let's keep it here in order to
                       be ready for that update. */
                    d[w] = d[v] + 1L;
                    sigma[w] = sigma[v];
                    p[w].clear;
                    p[w].put(v);
                }
                else if (d[w] == (d[v] + 1))
                {
                    sigma[w] += sigma[v];
                    p[w].put(v);
                }
            }
        }

        while (stackLength > 0)
        {
            --stackLength;
            immutable w = stack[stackLength];
            foreach (immutable v; p[w].data)
            {
                delta[v] += ((sigma[v] / sigma[w]) * (T1 + delta[w]));
            }
            if (w != s)
            {
                centrality[w] += delta[w];
            }
            sigma[w] = T0;
            delta[w] = T0;
            d[w] = -1;
        }
    }

    static if (!g.directed)
    {
        centrality[] /= 2;
    }

    return centrality;
}

Just for a test, try to allocate all those arrays in a different way:

- First try a std.container.Array using the LDC2 compiler;

- Another thing to try is to allocate them on the stack using core.stdc.stdlib.alloca:

auto p = cast(T*)alloca(T.sizeof * g.vertexCount);
if (!p) throw new MemoryError...
auto centrality = p[0 .. g.vertexCount];

This probably can't be used for very large graphs, so you have to add even more logic to allocate the arrays on the C heap if they are too large. This could be useful if you call betweenness() many, many times.

- Try to optionally accept the buffers from outside.

Bye,
bearophile
Sep 18 2013
next sibling parent reply "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:
On Wednesday, 18 September 2013 at 13:39:29 UTC, bearophile wrote:
 Just for a test, try to allocate all those arrays in a 
 different way:
 - First try a std.container.Array using the LDC2 compiler;
 - Another thing to try is to allocate them on the stack using 
 core.stdc.stdlib.alloca:
 auto p = cast(T*)alloca(T.sizeof * g.vertexCount);
 if (!p) throw new MemoryError...
 auto centrality = p[0 .. g.vertexCount];
 This probably can't be used for very large graphs, so you have 
 to add even more logic to allocate the arrays on the C heap if 
 they are too large. This could be useful if you call 
 betweenness() many, many times.
 - Try to optionally accept the buffers from outside.
Nice suggestions, I'll try those out! :-)

I think I did give std.container.Array a trial when trying to speed up its performance, and I don't remember it making any difference (if anything it may have slowed things down). But I'll give it a second look and report back.

I haven't yet tried alloca or other manual memory management -- I felt a bit resistant to this as I'd prefer to keep the code simple and readable -- but I'll give that a go too, just to see how it goes.

It's interesting that you pick up on my use of to!T(0) etc. I had some concern about whether that would affect speed at all. At the time of originally writing the code, my impression was that not having it actually made things worse, and that the compiler was smart enough to carry out the conversion at compile time. However, my impression now is that having or not having it, or using any other method (e.g. enum T0, T1, etc.), makes absolutely no difference, and that I might as well be simple and write 0 for integral types, 0.0 for floating-point.
Sep 18 2013
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Joseph Rushton Wakeling:

 I haven't yet tried alloca or other manual memory management -- 
 I felt a bit resistant to this as I'd prefer to keep the code 
 simple and readable -- but I'll give that a go too just to see 
 how it goes.
I'd like some stack-allocated variable-length array in D+Phobos, as in C++14. It's also a basis to build several other stack-allocated data structures.
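For concreteness, the hybrid stack/heap pattern might look something like this inside betweenness() (a sketch only: the stackLimit cutoff is an arbitrary illustrative value, and g and T are the function's existing graph and element type):

```d
import core.stdc.stdlib : alloca, malloc, free;

// Sketch: stack for small workspaces, C heap for large ones.
// Note alloca must be called in the frame that uses the memory,
// so this logic has to live directly inside betweenness().
enum size_t stackLimit = 1024;  // illustrative cutoff, in elements

immutable n = g.vertexCount;
immutable onHeap = n > stackLimit;
auto buf = cast(T*)(onHeap ? malloc(T.sizeof * n)
                           : alloca(T.sizeof * n));
assert(buf !is null, "allocation failed");
scope (exit) if (onHeap) free(buf);  // the alloca branch frees itself

T[] sigma = buf[0 .. n];  // one of the scratch arrays
sigma[] = 0;
```

One caveat: the returned centrality array can't live on the stack (it would dangle on return), so the pattern really only fits the scratch arrays (sigma, delta, d, stack).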
 However, my impression now is that having or not having it, or 
 any other method (e.g. enum T0, T1, etc.) makes absolutely no 
 difference, and that I might as well be simple and write 0 for 
 integral-types, 0.0 for floating-point.
Right. And this could be true even if you want T=Rational.

Bye,
bearophile
Sep 18 2013
parent reply "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:
On Wednesday, 18 September 2013 at 15:22:51 UTC, bearophile wrote:
 Joseph Rushton Wakeling:

 I haven't yet tried alloca or other manual memory management 
 -- I felt a bit resistant to this as I'd prefer to keep the 
 code simple and readable -- but I'll give that a go too just 
 to see how it goes.
 I'd like some stack-allocated variable-length array in 
 D+Phobos, as in C++14. It's also a basis to build several other 
 stack-allocated data structures.
Yes, that'd be useful, although in the case of this code I think that stack vs. heap probably isn't that important. If the data is small enough for stack allocation, the calculation will be quick anyway.
 However, my impression now is that having or not having it, or 
 any other method (e.g. enum T0, T1, etc.) makes absolutely no 
 difference, and that I might as well be simple and write 0 for 
 integral-types, 0.0 for floating-point.
 Right. And this could be true even if you want T=Rational.
Actually, on re-testing this just now, I'm returning to my original view. I find that if you put in raw floating-point numbers like 0.0, 1.0 etc., the resulting code can be slower in the case where T = float. Putting to!T or using enums as you suggest appears to make no difference.

This is a guess rather than something confirmed by looking at assembly, but I'm presuming that to!T(0) and other conversions of compile-time constants are evaluated at compile time, so you don't in practice carry the cost you'd normally get from using to().
Sep 18 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Joseph Rushton Wakeling:

 in the case of this code I think that stack vs. heap probably 
 isn't that important.  If the data is small enough for stack 
 allocation, the calculation will be quick anyway.
How many times or how often do you need to call betweenness()? If it's called only a few times or once in a while, then using the GC is good enough. But if you have to call it many millions of times, all those GC array allocations could cost some time, even if they are small arrays. In this case allocating them on the stack (or not allocating them at all, using externally allocated buffers) helps.

Bye,
bearophile
Sep 18 2013
parent "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:
On Wednesday, 18 September 2013 at 17:13:28 UTC, bearophile wrote:
 How many times or how often do you need to call betweenness()? 
 If it's called only few times or once in a while then using the 
 GC is good enough. But if you have to call it many millions of 
 times, all those GC array allocations could cost some time, 
 even if they are small arrays. In this case allocating them on 
 the stack (or not allocating them, using externally allocated 
 buffers) helps.
I think millions of times is excessive. In my test code I have an example with a 50-node network calculating betweenness centrality 10000 times, which takes under 1.5 seconds. So obviously if you scale up to millions, small improvements could make a difference, but it's not going to be on such a scale that it's worth prioritizing, at least for now. I think even 10000 calculations is excessive for any practical case I can think of.

You're right to call for the opportunity to pass a buffer, though, and I've enabled that. In the current test code it doesn't seem to improve anything, but that may be simply a matter of scale.
Sep 19 2013
prev sibling parent "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:
On Wednesday, 18 September 2013 at 15:17:25 UTC, Joseph Rushton 
Wakeling wrote:
 I think I did give std.container.Array a trial when trying to speed 
 up its performance, and I don't remember it making any 
 difference (if anything it may have slowed things down).  But 
 I'll give it a second look and report back.
I tried rewriting the variable declarations:

Array!T centrality;
centrality.length = g.vertexCount;
centrality[] = to!T(0);

Array!size_t stack;
stack.length = g.vertexCount;

Array!T sigma;
sigma.length = g.vertexCount;

Array!T delta;
delta.length = g.vertexCount;

Array!long d;
d.length = g.vertexCount;

auto q = VertexQueue(g.vertexCount);
Appender!(size_t[])[] p = new Appender!(size_t[])[g.vertexCount];

It results in a moderate slowdown of the code, at a guess because it increases the total number of mallocs/deallocs. I note that there seems to be no way to initialize std.container.Array as having a certain length ... ?

I tried instead using std.array.uninitializedArray to initialize the arrays in the betweenness() function -- it was a bit difficult to judge here: with a relatively small number of calls to betweenness() it seems to have resulted in a speedup, but with very many calls, it seems to have been slower.
Sep 18 2013
prev sibling next sibling parent "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:
On Wednesday, 18 September 2013 at 13:39:29 UTC, bearophile wrote:
 - Try to optionally accept the buffers from outside.
Does this look good to you?

/////////////////////////////////////////////////
auto ref betweenness(T = double, Graph)(ref Graph g, bool[] ignore = null)
    if (isFloatingPoint!T && isGraph!Graph)
{
    T[] centrality = new T[g.vertexCount];
    return betweenness!(T, Graph)(g, centrality, ignore);
}

auto ref betweenness(T = double, Graph)(ref Graph g, ref T[] centrality,
                                        bool[] ignore = null)
{
    centrality.length = g.vertexCount;
    centrality[] = to!T(0);
    // ... the rest as before
}
/////////////////////////////////////////////////
Sep 18 2013
prev sibling parent reply "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:
On Wednesday, 18 September 2013 at 13:39:29 UTC, bearophile wrote:
     auto centrality = 
 minimallyInitializedArray!(typeof(return))(g.vertexCount);
     centrality[] = T0;
     auto stack = new size_t[g.vertexCount];
     auto sigma = minimallyInitializedArray!T(g.vertexCount);
     sigma[] = T0;
     auto delta = minimallyInitializedArray!T(g.vertexCount);
     delta[] = T0;
     auto d = minimallyInitializedArray!long(g.vertexCount);
     d[] = -1;
     auto q = VertexQueue(g.vertexCount);
     auto p = new Appender!(size_t[])[g.vertexCount];
As an experiment I tried something along these lines -- using uninitializedArray for most of the arrays here and minimallyInitializedArray for p. It seems to result in a small speed improvement, although I'm not certain about that -- certainly not a speed loss.

On the other hand, if inside the VertexQueue implementation I replace the "new" declaration with an uninitializedArray call, the code gets slower by a noticeable amount. Very odd -- any ideas why that might be?
Sep 24 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Joseph Rushton Wakeling:

 As an experiment I tried something along these lines -- using 
 uninitializedArray for most of the arrays here and 
 minimallyInitializedArray for p.
minimallyInitializedArray is not stupid: if the specified type has no indirections, it's equivalent to using uninitializedArray, but it's safer if you later change the type. So in general it's not a good idea to use uninitializedArray unless you have special needs. The two functions are not equivalent: one of them is for normal performance tuning, and the other is for special usages.
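A small example of the distinction (this matches the documented behaviour of std.array: the element type's .init value is written only when the type has indirections):

```d
import std.array : minimallyInitializedArray, uninitializedArray;

void main()
{
    // Element type with no indirections: the two calls are
    // equivalent -- both just allocate, contents are arbitrary.
    auto a = minimallyInitializedArray!(long[])(10);
    auto b = uninitializedArray!(long[])(10);

    // Element type *with* indirections: minimallyInitializedArray
    // sets each element to .init (null slices here), so the GC
    // never sees random bit patterns that look like live pointers.
    auto c = minimallyInitializedArray!(int[][])(10);
    assert(c[0] is null);

    // uninitializedArray leaves garbage here -- only safe if every
    // element is overwritten before any use or GC collection.
    auto d = uninitializedArray!(int[][])(10);
}
```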
 On the other hand, if inside the VertexQueue implementation, I 
 replace the "new" declaration with an uninitializedArray call, 
 the code gets slower by a noticeable amount.

 Very odd -- any ideas why that might be?
See above: use uninitializedArray only in special situations, when you know what you are doing. Here you do not know what you are doing, so use minimallyInitializedArray.

uninitializedArray leaves random bit patterns that look like pointers, creating spurious aliases that could increase the work done by the GC and cause (temporary, if you later initialize all your data) memory leaks. Try to totally disable the GC and time the two versions of the code, with and without uninitializedArray. If the GC is the cause of the speed differences and you disable it, you will see no performance difference any more.

Bye,
bearophile
Sep 24 2013
parent reply "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:
On Tuesday, 24 September 2013 at 22:14:30 UTC, bearophile wrote:
 minimallyInitializedArray is not stupid, if the specified type 
 has no indirections, it's equivalent to using 
 uninitializedArray, but it's safer if you later change the 
 type. So in general it's not a good idea to use 
 uninitializedArray, unless you have special needs. The two 
 functions are not equivalent, one of them is for normal 
 performance tuning, and the other is for special usages.
I have not found this -- using minimallyInitializedArray for the arrays of built-in types is slower than if I use uninitializedArray. These arrays have their values initialized immediately -- this is the actual code from inside the betweenness centrality function:

size_t[] stack = uninitializedArray!(size_t[])(g.vertexCount);
T[] sigma = uninitializedArray!(T[])(g.vertexCount);
T[] delta = uninitializedArray!(T[])(g.vertexCount);
long[] d = uninitializedArray!(long[])(g.vertexCount);
auto q = VertexQueue(g.vertexCount);
Appender!(size_t[])[] p = minimallyInitializedArray!(Appender!(size_t[])[])(g.vertexCount);

sigma[] = to!T(0);
delta[] = to!T(0);
d[] = -1L;

... so I don't see the safety problem here. uninitializedArray is used only for arrays of built-in types; minimallyInitializedArray is used otherwise.
 On the other hand, if inside the VertexQueue implementation, I 
 replace the "new" declaration with an uninitializedArray call, 
 the code gets slower by a noticeable amount.

 Very odd -- any ideas why that might be?
See above, use uninitializedArray only in special situations and when you know what you are doing. Here you do not know what you are doing, so use minimallyInitializedArray.
Whether I use uninitializedArray or minimallyInitializedArray inside the VertexQueue, either creates a significant slowdown.
 uninitializedArray creates random pointers, and aliases that 
 could increase the work done by the GC and cause (temporary if 
 you initialized all your data) memory leaks. Try to totally 
 disable the GC and time the two versions of the code, with and 
 without uninitializedArray. If the GC is the cause of speed 
 differences and you disable it, you will see no performance 
 difference any more.
You seem to be suggesting that using uninitializedArray could cause general slowdown, but in general it results in a speedup compared to minimallyInitializedArray. In the one case where it causes a slowdown, minimallyInitializedArray does too, and by a similar amount.
Sep 26 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Joseph Rushton Wakeling:

 I have not found this -- using minimallyInitializedArray for 
 the arrays of built-in types is slower than if I use 
 uninitializedArray.
Then minimallyInitializedArray should be improved :-)
 You seem to be suggesting that using uninitializedArray could 
 cause general slowdown,
I am saying that in general uninitializedArray is less safe.

Bye,
bearophile
Sep 26 2013
parent reply "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:
On Thursday, 26 September 2013 at 20:56:39 UTC, bearophile wrote:
 Joseph Rushton Wakeling:

 I have not found this -- using minimallyInitializedArray for 
 the arrays of built-in types is slower than if I use 
 uninitializedArray.
Then minimallyInitializedArray should be improved :-)
It's odd, because the two actually use the same underlying function, with minimallyInitializedArray potentially triggering this section:

else static if (minimallyInitialized && hasIndirections!E)
{
    ret[] = E.init;
}

... but if E is (say) an int, or a size_t, or any other built-in type, this should not be triggered -- right? So it's bizarre that there is a performance difference here.

Incidentally, this was why I felt safe using uninitializedArray for arrays of e.g. long, size_t, etc., because the only difference at the end of the function could be whether the values contained in the array have been set or not. As I do ensure that any array entry used also has its value set, it should be OK.
Sep 26 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Joseph Rushton Wakeling:

 So it's bizarre that there is a performance difference here.
Bizarre findings deserve more experiments and studies :-)
 Incidentally, this was why I felt safe using uninitializedArray 
 for arrays of e.g. long, size_t, etc., because the only 
 difference at the end of the function could be whether the 
 values contained in the array have been set or not.  As I do 
 ensure that any array entry used also has its value set, it 
 should be OK.
You also have arrays of T. Someday T could be something with indirections :-) So minimallyInitializedArray is safer regarding future changes in your code.

Bye,
bearophile
Sep 26 2013
parent reply "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:
On Thursday, 26 September 2013 at 21:29:42 UTC, bearophile wrote:
 You also have arrays of T. Someday T could be something with 
 indirections :-) So minimallyInitializedArray is safer 
 regarding future changes in your code.
T is qualified via isFloatingPoint :-)
Sep 26 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Joseph Rushton Wakeling:

 T is qualified via isFloatingPoint :-)
I know, but that qualification could change in future evolutions of your code. Strong type safety means that if you change a type in your code with a localized change (like removing isFloatingPoint at the top of your function), the whole code that depends on that type (like the body of this function of yours) will keep working as safely as before :-)

Bye,
bearophile
Sep 26 2013
parent "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:
On Thursday, 26 September 2013 at 22:03:12 UTC, bearophile wrote:
 Joseph Rushton Wakeling:

 T is qualified via isFloatingPoint :-)
 I know, but that qualification could change in future 
 evolutions of your code. Strong type safety means that if you 
 change a type in your code with a localized change (like 
 removing isFloatingPoint at the top of your function), the 
 whole code that depends on that type (like the body of this 
 function of yours) will keep working as safely as before :-)
OK, I accept your argument :-)

As things stand I'm inclined to leave the code using "new" for now. I'll see if I can work out anything about why minimallyInitializedArray might be problematic -- at a guess, perhaps it accidentally gets in the way of some potential optimizations?
Sep 27 2013