digitalmars.D.learn - popFront causing more memory to be used
- ixid (27/27) Jul 02 2012 I'm not sure what's going on here. Using this code to solve a
- Era Scarecrow (11/38) Jul 02 2012 Depending on the size, it definitely would happen anyways.
- ixid (5/5) Jul 03 2012 That fixed the memory behaviour and is faster. Not using popFront
- bearophile (10/13) Jul 03 2012 I think the GC can't deallocate part of an array. It's a single
- ixid (3/3) Jul 03 2012 It's a pity that the D standard library seems to lack rather a
- bearophile (106/109) Jul 03 2012 I've taken a better look at your code, for this program a deque
- ixid (16/16) Jul 03 2012 Thank you, that was faster. I decided to try the obvious method
- bearophile (10/12) Jul 03 2012 This is becoming a "fixed size circular queue". But maybe a
- bearophile (28/30) Jul 03 2012 I have tried, and it's slower both in D and C. I don't know why.
- ixid (5/5) Jul 03 2012 I tested that, modulus is slower. The compiler is surely
- ixid (7/7) Jul 03 2012 In any case with large values of k the branch prediction will be
- bearophile (6/14) Jul 03 2012 I don't fully understand the question. Do you mean annotations
- ixid (10/24) Jul 03 2012 If
- bearophile (9/11) Jul 03 2012 Sorry I meant:
- ixid (2/2) Jul 03 2012 Oops! I have a bad habit of thinking of the power operator as a
- Jonathan M Davis (3/10) Jul 03 2012 http://dlang.org/d-array-article.html
I'm not sure what's going on here. Using this code to solve a reddit programming challenge to return the Nth term of a Fibonacci-like sequence of arbitrary length:

    module main;
    import std.stdio, std.array, std.datetime;

    ulong f(int k, int n) {
        ulong[] nums = new ulong[k];
        nums[$ - 1] = 1;
        ulong total = 1;
        foreach (i; k .. n + 1) {
            nums ~= total % 10^^8;
            total += nums[$ - 1] - nums[$ - k - 1];
            nums.popFront(); // The optional, trouble line
        }
        return nums[$ - 1];
    }

    void main() {
        StopWatch sw;
        sw.start;
        f(100, 10_000).writeln();
        f(3^^13, 5^^10).writeln;
        sw.peek.msecs.writeln("msecs");
    }

I assumed using popFront to keep the memory use in check would be the sensible thing to do, but memory used goes from 170MB to 250MB when adding popFront. What's going on, and how should I have built this data structure?
Jul 02 2012
On Tuesday, 3 July 2012 at 02:20:23 UTC, ixid wrote:
> I assumed using popFront to keep the memory use in check would be the sensible thing to do, but memory used goes from 170MB to 250MB when adding popFront. What's going on, and how should I have built this data structure?

Depending on the size, it definitely would happen anyway. Correct me if I'm wrong, but once you use popFront, you effectively turn nums into a slice of its original data. If you then want to extend that slice, it can't be expanded in place safely, since it doesn't own the whole memory block, so each append can trigger a reallocation.

Try this (tested, uses 80MB and is 400ms faster). Replace:

    ulong[] nums = new ulong[k];

with:

    ulong[] nums;
    nums.reserve(k + n + 1);
    nums.length = k;
Jul 02 2012
That fixed the memory behaviour and is faster. Not using popFront at all uses the same memory and is faster still. Given popFront is advancing a range does this mean the underlying array is not being deleted? How would one delete the information you don't need any more if so?
Jul 03 2012
ixid:
> Given popFront is advancing a range does this mean the underlying array is not being deleted? How would one delete the information you don't need any more if so?

I think the GC can't deallocate part of an array. It's a single memory zone, and it has to go all at once. Maybe you need a fitter data structure, like a deque, like the Python collections.deque:
http://docs.python.org/library/collections.html#collections.deque
Or a similar data structure of C++:
http://www.cplusplus.com/reference/stl/deque/

Bye,
bearophile
Jul 03 2012
It's a pity that the D standard library seems to lack rather a lot of these things. Permutation functions are also an annoying one not to have. =/
Jul 03 2012
ixid:
> It's a pity that the D standard library seems to lack rather a lot of these things.

I've taken a better look at your code; for this program a deque wasn't so needed. This is quick:

    import std.stdio;

    ulong hFib(in uint k, in uint n) pure nothrow {
        auto nums = new ulong[k + n + 1];
        nums[k - 1] = 1;
        ulong total = 1;
        foreach (i; k .. n + 1) {
            nums[i] = total % (10 ^^ 8);
            total += nums[i] - nums[i - k];
        }
        return nums[n];
    }

    void main() {
        hFib(100, 10_000).writeln();
        hFib(3 ^^ 13, 5 ^^ 10).writeln();
    }

In C:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    uint64_t hFib(const int k, const int n) {
        uint64_t *nums = calloc(k + n + 1, sizeof(uint64_t));
        if (nums == NULL)
            return 0;
        nums[k - 1] = 1;
        uint64_t total = 1;
        for (int i = k; i < n + 1; i++) {
            nums[i] = total % 100000000LLU;
            total += nums[i] - nums[i - k];
        }
        return nums[n];
    }

    int main() {
        printf("%llu\n", hFib(100, 10000));
        printf("%llu\n", hFib(1594323, 9765625));
        return 0;
    }

DMD 32 bit:

    L5A:  mov EDX,020h[ESP]
          mov EAX,01Ch[ESP]
          mov EBX,05F5E100h
          push ESI
          xor ECX,ECX
          push EDI
          call near ptr __ULDIV
          pop EDI
          pop ESI
          mov EDX,018h[ESP]
          mov EAX,ESI
          mov [ESI*8][EDX],EBX
          sub EAX,EDI
          mov 4[ESI*8][EDX],ECX
          sub EBX,[EAX*8][EDX]
          sbb ECX,4[EAX*8][EDX]
          add 01Ch[ESP],EBX
          adc 020h[ESP],ECX
          inc ESI
          cmp ESI,EBP
          jb L5A

GCC 4.7.1:

    L6:   movl %esi, (%esp)
          movl %edi, 4(%esp)
          movl $100000000, 8(%esp)
          movl $0, 12(%esp)
          call ___umoddi3
          movl %eax, (%ebx,%ebp,8)
          movl %edx, 4(%ebx,%ebp,8)
          subl (%ebx), %eax
          sbbl 4(%ebx), %edx
          addl %eax, %esi
          adcl %edx, %edi
          addl $8, %ebx
          cmpl 24(%esp), %ebx
          jne L6

LLVM:

    .LBB0_2:                # Loop Header: Depth=1
          movl %esi, 4(%esp)
          movl %edi, (%esp)
          movl $0, 12(%esp)
          calll __umoddi3
          movl %eax, (%ecx,%ebp,8)
          movl %edx, 4(%ecx,%ebp,8)
          addl %edi, %eax
          adcl %esi, %edx
          subl -800(%ecx,%ebp,8), %eax
          sbbl -796(%ecx,%ebp,8), %edx
          addl $1, %ebp
          adcl $0, %ebx
          movl %eax, %edi
          movl %edx, %esi
          jne .LBB0_2

Run-time, D 0.75 seconds (dmd 2.060alpha 32 bit, -O -release -inline), C 0.40 seconds (gcc 4.7.1 32 bit, -Ofast -flto -S -std=c99).

> Permutation functions are also an annoying one not to have. =/

Try this:
http://rosettacode.org/wiki/Permutations#Faster_Lazy_Version

Bye,
bearophile
Jul 03 2012
Thank you, that was faster. I decided to try the obvious method of reusing the same array, giving similar speed to your suggestion while using less memory:

    ulong f2(int k, int n) {
        auto nums = new ulong[k + 1];
        nums[$ - 1] = 1;
        int iter = k;
        ulong total = 0;
        foreach (i; k .. n + 1) {
            int iter_next = iter + 1 > k ? 0 : iter + 1;
            total += nums[iter] - nums[iter_next];
            nums[iter_next] = total % 10^^8;
            iter = iter_next;
        }
        return nums[iter];
    }
Jul 03 2012
ixid:
> int iter_next = iter + 1 > k? 0 : iter + 1;

This is becoming a "fixed size circular queue". But maybe a modulus is faster than a branch here. (It's better when k is always a power of two: then you don't need a modulus. And even better if your size is a multiple of the page size.)

> nums[iter_next] = total % 10^^8;

In such cases I suggest adding (), to not force the person reading the code to remember the precedence of uncommon operators.

Bye,
bearophile
Jul 03 2012
> This is becoming a "fixed size circular queue". But maybe a modulus is faster than a branch here.

I have tried, and it's slower both in D and C. I don't know why.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    uint64_t hFib(const uint32_t k, const uint32_t n) {
        uint64_t *nums = calloc(k + 1, sizeof(uint64_t));
        if (nums == NULL)
            exit(1);
        uint32_t iter = k;
        nums[k] = 1;
        uint64_t total = 1;
        for (uint32_t i = k; i < n + 1; i++) {
            const uint32_t iter_next = (iter + 1 > k) ? 0 : (iter + 1);
            total += nums[iter] - nums[iter_next];
            nums[iter_next] = total % 100000000LLU;
            iter = iter_next;
        }
        return nums[iter];
    }

    int main() {
        printf("%llu\n", hFib(100, 10000));
        printf("%llu\n", hFib(1594323, 9765625));
        return 0;
    }

Runtime of this C version: about 240 ms.

Bye,
bearophile
Jul 03 2012
I tested that; modulus is slower. The compiler is surely converting it to something branchless like:

    uint iter_next = (iter + 1) * !(iter + 1 > k);

I take your point but I think most people know that the equals operators have the lowest associativity.
Jul 03 2012
In any case, with large values of k the branch prediction will be right almost all of the time, explaining why this form is faster than modulo: modulo is fairly slow, while this is a correctly predicted branch doing an addition, if the compiler doesn't make it branchless. The branchless version gives the same time result as the branched one. Is there a way to force that line not to be optimized, to compare against the predicted version?
Jul 03 2012
ixid:
> In any case with large values of k the branch prediction will be right almost all of the time, explaining why this form is faster than modulo as modulo is fairly slow while this is a correctly predicted branch doing an addition if it doesn't make it branchless.

That seems the explanation.

> The branchless version gives the same time result as branched, is there a way to force that line not to be optimized, to compare the predicted version?

I don't fully understand the question. Do you mean annotations like the __builtin_expect of GCC?

Bye,
bearophile
Jul 03 2012
On Tuesday, 3 July 2012 at 17:25:18 UTC, bearophile wrote:
> I don't fully understand the question. Do you mean annotations like the __builtin_expect of GCC?

If

    uint iter_next = iter + 1 > k? 0 : iter + 1;

is getting optimized to

    uint iter_next = (iter + 1) * !(iter + 1 > k);

or something like it by the compiler, then it would be nice to be able to test the branched code without having the rest of the program lose optimizations for speed, because, as I said, for large k branching will almost always be correctly predicted, making me think it'd be faster than the branchless version.
Jul 03 2012
ixid:
> I take your point but I think most people know that the equals operators have the lowest associativity.

Sorry, I meant:

    nums[iter_next] = total % (10 ^^ 8);

Instead of:

    nums[iter_next] = total % 10^^8;

But I presume a lot of people know that powers are higher precedence :-)

Bye,
bearophile
Jul 03 2012
Oops! I have a bad habit of thinking of the power operator as a part of the value rather than as an operator.
Jul 03 2012
On Tuesday, July 03, 2012 15:29:51 bearophile wrote:
> I think the GC can't deallocate part of an array. It's a single memory zone, and it has to go all at once.

http://dlang.org/d-array-article.html

- Jonathan M Davis
Jul 03 2012