
digitalmars.D.learn - popFront causing more memory to be used

reply "ixid" <nuaccount gmail.com> writes:
I'm not sure what's going on here. Using this code to solve a 
reddit programming challenge to return the Nth term of a 
Fibonacci-like sequence of arbitrary length:

     module main;
     import std.stdio, std.array, std.datetime;

     ulong f(int k, int n) {
         ulong[] nums = new ulong[k];
         nums[$ - 1] = 1;
         ulong total = 1;
         foreach(i;k..n + 1) {
             nums ~= total % 10^^8;
             total += nums[$ - 1] - nums[$ - k - 1];
             nums.popFront(); //The optional, trouble line
         }
         return nums[$ - 1];
     }

     void main() {
         StopWatch sw;
         sw.start;
         f(100, 10_000).writeln();
         f(3^^13, 5^^10).writeln;
         sw.peek.msecs.writeln("msecs");
     }

I assumed using popFront to keep the memory use in check would be 
the sensible thing to do but memory used goes from 170MB to 250MB 
when adding popFront. What's going on and how should I have built 
this data structure?
Jul 02 2012
next sibling parent "Era Scarecrow" <rtcvb32 yahoo.com> writes:
On Tuesday, 3 July 2012 at 02:20:23 UTC, ixid wrote:
 I'm not sure what's going on here. Using this code to solve a 
 reddit programming challenge to return the Nth term of a 
 Fibonacci-like sequence of arbitrary length:

     module main;
     import std.stdio, std.array, std.datetime;

     ulong f(int k, int n) {
         ulong[] nums = new ulong[k];
         nums[$ - 1] = 1;
         ulong total = 1;
         foreach(i;k..n + 1) {
             nums ~= total % 10^^8;
             total += nums[$ - 1] - nums[$ - k - 1];
             nums.popFront(); //The optional, trouble line
         }
         return nums[$ - 1];
     }

     void main() {
         StopWatch sw;
         sw.start;
         f(100, 10_000).writeln();
         f(3^^13, 5^^10).writeln;
         sw.peek.msecs.writeln("msecs");
     }

 I assumed using popFront to keep the memory use in check would 
 be the sensible thing to do but memory used goes from 170MB to 
 250MB when adding popFront. What's going on and how should I 
 have built this data structure?

Depending on the size, it definitely would happen anyways. Correct me if I'm wrong, but once you use popFront, you effectively modify the source to contain a slice of its original data. If you want to extend that data, the slice can't be expanded safely since it doesn't own the whole memory.

Try this (tested, gives 80Mb and 400ms faster). Replace:

     ulong[] nums = new ulong[k];

with:

     ulong[] nums;
     nums.reserve(k + n + 1);
     nums.length = k;
Jul 02 2012
prev sibling next sibling parent "ixid" <nuaccount gmail.com> writes:
That fixed the memory behaviour and is faster. Not using popFront 
at all uses the same memory and is faster still. Given popFront 
is advancing a range does this mean the underlying array is not 
being deleted? How would one delete the information you don't 
need any more if so?
Jul 03 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
ixid:

 Given popFront is advancing a range does this mean the 
 underlying array is not being deleted? How would one delete the 
 information you don't need any more if so?

I think the GC can't deallocate part of an array. It's a single memory zone, and it has to go all at once.

Maybe you need a fitter data structure, like a deque, similar to Python's collections.deque:
http://docs.python.org/library/collections.html#collections.deque

Or the equivalent C++ data structure:
http://www.cplusplus.com/reference/stl/deque/

Bye,
bearophile
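[For illustration, the deque idea can be approximated with a fixed-size circular buffer: only the last k terms are kept, so memory stays at O(k) no matter how large n is. This is a sketch in C, not code from the thread; the name hfib_ring is invented for the example.]

     #include <assert.h>
     #include <stdint.h>
     #include <stdio.h>
     #include <stdlib.h>

     /* Same sequence as f(), but only the last k terms are kept, in a
        ring of size k indexed with i % k. Illustrative sketch only. */
     uint64_t hfib_ring(uint32_t k, uint32_t n) {
         uint64_t *ring = calloc(k, sizeof *ring); /* a_0..a_{k-2} = 0 */
         if (ring == NULL)
             exit(1);
         ring[k - 1] = 1;                       /* a_{k-1} = 1 */
         uint64_t total = 1;                    /* sum of the k terms in the window */
         uint64_t result = (n == k - 1) ? 1 : 0;
         for (uint32_t i = k; i <= n; i++) {
             uint32_t pos = i % k;              /* slot of a_{i-k}, leaving the window */
             uint64_t next = total % 100000000u; /* a_i */
             total += next - ring[pos];         /* slide the window forward by one */
             ring[pos] = next;
             result = next;
         }
         free(ring);
         return result;
     }

     int main(void) {
         printf("%llu\n", (unsigned long long)hfib_ring(2, 10)); /* k=2 is plain Fibonacci: 55 */
         printf("%llu\n", (unsigned long long)hfib_ring(3, 7));  /* tribonacci-like: 13 */
         return 0;
     }

[With k=2 this reduces to ordinary Fibonacci, which makes the window logic easy to check by hand.]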
Jul 03 2012
prev sibling next sibling parent "ixid" <nuaccount gmail.com> writes:
It's a pity that the D standard library seems to lack rather a 
lot of these things. Permutation functions are also an annoying 
one not to have. =/
Jul 03 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
ixid:
 It's a pity that the D standard library seems to lack rather a 
 lot of these things.

I've taken a better look at your code; for this program a deque wasn't so needed. This is quick:

     import std.stdio;

     ulong hFib(in uint k, in uint n) pure nothrow {
         auto nums = new ulong[k + n + 1];
         nums[k - 1] = 1;
         ulong total = 1;
         foreach (i; k .. n + 1) {
             nums[i] = total % (10 ^^ 8);
             total += nums[i] - nums[i - k];
         }
         return nums[n];
     }

     void main() {
         hFib(100, 10_000).writeln();
         hFib(3 ^^ 13, 5 ^^ 10).writeln();
     }

In C:

     #include <stdio.h>
     #include <stdlib.h>
     #include <stdint.h>

     uint64_t hFib(const int k, const int n) {
         uint64_t *nums = calloc(k + n + 1, sizeof(uint64_t));
         if (nums == NULL)
             return 0;
         nums[k - 1] = 1;
         uint64_t total = 1;
         for (int i = k; i < n + 1; i++) {
             nums[i] = total % 100000000LLU;
             total += nums[i] - nums[i - k];
         }
         return nums[n];
     }

     int main() {
         printf("%llu\n", hFib(100, 10000));
         printf("%llu\n", hFib(1594323, 9765625));
         return 0;
     }

DMD 32 bit:

     L5A: mov EDX,020h[ESP]
          mov EAX,01Ch[ESP]
          mov EBX,05F5E100h
          push ESI
          xor ECX,ECX
          push EDI
          call near ptr __ULDIV
          pop EDI
          pop ESI
          mov EDX,018h[ESP]
          mov EAX,ESI
          mov [ESI*8][EDX],EBX
          sub EAX,EDI
          mov 4[ESI*8][EDX],ECX
          sub EBX,[EAX*8][EDX]
          sbb ECX,4[EAX*8][EDX]
          add 01Ch[ESP],EBX
          adc 020h[ESP],ECX
          inc ESI
          cmp ESI,EBP
          jb L5A

GCC 4.7.1:

     L6: movl %esi, (%esp)
         movl %edi, 4(%esp)
         movl $100000000, 8(%esp)
         movl $0, 12(%esp)
         call ___umoddi3
         movl %eax, (%ebx,%ebp,8)
         movl %edx, 4(%ebx,%ebp,8)
         subl (%ebx), %eax
         sbbl 4(%ebx), %edx
         addl %eax, %esi
         adcl %edx, %edi
         addl $8, %ebx
         cmpl 24(%esp), %ebx
         jne L6

LLVM:

     .LBB0_2: # =>This Inner Loop Header: Depth=1
         movl %esi, 4(%esp)
         movl %edi, (%esp)
         movl $0, 12(%esp)
         movl $100000000, 8(%esp) # imm = 0x5F5E100
         calll __umoddi3
         movl 24(%esp), %ecx # 4-byte Reload
         movl %eax, (%ecx,%ebp,8)
         movl %edx, 4(%ecx,%ebp,8)
         addl %edi, %eax
         adcl %esi, %edx
         subl -800(%ecx,%ebp,8), %eax
         sbbl -796(%ecx,%ebp,8), %edx
         addl $1, %ebp
         adcl $0, %ebx
         cmpl $10001, %ebp # imm = 0x2711
         movl %eax, %edi
         movl %edx, %esi
         jne .LBB0_2

Run-time: D 0.75 seconds (dmd 2.060alpha 32 bit, -O -release -inline), C 0.40 seconds (gcc 4.7.1 32 bit, -Ofast -flto -S -std=c99).

 Permutation functions are also an annoying one not to have. =/

Try this:
http://rosettacode.org/wiki/Permutations#Faster_Lazy_Version

Bye,
bearophile
Jul 03 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, July 03, 2012 15:29:51 bearophile wrote:
 ixid:
 Given popFront is advancing a range does this mean the
 underlying array is not being deleted? How would one delete the
 information you don't need any more if so?

 I think the GC can't deallocate part of an array. It's a single
 memory zone, and it has to go all at once.

http://dlang.org/d-array-article.html - Jonathan M Davis
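[The limitation is easy to see in C terms: you can't free part of an allocation. A popped-front slice is just a shifted view into one block, and free() only accepts the original pointer. Illustrative sketch, not code from the thread:]

     #include <assert.h>
     #include <stdint.h>
     #include <stdio.h>
     #include <stdlib.h>

     int main(void) {
         /* A D slice after popFront is like base + 1 here: a view into
            the middle of a single allocation. There is no way to release
            only the skipped prefix. */
         uint64_t *base = malloc(8 * sizeof *base);
         if (base == NULL)
             return 1;
         for (int i = 0; i < 8; i++)
             base[i] = i;
         uint64_t *slice = base + 1;   /* "popFront": same block, shifted view */
         assert(slice[0] == 1);
         /* free(slice);  -- undefined behavior: not the pointer malloc returned */
         free(base);                   /* the whole block goes at once */
         puts("ok");
         return 0;
     }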
Jul 03 2012
prev sibling next sibling parent "ixid" <nuaccount gmail.com> writes:
Thank you, that was faster. I decided to try the obvious method 
of reusing the same array giving similar speed to your suggestion 
while using less memory:

     ulong f2(int k, int n) {
         auto nums = new ulong[k + 1];
         nums[$ - 1] = 1;
         int iter = k;
         ulong total = 0;
         foreach(i;k..n + 1) {
             int iter_next = iter + 1 > k? 0 : iter + 1;
             total += nums[iter] - nums[iter_next];
             nums[iter_next] = total % 10^^8;
             iter = iter_next;
         }

         return nums[iter];
     }
Jul 03 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
ixid:

             int iter_next = iter + 1 > k? 0 : iter + 1;

This is becoming a "fixed size circular queue". But maybe a modulus is faster than a branch here. (It's better when k is always a power of two: then you don't need a modulus at all. And even better if your size is a multiple of the page size.)
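[The power-of-two remark can be sketched like this: when the capacity is a power of two, the wrap-around becomes a single AND with cap - 1, with no branch and no division. The Ring type and names below are invented for the example, not from the thread.]

     #include <assert.h>
     #include <stdint.h>
     #include <stdio.h>

     /* Fixed-size circular queue with power-of-two capacity: the index
        wrap is a mask (idx & (CAP - 1)) instead of a branch or a
        modulus. Illustrative sketch only. */
     #define CAP 8u  /* must be a power of two */

     typedef struct {
         uint64_t buf[CAP];
         uint32_t head;  /* index of the oldest element */
         uint32_t len;
     } Ring;

     static void ring_push(Ring *r, uint64_t v) {   /* caller keeps len < CAP */
         r->buf[(r->head + r->len) & (CAP - 1)] = v;
         r->len++;
     }

     static uint64_t ring_pop(Ring *r) {            /* the popFront analogue */
         uint64_t v = r->buf[r->head];
         r->head = (r->head + 1) & (CAP - 1);       /* branch-free wrap */
         r->len--;
         return v;
     }

     int main(void) {
         Ring r = { {0}, 0, 0 };
         for (uint64_t i = 1; i <= 20; i++) {       /* more items than CAP: */
             ring_push(&r, i);                      /* head wraps repeatedly, */
             assert(ring_pop(&r) == i);             /* FIFO order is preserved */
         }
         puts("ok");
         return 0;
     }

[Unlike popFront on a D array, the storage here is reused in place, so no reallocation or GC pressure ever occurs.]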
             nums[iter_next] = total % 10^^8;

In such cases I suggest adding (), so the reader doesn't have to remember the precedence of uncommon operators.

Bye,
bearophile
Jul 03 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
 This is becoming a "fixed size circular queue". But maybe a 
 modulus is faster than a branch here.

I have tried, and it's slower both in D and C. I don't know why.

     #include <stdio.h>
     #include <stdlib.h>
     #include <stdint.h>

     uint64_t hFib(const uint32_t k, const uint32_t n) {
         uint64_t *nums = calloc(k + 1, sizeof(uint64_t));
         if (nums == NULL)
             exit(1);
         uint32_t iter = k;
         nums[k] = 1;
         uint64_t total = 1;
         for (uint32_t i = k; i < n + 1; i++) {
             const uint32_t iter_next = (iter + 1 > k) ? 0 : (iter + 1);
             total += nums[iter] - nums[iter_next];
             nums[iter_next] = total % 100000000LLU;
             iter = iter_next;
         }
         return nums[iter];
     }

     int main() {
         printf("%llu\n", hFib(100, 10000));
         printf("%llu\n", hFib(1594323, 9765625));
         return 0;
     }

Runtime of this C version: about 240 ms.

Bye,
bearophile
Jul 03 2012
prev sibling next sibling parent "ixid" <nuaccount gmail.com> writes:
I tested that, modulus is slower. The compiler is surely 
converting it to something branchless like:

     uint iter_next = (iter + 1) * !(iter + 1 > k);

I take your point but I think most people know that the assignment 
operators have the lowest precedence.
Jul 03 2012
prev sibling next sibling parent "ixid" <nuaccount gmail.com> writes:
In any case, with large values of k the branch prediction will be 
right almost all of the time, which explains why this form is 
faster than modulo: modulo is fairly slow, while this is a 
correctly predicted branch doing an addition, if the compiler 
doesn't make it branchless. The branchless version gives the same 
time result as the branched one; is there a way to force that line 
not to be optimized, to compare against the predicted version?
Jul 03 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
ixid:

 I take your point but I think most people know that the equals 
 operators have the lowest associativity.

Sorry, I meant:

     nums[iter_next] = total % (10 ^^ 8);

Instead of:

     nums[iter_next] = total % 10^^8;

But I presume lots of people know that powers have higher precedence :-)

Bye,
bearophile
Jul 03 2012
prev sibling next sibling parent "ixid" <nuaccount gmail.com> writes:
Oops! I have a bad habit of thinking of the power operator as a 
part of the value rather than as an operator.
Jul 03 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
ixid:

 In any case with large values of k the branch prediction will 
 be right almost all of the time, explaining why this form is 
 faster than modulo as modulo is fairly slow while this is a 
 correctly predicted branch doing an addition if it doesn't make 
 it branchless.

That seems the explanation.
 The branchless version gives the same time result as branched, 
 is there a way to force that line not to optimized to compare 
 the predicted version?

I don't fully understand the question. Do you mean annotations like the __builtin_expect of GCC?

Bye,
bearophile
Jul 03 2012
prev sibling parent "ixid" <nuaccount gmail.com> writes:
On Tuesday, 3 July 2012 at 17:25:18 UTC, bearophile wrote:
 ixid:

 In any case with large values of k the branch prediction will 
 be right almost all of the time, explaining why this form is 
 faster than modulo as modulo is fairly slow while this is a 
 correctly predicted branch doing an addition if it doesn't 
 make it branchless.

That seems the explanation.
 The branchless version gives the same time result as branched, 
 is there a way to force that line not to optimized to compare 
 the predicted version?

I don't fully understand the question. Do you mean annotations like the __builtin_expect of GCC?

Bye,
bearophile

If

     uint iter_next = iter + 1 > k? 0 : iter + 1;

is getting optimized by the compiler to something like

     uint iter_next = (iter + 1) * !(iter + 1 > k);

then it would be nice to be able to test the branched code without the rest of the program losing its speed optimizations, because, as I said, for large k the branch will almost always be correctly predicted, making me think it'd be faster than the branchless version.
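[At minimum the two forms can be checked for agreement; whether the compiler actually emits a branch still has to be verified in the assembly (e.g. gcc -S or objdump -d). A small sketch, not from the thread:]

     #include <assert.h>
     #include <stdint.h>
     #include <stdio.h>

     /* The branched wrap and the branchless rewrite compute the same
        value for every index in range; this only checks equivalence,
        it cannot tell which form the compiler emitted. */
     int main(void) {
         const uint32_t k = 100;
         for (uint32_t iter = 0; iter <= k; iter++) {
             uint32_t branched   = (iter + 1 > k) ? 0 : (iter + 1);
             uint32_t branchless = (iter + 1) * !(iter + 1 > k);
             assert(branched == branchless);
         }
         puts("ok");
         return 0;
     }

[With GCC, an empty inline-asm barrier such as __asm__ __volatile__("" : "+r"(x)) is a common way to keep a value from being optimized through, though which version is actually faster still needs a look at the emitted code.]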
Jul 03 2012