
digitalmars.D.learn - popFront causing more memory to be used

reply "ixid" <nuaccount gmail.com> writes:
I'm not sure what's going on here. Using this code to solve a 
reddit programming challenge to return the Nth term of a 
Fibonacci-like sequence of arbitrary length:

     module main;
     import std.stdio, std.array, std.datetime;

     ulong f(int k, int n) {
         ulong[] nums = new ulong[k];
         nums[$ - 1] = 1;
         ulong total = 1;
         foreach(i;k..n + 1) {
             nums ~= total % 10^^8;
             total += nums[$ - 1] - nums[$ - k - 1];
             nums.popFront(); //The optional, trouble line
         }
         return nums[$ - 1];
     }

     void main() {
         StopWatch sw;
         sw.start;
         f(100, 10_000).writeln();
         f(3^^13, 5^^10).writeln;
         sw.peek.msecs.writeln("msecs");
     }

I assumed using popFront to keep the memory use in check would be 
the sensible thing to do but memory used goes from 170MB to 250MB 
when adding popFront. What's going on and how should I have built 
this data structure?
Jul 02 2012
next sibling parent "Era Scarecrow" <rtcvb32 yahoo.com> writes:
On Tuesday, 3 July 2012 at 02:20:23 UTC, ixid wrote:
 I'm not sure what's going on here. Using this code to solve a 
 reddit programming challenge to return the Nth term of a 
 Fibonacci-like sequence of arbitrary length:

     module main;
     import std.stdio, std.array, std.datetime;

     ulong f(int k, int n) {
         ulong[] nums = new ulong[k];
         nums[$ - 1] = 1;
         ulong total = 1;
         foreach(i;k..n + 1) {
             nums ~= total % 10^^8;
             total += nums[$ - 1] - nums[$ - k - 1];
             nums.popFront(); //The optional, trouble line
         }
         return nums[$ - 1];
     }

     void main() {
         StopWatch sw;
         sw.start;
         f(100, 10_000).writeln();
         f(3^^13, 5^^10).writeln;
         sw.peek.msecs.writeln("msecs");
     }

 I assumed using popFront to keep the memory use in check would 
 be the sensible thing to do but memory used goes from 170MB to 
 250MB when adding popFront. What's going on and how should I 
 have built this data structure?

Depending on the size, it definitely would happen anyways. Correct me if I'm wrong, but once you use popFront, you effectively modify the source to contain a slice of its original data. If you want to extend that data, the slice can't be expanded safely since it doesn't own the whole memory.

Try this (tested, gives 80Mb and 400ms faster). Replace:

     ulong[] nums = new ulong[k];

with:

     ulong[] nums;
     nums.reserve(k + n + 1);
     nums.length = k;
Jul 02 2012
prev sibling next sibling parent "ixid" <nuaccount gmail.com> writes:
That fixed the memory behaviour and is faster. Not using popFront 
at all uses the same memory and is faster still. Given popFront 
is advancing a range does this mean the underlying array is not 
being deleted? How would one delete the information you don't 
need any more if so?
Jul 03 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
ixid:

 Given popFront is advancing a range does this mean the 
 underlying array is not being deleted? How would one delete the 
 information you don't need any more if so?

I think the GC can't deallocate part of an array. It's a single memory zone, and it has to go all at once.

Maybe you need a fitter data structure, like a deque, similar to Python's collections.deque:
http://docs.python.org/library/collections.html#collections.deque

Or the equivalent C++ data structure:
http://www.cplusplus.com/reference/stl/deque/

Bye,
bearophile
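[For illustration, the deque idea can be approximated with a fixed-size circular buffer: only the last k terms are kept, so memory stays at O(k) no matter how large n is. This is a sketch in C, not code from the thread; the name hfib_ring is invented for the example.]

     #include <assert.h>
     #include <stdint.h>
     #include <stdio.h>
     #include <stdlib.h>

     /* Same sequence as f(), but only the last k terms are kept, in a
        ring of size k indexed with i % k. Illustrative sketch only. */
     uint64_t hfib_ring(uint32_t k, uint32_t n) {
         uint64_t *ring = calloc(k, sizeof *ring); /* a_0..a_{k-2} = 0 */
         if (ring == NULL)
             exit(1);
         ring[k - 1] = 1;                       /* a_{k-1} = 1 */
         uint64_t total = 1;                    /* sum of the k terms in the window */
         uint64_t result = (n == k - 1) ? 1 : 0;
         for (uint32_t i = k; i <= n; i++) {
             uint32_t pos = i % k;              /* slot of a_{i-k}, leaving the window */
             uint64_t next = total % 100000000u; /* a_i */
             total += next - ring[pos];         /* slide the window forward by one */
             ring[pos] = next;
             result = next;
         }
         free(ring);
         return result;
     }

     int main(void) {
         printf("%llu\n", (unsigned long long)hfib_ring(2, 10)); /* k=2 is plain Fibonacci: 55 */
         printf("%llu\n", (unsigned long long)hfib_ring(3, 7));  /* tribonacci-like: 13 */
         return 0;
     }

[With k=2 this reduces to ordinary Fibonacci, which makes the window logic easy to check by hand.]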
Jul 03 2012
prev sibling next sibling parent "ixid" <nuaccount gmail.com> writes:
It's a pity that the D standard library seems to lack rather a 
lot of these things. Permutation functions are also an annoying 
one not to have. =/
Jul 03 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
ixid:
 It's a pity that the D standard library seems to lack rather a 
 lot of these things.

I've taken a better look at your code; for this program a deque wasn't so needed. This is quick:

     import std.stdio;

     ulong hFib(in uint k, in uint n) pure nothrow {
         auto nums = new ulong[k + n + 1];
         nums[k - 1] = 1;
         ulong total = 1;
         foreach (i; k .. n + 1) {
             nums[i] = total % (10 ^^ 8);
             total += nums[i] - nums[i - k];
         }
         return nums[n];
     }

     void main() {
         hFib(100, 10_000).writeln();
         hFib(3 ^^ 13, 5 ^^ 10).writeln();
     }

In C:

     #include <stdio.h>
     #include <stdlib.h>
     #include <stdint.h>

     uint64_t hFib(const int k, const int n) {
         uint64_t *nums = calloc(k + n + 1, sizeof(uint64_t));
         if (nums == NULL)
             return 0;
         nums[k - 1] = 1;
         uint64_t total = 1;
         for (int i = k; i < n + 1; i++) {
             nums[i] = total % 100000000LLU;
             total += nums[i] - nums[i - k];
         }
         return nums[n];
     }

     int main() {
         printf("%llu\n", hFib(100, 10000));
         printf("%llu\n", hFib(1594323, 9765625));
         return 0;
     }

DMD 32 bit:

     L5A: mov EDX,020h[ESP]
          mov EAX,01Ch[ESP]
          mov EBX,05F5E100h
          push ESI
          xor ECX,ECX
          push EDI
          call near ptr __ULDIV
          pop EDI
          pop ESI
          mov EDX,018h[ESP]
          mov EAX,ESI
          mov [ESI*8][EDX],EBX
          sub EAX,EDI
          mov 4[ESI*8][EDX],ECX
          sub EBX,[EAX*8][EDX]
          sbb ECX,4[EAX*8][EDX]
          add 01Ch[ESP],EBX
          adc 020h[ESP],ECX
          inc ESI
          cmp ESI,EBP
          jb L5A

GCC 4.7.1:

     L6: movl %esi, (%esp)
         movl %edi, 4(%esp)
         movl $100000000, 8(%esp)
         movl $0, 12(%esp)
         call ___umoddi3
         movl %eax, (%ebx,%ebp,8)
         movl %edx, 4(%ebx,%ebp,8)
         subl (%ebx), %eax
         sbbl 4(%ebx), %edx
         addl %eax, %esi
         adcl %edx, %edi
         addl $8, %ebx
         cmpl 24(%esp), %ebx
         jne L6

LLVM:

     .LBB0_2: # =>This Inner Loop Header: Depth=1
         movl %esi, 4(%esp)
         movl %edi, (%esp)
         movl $0, 12(%esp)
         movl $100000000, 8(%esp) # imm = 0x5F5E100
         calll __umoddi3
         movl 24(%esp), %ecx # 4-byte Reload
         movl %eax, (%ecx,%ebp,8)
         movl %edx, 4(%ecx,%ebp,8)
         addl %edi, %eax
         adcl %esi, %edx
         subl -800(%ecx,%ebp,8), %eax
         sbbl -796(%ecx,%ebp,8), %edx
         addl $1, %ebp
         adcl $0, %ebx
         cmpl $10001, %ebp # imm = 0x2711
         movl %eax, %edi
         movl %edx, %esi
         jne .LBB0_2

Run-time: D 0.75 seconds (dmd 2.060alpha 32 bit, -O -release -inline), C 0.40 seconds (gcc 4.7.1 32 bit, -Ofast -flto -S -std=c99).

 Permutation functions are also an annoying one not to have. =/

Try this:
http://rosettacode.org/wiki/Permutations#Faster_Lazy_Version

Bye,
bearophile
Jul 03 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, July 03, 2012 15:29:51 bearophile wrote:
 ixid:
 Given popFront is advancing a range does this mean the
 underlying array is not being deleted? How would one delete the
 information you don't need any more if so?

 I think the GC can't deallocate part of an array. It's a single
 memory zone, and it has to go all at once.

http://dlang.org/d-array-article.html - Jonathan M Davis
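[The limitation is easy to see in C terms: you can't free part of an allocation. A popped-front slice is just a shifted view into one block, and free() only accepts the original pointer. Illustrative sketch, not code from the thread:]

     #include <assert.h>
     #include <stdint.h>
     #include <stdio.h>
     #include <stdlib.h>

     int main(void) {
         /* A D slice after popFront is like base + 1 here: a view into
            the middle of a single allocation. There is no way to release
            only the skipped prefix. */
         uint64_t *base = malloc(8 * sizeof *base);
         if (base == NULL)
             return 1;
         for (int i = 0; i < 8; i++)
             base[i] = i;
         uint64_t *slice = base + 1;   /* "popFront": same block, shifted view */
         assert(slice[0] == 1);
         /* free(slice);  -- undefined behavior: not the pointer malloc returned */
         free(base);                   /* the whole block goes at once */
         puts("ok");
         return 0;
     }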
Jul 03 2012
prev sibling next sibling parent "ixid" <nuaccount gmail.com> writes:
Thank you, that was faster. I decided to try the obvious method 
of reusing the same array giving similar speed to your suggestion 
while using less memory:

     ulong f2(int k, int n) {
         auto nums = new ulong[k + 1];
         nums[$ - 1] = 1;
         int iter = k;
         ulong total = 0;
         foreach(i;k..n + 1) {
             int iter_next = iter + 1 > k? 0 : iter + 1;
             total += nums[iter] - nums[iter_next];
             nums[iter_next] = total % 10^^8;
             iter = iter_next;
         }

         return nums[iter];
     }
Jul 03 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
ixid:

             int iter_next = iter + 1 > k? 0 : iter + 1;

This is becoming a "fixed size circular queue". But maybe a modulus is faster than a branch here. (It's better when k is always a power of two: then you don't need a modulus at all. And even better if your size is a multiple of the page size.)
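[The power-of-two remark can be sketched like this: when the capacity is a power of two, the wrap-around becomes a single AND with cap - 1, with no branch and no division. The Ring type and names below are invented for the example, not from the thread.]

     #include <assert.h>
     #include <stdint.h>
     #include <stdio.h>

     /* Fixed-size circular queue with power-of-two capacity: the index
        wrap is a mask (idx & (CAP - 1)) instead of a branch or a
        modulus. Illustrative sketch only. */
     #define CAP 8u  /* must be a power of two */

     typedef struct {
         uint64_t buf[CAP];
         uint32_t head;  /* index of the oldest element */
         uint32_t len;
     } Ring;

     static void ring_push(Ring *r, uint64_t v) {   /* caller keeps len < CAP */
         r->buf[(r->head + r->len) & (CAP - 1)] = v;
         r->len++;
     }

     static uint64_t ring_pop(Ring *r) {            /* the popFront analogue */
         uint64_t v = r->buf[r->head];
         r->head = (r->head + 1) & (CAP - 1);       /* branch-free wrap */
         r->len--;
         return v;
     }

     int main(void) {
         Ring r = { {0}, 0, 0 };
         for (uint64_t i = 1; i <= 20; i++) {       /* more items than CAP: */
             ring_push(&r, i);                      /* head wraps repeatedly, */
             assert(ring_pop(&r) == i);             /* FIFO order is preserved */
         }
         puts("ok");
         return 0;
     }

[Unlike popFront on a D array, the storage here is reused in place, so no reallocation or GC pressure ever occurs.]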
             nums[iter_next] = total % 10^^8;

In such cases I suggest adding (), so the reader doesn't have to remember the precedence of uncommon operators.

Bye,
bearophile
Jul 03 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
 This is becoming a "fixed size circular queue". But maybe a 
 modulus is faster than a branch here.

I have tried, and it's slower both in D and C. I don't know why.

     #include <stdio.h>
     #include <stdlib.h>
     #include <stdint.h>

     uint64_t hFib(const uint32_t k, const uint32_t n) {
         uint64_t *nums = calloc(k + 1, sizeof(uint64_t));
         if (nums == NULL)
             exit(1);
         uint32_t iter = k;
         nums[k] = 1;
         uint64_t total = 1;
         for (uint32_t i = k; i < n + 1; i++) {
             const uint32_t iter_next = (iter + 1 > k) ? 0 : (iter + 1);
             total += nums[iter] - nums[iter_next];
             nums[iter_next] = total % 100000000LLU;
             iter = iter_next;
         }
         return nums[iter];
     }

     int main() {
         printf("%llu\n", hFib(100, 10000));
         printf("%llu\n", hFib(1594323, 9765625));
         return 0;
     }

Runtime of this C version: about 240 ms.

Bye,
bearophile
Jul 03 2012
prev sibling next sibling parent "ixid" <nuaccount gmail.com> writes:
I tested that, modulus is slower. The compiler is surely 
converting it to something branchless like:

     uint iter_next = (iter + 1) * !(iter + 1 > k);

I take your point but I think most people know that the assignment 
operators have the lowest precedence.
Jul 03 2012
prev sibling next sibling parent "ixid" <nuaccount gmail.com> writes:
In any case, with large values of k the branch prediction will be 
right almost all of the time, which explains why this form is 
faster than modulo: modulo is fairly slow, while this is a 
correctly predicted branch doing an addition, if the compiler 
doesn't make it branchless. The branchless version gives the same 
time result as the branched one; is there a way to force that line 
not to be optimized, to compare against the predicted version?
Jul 03 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
ixid:

 I take your point but I think most people know that the equals 
 operators have the lowest associativity.

Sorry, I meant:

     nums[iter_next] = total % (10 ^^ 8);

Instead of:

     nums[iter_next] = total % 10^^8;

But I presume lots of people know that powers have higher precedence :-)

Bye,
bearophile
Jul 03 2012
prev sibling next sibling parent "ixid" <nuaccount gmail.com> writes:
Oops! I have a bad habit of thinking of the power operator as a 
part of the value rather than as an operator.
Jul 03 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
ixid:

 In any case with large values of k the branch prediction will 
 be right almost all of the time, explaining why this form is 
 faster than modulo as modulo is fairly slow while this is a 
 correctly predicted branch doing an addition if it doesn't make 
 it branchless.

That seems the explanation.
 The branchless version gives the same time result as branched, 
 is there a way to force that line not to optimized to compare 
 the predicted version?

I don't fully understand the question. Do you mean annotations like the __builtin_expect of GCC?

Bye,
bearophile
Jul 03 2012
prev sibling parent "ixid" <nuaccount gmail.com> writes:
On Tuesday, 3 July 2012 at 17:25:18 UTC, bearophile wrote:
 ixid:

 In any case with large values of k the branch prediction will 
 be right almost all of the time, explaining why this form is 
 faster than modulo as modulo is fairly slow while this is a 
 correctly predicted branch doing an addition if it doesn't 
 make it branchless.

That seems the explanation.
 The branchless version gives the same time result as branched, 
 is there a way to force that line not to optimized to compare 
 the predicted version?

I don't fully understand the question. Do you mean annotations like the __builtin_expect of GCC?

Bye,
bearophile

If

     uint iter_next = iter + 1 > k? 0 : iter + 1;

is getting optimized by the compiler to something like

     uint iter_next = (iter + 1) * !(iter + 1 > k);

then it would be nice to be able to test the branched code without the rest of the program losing its speed optimizations, because, as I said, for large k the branch will almost always be correctly predicted, making me think it'd be faster than the branchless version.
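[At minimum the two forms can be checked for agreement; whether the compiler actually emits a branch still has to be verified in the assembly (e.g. gcc -S or objdump -d). A small sketch, not from the thread:]

     #include <assert.h>
     #include <stdint.h>
     #include <stdio.h>

     /* The branched wrap and the branchless rewrite compute the same
        value for every index in range; this only checks equivalence,
        it cannot tell which form the compiler emitted. */
     int main(void) {
         const uint32_t k = 100;
         for (uint32_t iter = 0; iter <= k; iter++) {
             uint32_t branched   = (iter + 1 > k) ? 0 : (iter + 1);
             uint32_t branchless = (iter + 1) * !(iter + 1 > k);
             assert(branched == branchless);
         }
         puts("ok");
         return 0;
     }

[With GCC, an empty inline-asm barrier such as __asm__ __volatile__("" : "+r"(x)) is a common way to keep a value from being optimized through, though which version is actually faster still needs a look at the emitted code.]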
Jul 03 2012