digitalmars.D.learn - D memory consumption/runtime speed problem

sybrandy (90/90) Jan 13 2010 Hello,

BCS (3/25) Jan 13 2010 My guess is that you are getting long GC pauses. D's GC is rather unadva...
bearophile (8/13) Jan 13 2010 My suggestions:
Daniel Keep (24/35) Jan 13 2010 I haven't verified this, but I'd be *deeply* suspicious of encodeNumber.

Steven Schveighoffer (7/43) Jan 14 2010 He's not using threads, you can simply do this in encode number:

sybrandy (107/107) Jan 14 2010

bearophile (5/7) Jan 14 2010 Use "private const" in D1 and "private enum" in D2, there's no need for ...

sybrandy <sybrandy gmail.com> writes:

Hello,

I've been writing a bit of compression code and I noticed some strange 
behavior that's driving me a bit batty.  I don't know if it's a bug with 
D or something I did.  All I know is I can't figure it out.

Below is the simplified version of the code as a single file.  It takes 
two parameters.  The first is a file to "compress" and the second is the 
number of times to run the benchmark.  E.g. bugtest foo.txt 2

Now, if I set the second parameter to 1, it runs decently fast.  26 
seconds on my work laptop for a half-sized enwiki8 from the Hutter 
challenge.  If I set it to 2, then it takes about 142 seconds.  In both 
cases a lot of memory is used and I'm not really sure why.  Also, after 
it prints out the results, it takes several seconds for the program to exit.

Am I doing something wrong?  I've tried every trick that I could find by 
reading the documentation.  Btw: The last time I tried this was with the 
latest version of D released at the beginning of the month.

__CODE__

import std.conv;
import std.stdio;
import std.stream;
import std.date;
import std.mmfile;
import std.array;

string filename = "enwik8_small";

private immutable uint ONE_BYTE_VAL = (1 << 6) - 1;
private immutable uint TWO_BYTE_VAL = (1 << 14) - 1;
private immutable uint THREE_BYTE_VAL = (1 << 22) - 1;
private immutable uint FOUR_BYTE_VAL = (1 << 30) - 1;
private immutable uint ONE_BYTE_MASK = (0 << 6);
private immutable uint TWO_BYTE_MASK = (1 << 6);
private immutable uint THREE_BYTE_MASK = (2 << 6);
private immutable uint FOUR_BYTE_MASK = (3 << 6);

ubyte[] encodeNumber(in uint count)
{
     if (count <= ONE_BYTE_VAL)
     {
         return [cast(ubyte)(ONE_BYTE_MASK | count)];
     }
     else if (count <= TWO_BYTE_VAL)
     {
         return [cast(ubyte)(TWO_BYTE_MASK | (count >>> 8))]
                ~ [cast(ubyte)(count & 0x000000ff)];
     }
     else if (count <= THREE_BYTE_VAL)
     {
         return [cast(ubyte)(THREE_BYTE_MASK | (count >>> 16))]
                ~ [cast(ubyte)((count >>> 8) & 0x000000ff)]
                ~ [cast(ubyte)(count & 0x000000ff)];
     }
     else if (count <= FOUR_BYTE_VAL)
     {
         return [cast(ubyte)(FOUR_BYTE_MASK | (count >>> 24))]
                ~ [cast(ubyte)((count >>> 16) & 0x000000ff)]
                ~ [cast(ubyte)((count >>> 8) & 0x000000ff)]
                ~ [cast(ubyte)(count & 0x000000ff)];
     }
     else
     {
         throw new Exception("Invalid count provided!");
     }
}

void encode(in ubyte[] buff, out ubyte[] output)
{
     ubyte currByte = buff[0];
     uint count = 0;
     auto appOutput = appender(&output);
     foreach (byteVal; buff)
     {
         if (byteVal != currByte && count > 0)
         {
             appOutput.put(encodeNumber(count));
             appOutput.put(currByte);
             currByte = byteVal;
             count = 0;
         }
         count++;
     }
     appOutput.put(encodeNumber(count));
     appOutput.put(currByte);
}

void benchCode()
{
     MmFile buff = new MmFile(filename);
     ubyte[] encodedBytes;
     encode(cast(ubyte[])buff[], encodedBytes);
}

void main(string[] args)
{
     filename = args[1];
     writeln("Benchmark time: ", benchmark!(benchCode)(to!(uint)(args[2])));
}

Jan 13 2010

BCS <none anon.com> writes:

Hello sybrandy,

 Hello,
 
 I've been writing a bit of compression code and I noticed some strange
 behavior that's driving me a bit batty.  I don't know if it's a bug
 with D or something I did.  All I know is I can't figure it out.
 
 Below is the simplified version of the code as a single file.  It
 takes two parameters.  The first is a file to "compress" and the
 second is the number of times to run the benchmark.  E.g. bugtest
 foo.txt 2
 
 Now, if I set the second parameter to 1, it runs decently fast.  26
 seconds on my work laptop for a half-sized enwiki8 from the Hutter
 challenge.  If I set it to 2, then it takes about 142 seconds.  In
 both cases a lot of memory is used and I'm not really sure why.  Also,
 after it prints out the results, it takes several seconds for the
 program to exit.
 
 Am I doing something wrong?  I've tried every trick that I could find
 by reading the documentation.  Btw: The last time I tried this was
 with the latest version of D released at the beginning of the month.
 

My guess is that you are getting long GC pauses. D's GC is rather unadvanced. 
IIRC there are ways to can make it do better but I don't know what they are.

Jan 13 2010

bearophile <bearophileHUGS lycos.com> writes:

sybrandy:
 Now, if I set the second parameter to 1, it runs decently fast.  26 
 seconds on my work laptop for a half-sized enwiki8 from the Hutter 
 challenge.  If I set it to 2, then it takes about 142 seconds.  In both 
 cases a lot of memory is used and I'm not really sure why.  Also, after 
 it prints out the results, it takes several seconds for the program to exit.

My suggestions:
- Stop using array joining, those create many small arrays that the GC has to
manage, and the current D GC is not efficient at all. There are many ways to
avoid array joinings and avoid most runtime allocations.
- If you can, at the end of the program you can add std.c.stdlib.exit(0);, this
kills the GC mid-way in a faster way (but probably destructors are not called,
so be careful).
- Use the latest daily build of the LDC compiler with very good command line
arguments, and if needed link-time optimizaiton activated too :-)

And then please tell us how much time it takes to run :-)

Bye,
bearophile

Jan 13 2010

Daniel Keep <daniel.keep.lists gmail.com> writes:

sybrandy wrote:
 Hello,
 
 I've been writing a bit of compression code and I noticed some strange
 behavior that's driving me a bit batty.  I don't know if it's a bug with
 D or something I did.  All I know is I can't figure it out.
 
 ...
 
 Am I doing something wrong?  I've tried every trick that I could find by
 reading the documentation.  Btw: The last time I tried this was with the
 latest version of D released at the beginning of the month.

I haven't verified this, but I'd be *deeply* suspicious of encodeNumber.
 I don't usually use array literals but, if I remember correctly, every
time it is called you're performing a heap allocation.  Even worse,
those concatentations might be performing separate allocations, too.

You could eliminate the overhead by using a passed-in buffer design like so:

ubyte[] encodeNumber(in uint count, ref ubyte[4] buffer)
{
    if (count <= ONE_BYTE_VAL)
    {
        buffer[0] = cast(ubyte)(ONE_BYTE_MASK | count);
        return buffer[0..1];
    }
    // ...
}

Then, in the calling function:

{
    ubyte[4] temp;
    foreach( ... )
    {
        appOutput.put(encodeNumber(count, temp));
    }
}

See if that helps.

Jan 13 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 14 Jan 2010 01:00:31 -0500, Daniel Keep  
<daniel.keep.lists gmail.com> wrote:

 sybrandy wrote:
 Hello,

 I've been writing a bit of compression code and I noticed some strange
 behavior that's driving me a bit batty.  I don't know if it's a bug with
 D or something I did.  All I know is I can't figure it out.

 ...

 Am I doing something wrong?  I've tried every trick that I could find by
 reading the documentation.  Btw: The last time I tried this was with the
 latest version of D released at the beginning of the month.

 I haven't verified this, but I'd be *deeply* suspicious of encodeNumber.
  I don't usually use array literals but, if I remember correctly, every
 time it is called you're performing a heap allocation.  Even worse,
 those concatentations might be performing separate allocations, too.

 You could eliminate the overhead by using a passed-in buffer design like  
 so:

 ubyte[] encodeNumber(in uint count, ref ubyte[4] buffer)
 {
     if (count <= ONE_BYTE_VAL)
     {
         buffer[0] = cast(ubyte)(ONE_BYTE_MASK | count);
         return buffer[0..1];
     }
     // ...
 }

 Then, in the calling function:

 {
     ubyte[4] temp;
     foreach( ... )
     {
         appOutput.put(encodeNumber(count, temp));
     }
 }

 See if that helps.

He's not using threads, you can simply do this in encode number:

static ubyte[4] buffer;

In fact, that might even work in D2 *with* threads since buffer would be  
thread local :)

-Steve

Jan 14 2010

sybrandy <sybrandy gmail.com> writes:

<snip>

Using a small buffer as suggested by Daniel Keep and Steven 
Schveighoffer significantly improved performance.  Now down to about 5 
seconds.  I ended up using the static array buffer since the 
encodeNumber function will be in its own file in my resulting program, 
so I can keep it private.  Doing something similar to the output buffer 
had a similar effect and it's now processing everything in less than 2 
seconds!

I didn't realize that all those little arrays were created.  Perhaps 
this is something that should be detailed in the arrays documentation 
or, perhaps even better, an optimization guide?  I honestly thought the 
GC would help keep memory in check as I didn't want to assume a 
worst-case scenario, which with my RLE implementation is 2 * input 
buffer size, but I guess I have to.

Well, perhaps my original code will be of use if Walter and the gang 
decide to try to revamp the GC and want some "bad" code to test it with.

Thanks all!

Btw: below is the updated code

import std.conv;
import std.stdio;
import std.stream;
import std.date;
import std.mmfile;
import std.array;

string filename = "enwik8_small";

private immutable uint ONE_BYTE_VAL = (1 << 6) - 1;
private immutable uint TWO_BYTE_VAL = (1 << 14) - 1;
private immutable uint THREE_BYTE_VAL = (1 << 22) - 1;
private immutable uint FOUR_BYTE_VAL = (1 << 30) - 1;
private immutable uint ONE_BYTE_MASK = (0 << 6);
private immutable uint TWO_BYTE_MASK = (1 << 6);
private immutable uint THREE_BYTE_MASK = (2 << 6);
private immutable uint FOUR_BYTE_MASK = (3 << 6);
private static ubyte[4] encodeBuff;

ubyte[] encodeNumber(in uint count)
{
     if (count <= ONE_BYTE_VAL)
     {
         encodeBuff[0] = cast(ubyte)(ONE_BYTE_MASK | count);
         return encodeBuff[0..1];
     }
     else if (count <= TWO_BYTE_VAL)
     {
         encodeBuff[0] = cast(ubyte)(TWO_BYTE_MASK | (count >>> 8));
         encodeBuff[1] = cast(ubyte)(count & 0x000000ff);
         return encodeBuff[0..2];
     }
     else if (count <= THREE_BYTE_VAL)
     {
         encodeBuff[0] = cast(ubyte)(THREE_BYTE_MASK | (count >>> 16));
         encodeBuff[1] = cast(ubyte)((count >>> 8) & 0x000000ff);
         encodeBuff[2] = cast(ubyte)(count & 0x000000ff);
         return encodeBuff[0..3];
     }
     else if (count <= FOUR_BYTE_VAL)
     {
         encodeBuff[0] = cast(ubyte)(FOUR_BYTE_MASK | (count >>> 24));
         encodeBuff[1] = cast(ubyte)((count >>> 16) & 0x000000ff);
         encodeBuff[2] = cast(ubyte)((count >>> 8) & 0x000000ff);
         encodeBuff[3] = cast(ubyte)(count & 0x000000ff);
         return encodeBuff[0..4];
     }
     else
     {
         throw new Exception("Invalid count provided!");
     }
}

void encode(in ubyte[] buff, ref ubyte[] output)
{
     ubyte currByte = buff[0];
     uint count = 0;
     uint outIdx = 0;
     ubyte[] temp;
     foreach (byteVal; buff)
     {
         if (byteVal != currByte && count > 0)
         {
             temp = encodeNumber(count);
             foreach (t; temp)
             {
                 output[outIdx++] = t;
             }
             output[outIdx++] = currByte;
             currByte = byteVal;
             count = 0;
         }
         count++;
     }
     temp = encodeNumber(count);
     foreach (t; temp)
     {
         output[outIdx++] = t;
     }
     output[outIdx++] = currByte;
}

void benchCode()
{
     MmFile buff = new MmFile(filename);
     ubyte[] encodedBytes;
     encodedBytes.length = cast(size_t)buff.length * 2;
     encode(cast(ubyte[])buff[], encodedBytes);
}

void main(string[] args)
{
     filename = args[1];
     writeln("Benchmark time: ", benchmark!(benchCode)(to!(uint)(args[2])));
}

Jan 14 2010

bearophile <bearophileHUGS lycos.com> writes:

sybrandy:
 private immutable uint ONE_BYTE_VAL = (1 << 6) - 1;
 private immutable uint TWO_BYTE_VAL = (1 << 14) - 1;

Use "private const" in D1 and "private enum" in D2, there's no need for an
immutable here.
In your code there are now no useless memory allocations, so the exit(0) trick
is not needed.

Bye,
bearophile

Jan 14 2010

D Programming

C/C++ Programming

Other

digitalmars.D.learn - D memory consumption/runtime speed problem