www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - TempAlloc: Request for Review For Inclusion in core.memory

reply dsimcha <dsimcha yahoo.com> writes:
My TempAlloc module's been ready for review for a while, but 
std.net.isemail and std.parallelism were ahead of it in the review 
queue.  Now that the reviews for both are done (I doubt running a review 
concurrently with a vote is problematic), I want to put this next in 
line.  I'd like to get it in druntime soon because Don wants to use it 
in BigInt and I also have some things I want to contribute to Phobos 
that TempAlloc would be useful for.

TempAlloc is a segmented stack for allocating and freeing temporary 
buffers efficiently.  It works by allocating memory in large chunks 
(currently 4 MB, but this number may be revised) at a time from the C 
heap, and parceling them out with the standard last in first out stack 
restriction.  The proposal also contains a few small helper functions 
that I think would be useful, but these can be made private or removed 
if people don't like them.

TempAlloc has the following advantages compared to allocation on the
call stack:

1. Pointers to memory allocated on the TempAlloc stack are still
valid when the function they were allocated from returns.
Functions can be written to create and return data structures on the
TempAlloc stack.

2. Since it is a segmented stack, large allocations can be performed 
with no danger of stack overflow errors.

It has the following advantages compared to heap allocation:

1. Both allocation and deallocation are extremely fast. Most allocations
consist of verifying enough space is available, incrementing a pointer 
and a performing a few cheap bookkeeping operations. Most deallocations
consist decrementing a pointer and performing a few cheap bookkeeping
operations.

2. The segmented stack is thread-local, so synchronization is only 
needed when a segment needs to be allocated or freed.

3. Fragmentation is not an issue when allocating memory on the
TempAlloc stack, though it can be an issue when trying to allocate
a new segment.

Code:

https://github.com/dsimcha/TempAlloc/blob/master/tempalloc.d

Docs:

http://cis.jhu.edu/~dsimcha/d/phobos/core_tempalloc.html
Apr 19 2011
next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Nice work. I've tried it out with allocating some large float arrays,
quick two test cases were a 94MB and 282MB audio file.

File 1 w/o TempAlloc: 410msecs
File 1 w/ TempAlloc: 184msecs

File 2 w/o TempAlloc 1321 msecs
File 2 w/ TempAlloc 567 msecs

Btw, how much *can* I allocate at once by using new (not specific to
TempAlloc)? I've tried a 500MB file, but allocation fails even though
I have plenty of free memory on my system. Is there a limit to using
'new'?

Of course, ideally I'd use some kind of buffering mechanism where a
portion of a file is loaded into memory, and when that's exhausted I'd
load another chunk. But I'd like to know the limits for a single
memory allocation on the garbage-collected heap. Is there some
internal limit for D? I know I can't allocate more than 3GB in the
entire app, but I'm looking at single allocations.

Pardon me if this all sounds stupid, I'm a complete newb when it comes
to memory allocation mechanics. :)
Apr 19 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrej Mitrovic (andrej.mitrovich gmail.com)'s article
 Nice work. I've tried it out with allocating some large float arrays,
 quick two test cases were a 94MB and 282MB audio file.
 File 1 w/o TempAlloc: 410msecs
 File 1 w/ TempAlloc: 184msecs
 File 2 w/o TempAlloc 1321 msecs
 File 2 w/ TempAlloc 567 msecs
 Btw, how much *can* I allocate at once by using new (not specific to
 TempAlloc)? I've tried a 500MB file, but allocation fails even though
 I have plenty of free memory on my system. Is there a limit to using
 'new'?
 Of course, ideally I'd use some kind of buffering mechanism where a
 portion of a file is loaded into memory, and when that's exhausted I'd
 load another chunk. But I'd like to know the limits for a single
 memory allocation on the garbage-collected heap. Is there some
 internal limit for D? I know I can't allocate more than 3GB in the
 entire app, but I'm looking at single allocations.
 Pardon me if this all sounds stupid, I'm a complete newb when it comes
 to memory allocation mechanics. :)
Ironically, these benchmarks just show that the C heap is faster than the D GC heap. They are well over the 4 MB mark, where TempAlloc just falls back on the C heap. As far as how much you can allocate on the GC heap, it boils down to how much **contiguous** free space you have in your address space. Since 32-bit address space is only 4 GB (and only 2 GB for user space on Win32), I'm not surprised you can't allocate more than 500 MB.
Apr 19 2011
next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Ok, perhaps you should put a note in the docs about falling back to
the C heap. Or maybe that's just common sense for people in the
know-how?

I've realized the allocation attempt was actually quite a bit more
than 500Mb. The C lib returned the number of channels and frame count,
which totalled to a frame count of 291_335_184. Multiply that by the
float size (4), and you get: 1_165_340_736, or 1.65 GB.

The GC heap can actually allocate even 200_000_000 elements on my
system, so that's around 800MB if I'm right.
Apr 19 2011
parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrej Mitrovic (andrej.mitrovich gmail.com)'s article
 Ok, perhaps you should put a note in the docs about falling back to
 the C heap. Or maybe that's just common sense for people in the
 know-how?
Right. This is roughly equivalent to allocating a new segment. I might put more details into the docs about the heuristics used, but I'm not sure how much I should document them. They're somewhat complicated, may change, and are implementation details. On the other hand, the whole point of TempAlloc is a performance optimization, and knowing such details is useful for predicting performance. On the third hand, it's open source. If you really want to know **exactly** how it works instead of just how to use it, you can always read the code.
Apr 19 2011
prev sibling parent reply "Denis Koroskin" <2korden gmail.com> writes:
On Tue, 19 Apr 2011 18:34:33 +0400, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Andrej Mitrovic (andrej.mitrovich gmail.com)'s article
 Nice work. I've tried it out with allocating some large float arrays,
 quick two test cases were a 94MB and 282MB audio file.
 File 1 w/o TempAlloc: 410msecs
 File 1 w/ TempAlloc: 184msecs
 File 2 w/o TempAlloc 1321 msecs
 File 2 w/ TempAlloc 567 msecs
 Btw, how much *can* I allocate at once by using new (not specific to
 TempAlloc)? I've tried a 500MB file, but allocation fails even though
 I have plenty of free memory on my system. Is there a limit to using
 'new'?
 Of course, ideally I'd use some kind of buffering mechanism where a
 portion of a file is loaded into memory, and when that's exhausted I'd
 load another chunk. But I'd like to know the limits for a single
 memory allocation on the garbage-collected heap. Is there some
 internal limit for D? I know I can't allocate more than 3GB in the
 entire app, but I'm looking at single allocations.
 Pardon me if this all sounds stupid, I'm a complete newb when it comes
 to memory allocation mechanics. :)
Ironically, these benchmarks just show that the C heap is faster than the D GC heap. They are well over the 4 MB mark, where TempAlloc just falls back on the C heap. As far as how much you can allocate on the GC heap, it boils down to how much **contiguous** free space you have in your address space. Since 32-bit address space is only 4 GB (and only 2 GB for user space on Win32), I'm not surprised you can't allocate more than 500 MB.
I think you are wrong about contiguous memory block. I believe virtual address space is divided into blocks (which are, indeed, contiguous) but the address space itself is not. What I mean is that if you try to allocate 1GB of memory, an OS might succeed that even if the physical memory is "fragmented" by other processes. This isn't the case for ALL operating systems, though. I've believe some OSes don't even allocate memory physically until you actually access it (by reading or writing to it), and even then they may only allocate memory for requested pages only. I might be wrong though.
Apr 19 2011
parent Daniel Gibson <metalcaedes gmail.com> writes:
Am 19.04.2011 17:02, schrieb Denis Koroskin:
 On Tue, 19 Apr 2011 18:34:33 +0400, dsimcha <dsimcha yahoo.com> wrote:
 
 == Quote from Andrej Mitrovic (andrej.mitrovich gmail.com)'s article
 Nice work. I've tried it out with allocating some large float arrays,
 quick two test cases were a 94MB and 282MB audio file.
 File 1 w/o TempAlloc: 410msecs
 File 1 w/ TempAlloc: 184msecs
 File 2 w/o TempAlloc 1321 msecs
 File 2 w/ TempAlloc 567 msecs
 Btw, how much *can* I allocate at once by using new (not specific to
 TempAlloc)? I've tried a 500MB file, but allocation fails even though
 I have plenty of free memory on my system. Is there a limit to using
 'new'?
 Of course, ideally I'd use some kind of buffering mechanism where a
 portion of a file is loaded into memory, and when that's exhausted I'd
 load another chunk. But I'd like to know the limits for a single
 memory allocation on the garbage-collected heap. Is there some
 internal limit for D? I know I can't allocate more than 3GB in the
 entire app, but I'm looking at single allocations.
 Pardon me if this all sounds stupid, I'm a complete newb when it comes
 to memory allocation mechanics. :)
Ironically, these benchmarks just show that the C heap is faster than the D GC heap. They are well over the 4 MB mark, where TempAlloc just falls back on the C heap. As far as how much you can allocate on the GC heap, it boils down to how much **contiguous** free space you have in your address space. Since 32-bit address space is only 4 GB (and only 2 GB for user space on Win32), I'm not surprised you can't allocate more than 500 MB.
I think you are wrong about contiguous memory block. I believe virtual address space is divided into blocks (which are, indeed, contiguous) but the address space itself is not. What I mean is that if you try to allocate 1GB of memory, an OS might succeed that even if the physical memory is "fragmented" by other processes.
That doesn't help if the logical memory of your process is fragmented as well :)
 This isn't the case for ALL operating systems, though.
 
 I've believe some OSes don't even allocate memory physically until you
 actually access it (by reading or writing to it), and even then they may
 only allocate memory for requested pages only.
 
 I might be wrong though.
Apr 19 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Oh, I should mention I was using a C library to actually load the data
into the buffer, so I didn't use -J imports or anything like that.
Apr 19 2011
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
slackCat should probably use the common type, not the first one passed?
Also it would be cool to have addRange/removeRange somehow automated for 
types that have indirections.
Or it's too much state to carry around?

-- 
Dmitry Olshansky
Apr 19 2011
next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Dmitry Olshansky (dmitry.olsh gmail.com)'s article
 slackCat should probably use the common type, not the first one passed?
Good point. Will fix.
 Also it would be cool to have addRange/removeRange somehow automated for
 types that have indirections.
 Or it's too much state to carry around?
Yeah, I don't want this because a major purpose of TempAlloc is to allow efficient parallelism, and addRange/removeRange involves locks. Besides, I think they use linear search in the current GC implementation rather than something more efficient. Once you're doing stuff like this, the advantages of TempAlloc go away rapidly. Besides, it's perfectly safe (at least now, while the GC isn't moving) to store a reference to GC memory in TempAlloc, as long as it's not the **only** reference. It's also perfectly safe to have indirections from TempAlloc data into other TempAlloc data, as long as the LIFO freeing order isn't violated.
Apr 19 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
On 4/19/2011 12:08 PM, Dmitry Olshansky wrote:
 slackCat should probably use the common type, not the first one passed?
Fixed. https://github.com/dsimcha/TempAlloc/commit/5e3fd0db9c905d361695d267e460a436c2b26666 https://github.com/dsimcha/TempAlloc/commit/5e74dd20d92c545cb4e3945c81175307c60b1253
Apr 19 2011