www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.announce - TempAlloc

reply dsimcha <dsimcha yahoo.com> writes:
As per a discusson on digitalmars.D about a month ago, I've created a temp
space allocator for quickly allocating memory in a last in, first out manner.
 Its main use is for allocating scratch space to be used within a function
without introducing threading bottlenecks, etc.  I've released it as an alpha
on scrapple.  I'm working on a major update to my statistics library (also on
scrapple) and will dogfood this there when I'm done with the update.  I think
that it's pretty useful, at least in number crunching-type code, in its
current form as a library.  However, it might be even better if integrated
into druntime and the core language to make it safer (it's pretty dangerous if
used incorrectly) or cleaner to use (I've done the best I can here, but if I
had some compiler support, I could automate some stuff that I make the caller
do.)

For the code, see
http://dsource.org/projects/scrapple/browser/trunk/tempAlloc.  The code is for
D2 only, but could probably be ported to D1 using Tango's thread-local storage.
Dec 05 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-12-06 00:20:49 +0100, dsimcha <dsimcha yahoo.com> said:

 As per a discusson on digitalmars.D about a month ago, I've created a temp
 space allocator for quickly allocating memory in a last in, first out manner.
  Its main use is for allocating scratch space to be used within a function
 without introducing threading bottlenecks, etc.  I've released it as an alpha
 on scrapple.  I'm working on a major update to my statistics library (also on
 scrapple) and will dogfood this there when I'm done with the update.  I think
 that it's pretty useful, at least in number crunching-type code, in its
 current form as a library.  However, it might be even better if integrated
 into druntime and the core language to make it safer (it's pretty dangerous if
 used incorrectly) or cleaner to use (I've done the best I can here, but if I
 had some compiler support, I could automate some stuff that I make the 
 caller do.)
 
 For the code, see
 http://dsource.org/projects/scrapple/browser/trunk/tempAlloc.  The code is for
 D2 only, but could probably be ported to D1 using Tango's thread-local storage.

I think that it could be useful to add an argument to to frameInit/free, in any case not just to speed things up, but to quickly catch mismatched init/free calls. using GC.malloc in malloc for large requested sizes kind of defeats the stated purpose (and what makes superstack more difficult to use) of not being GC scanned, even if I understand the problem of either making bookkeeping more complicated or create potentially big holes in the stack... Fawzi
Dec 06 2008
parent reply dsimca <dsimcha yahoo.com> writes:
== Quote from Fawzi Mohamed (fmohamed mac.com)'s article
 I think that it could be useful to add an argument to to
 frameInit/free, in any case not just to speed things up, but to quickly
 catch mismatched init/free calls.

Can you elaborate on this? I'm not exactly sure what such an argument would look like. Also, I wanted to make TempAlloc fairly simple even if it costs some performance so that I would actually want to use it. If doing something like this adds a lot of complexity for the caller, I probably won't implement it. Right now, I like the simple mixin API.
 using GC.malloc in malloc for large requested sizes kind of defeats the
 stated purpose (and what makes superstack more difficult to use) of not
 being GC scanned, even if I understand the problem of either making
 bookkeeping more complicated or create potentially big holes in the
 stack...

I see over 4MB in a single allocation to be kind of a corner case anyhow. If you need to allocate this much memory for a temp variable that doesn't escape the current scope, in a single allocation (probably pretty unusual already), since the cost of an allocation is sublinear in bytes allocated, the allocation time will likely be dwarfed by the time it takes to use the memory for whatever you need it for. Furthermore, whether by frameInit/frameFree or by TempAlloc.malloc/TempAlloc.free, the memory gets freed deterministically rather than waiting for the GC to run to free it, so it's still faster than using new, etc. In fact, I'm thinking I should have made this limit ~1MB (about 1/4 the size of a TempAlloc block) since allocating 4MB will always require the creation of a new block anyhow. I'm open to suggestions about how to deal with these large block cases, but I will prioritize handling the common case of much smaller allocations efficiently and cleanly over what I feel is somewhat of a corner case.
Dec 06 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-12-06 15:55:44 +0100, dsimca <dsimcha yahoo.com> said:

 == Quote from Fawzi Mohamed (fmohamed mac.com)'s article
 I think that it could be useful to add an argument to to
 frameInit/free, in any case not just to speed things up, but to quickly
 catch mismatched init/free calls.

Can you elaborate on this? I'm not exactly sure what such an argument would look like. Also, I wanted to make TempAlloc fairly simple even if it costs some performance so that I would actually want to use it. If doing something like this adds a lot of complexity for the caller, I probably won't implement it. Right now, I like the simple mixin API.

it could simply be an int storing the place in the stack... So if someone forgets to remove the frame the next call will see it immediately. By the way couldn't you avoid the thread local State by simply keeping a stack of State structures? Then instead of the thread local State you would simply look at the top of the State Stack, it should be faster and more self contained. Then the element to check could really be the size of the stack, and if it is simply an int you can make it optional, if present you check for mismatches... and you get rid of passing the State out... int initFrame()// returns the number of frames of the stack void freeFrame(int handler=-1)// frees the frame, if handler is >=0 then checks the number of frames
 using GC.malloc in malloc for large requested sizes kind of defeats the
 stated purpose (and what makes superstack more difficult to use) of not
 being GC scanned, even if I understand the problem of either making
 bookkeeping more complicated or create potentially big holes in the
 stack...

I see over 4MB in a single allocation to be kind of a corner case anyhow. If you need to allocate this much memory for a temp variable that doesn't escape the current scope, in a single allocation (probably pretty unusual already), since the cost of an allocation is sublinear in bytes allocated, the allocation time will likely be dwarfed by the time it takes to use the memory for whatever you need it for. Furthermore, whether by frameInit/frameFree or by TempAlloc.malloc/TempAlloc.free, the memory gets freed deterministically rather than waiting for the GC to run to free it, so it's still faster than using new, etc. In fact, I'm thinking I should have made this limit ~1MB (about 1/4 the size of a TempAlloc block) since allocating 4MB will always require the creation of a new block anyhow. I'm open to suggestions about how to deal with these large block cases, but I will prioritize handling the common case of much smaller allocations efficiently and cleanly over what I feel is somewhat of a corner case.

no problem I understand, and I am not a user yet. OT: I just read (when looking for your previous messages) about your statistics library. Nice! You might be interested in giving a look to tango's random package, it has fast generation of accurate random bumbers with gauss exp and gamma distributions... D1.0, but bringing it to D2.0 shouldn't be difficult... Fawzi
Dec 07 2008
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Fawzi Mohamed (fmohamed mac.com)'s article
 it could simply be an int storing the place in the stack...
 So if someone forgets to remove the frame the next call will see it
 immediately.
 By the way couldn't you avoid the thread local State by simply keeping
 a stack of State structures?
 Then instead of the thread local State you would simply look at the top
 of the State Stack, it should be faster and more self contained.

I really don't see how this can work. AFAIK, there are only 4 ways to deal with global mutable state in a multithreaded program: 1. Locking. 2. Somehow do everything atomically, if you can. 3. Make the "global" state thread-local. 4. Eliminate it altogether. The biggest reason for me writing TempAlloc in the first place was to avoid lock contention for GC malloc access. TempAlloc is somewhat faster even in single-threaded tests, but the really huge, orders of magnitude speedups come when it's used to avoid this contention.
 Then the element to check could really be the size of the stack, and if
 it is simply an int you can make it optional, if present you check for
 mismatches... and you get rid of passing the State out...
 int initFrame()// returns the number of frames of the stack
 void freeFrame(int handler=-1)// frees the frame, if handler is >=0
 then checks the number of frames

I thought about doing something like this. The biggest problem is large blocks. Since frequent very large allocations are where GC performance really suffers and false pointer issues turn up, I wanted to make sure that these get deleted deterministically, too. Also, just to note: I realized that I had not taken into account alignment when I first wrote TempAlloc. I've updated it to use 16-byte alignment (good for x86), and to fix tempdup to work right with const.
Dec 07 2008
parent Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-12-07 18:44:30 +0100, dsimcha <dsimcha yahoo.com> said:

 == Quote from Fawzi Mohamed (fmohamed mac.com)'s article
 it could simply be an int storing the place in the stack...
 So if someone forgets to remove the frame the next call will see it
 immediately.
 By the way couldn't you avoid the thread local State by simply keeping
 a stack of State structures?
 Then instead of the thread local State you would simply look at the top
 of the State Stack, it should be faster and more self contained.

I really don't see how this can work. AFAIK, there are only 4 ways to deal with global mutable state in a multithreaded program: 1. Locking. 2. Somehow do everything atomically, if you can. 3. Make the "global" state thread-local. 4. Eliminate it altogether. The biggest reason for me writing TempAlloc in the first place was to avoid lock contention for GC malloc access. TempAlloc is somewhat faster even in single-threaded tests, but the really huge, orders of magnitude speedups come when it's used to avoid this contention.

mmh sorry this comes from not having really understood how TmpAlloc was supposed to work. Indeed if you want to hide completely the presence of a State, and not pass it explicitly to the allocation routines you cannot avoid thread local. I was mislead by the name, State is actually the superstack object, what I meant was to keep an extra index in it that simply counts the number of frames, and return/ask it in initFrame freeFrame so that you can check that you did not have any mismatched frames.
Dec 07 2008