Computational complexity for HeapAlloc - c++

C++ implementation of vector relies on extending it twice when it's current capacity is exceeded. push_back operation, obviously, uses HeapAlloc for this matter on Windows, and, according to C++ standard, should somehow use amortized constant time. How long does it take for Windows to HeapAlloc, and how could one compute that push_back uses amortized constant time from this, without prior knowledge of the set of instructions in one specific program?

I think that it is pretty safe to assume that HeapAlloc (or dynamic allocation in general, in any OS) is amortized constant.
First, when you ask the heap to deliver a given amount of memory, it has to find a good chunk of free memory that it can allocate to you. The internal logic of how that chunk of memory is found is specific to the implementation (but there are some good clues to be found), and could vary quite a bit depending on things like fragmentation and efficiency. However, if an implementation of a heap cannot deliver this in amortized constant time, then that implementation is really terrible. It is generally called a "heap" because the basic implementation for this "find a good chunk" mechanism is to use a kind of priority-queue or list of first-available chunks which can deliver the (near-)best chunk immediately. More sophisticated versions are not going to be worse than that on average (in amortized sense), but could involve additional "book-keeping" work depending on how it keeps track of free chunks and how much it tries to reduce fragmentation, for a starting-point, look at the classic buddy allocator. So, that's for the "find a good chunk" part. There are, of course, many other interesting issues, like contention, but overall, it's still amortized constant-time (it's the constant factor that is much more important here).
The other part of what the heap does is facing the OS kernel (or other OS-related layers) to obtain additional memory whenever it runs out. Basically, the heap is a memory manager that obtains large chunks of memory from the OS (or back-end), and then micro-manages those large chunks of memory to allocate all the smaller chunks that are requested by the program (or front-end) (through new, malloc, HeapAlloc, etc.). Once in a while, you are going to ask the heap for more memory, and the heap will find that it has run out of memory to give, so, it has to turn to the OS and ask for another large chunk. As you see, the heap itself must now behave very much like std::vector in the sense that it has to exponentially grow in size in order to amortize the cost of the requests to the OS for more memory.
When the OS kernel delivers memory to the heap, then the cost is, again, constant (most likely), because it is probably implemented in a very similar way to the way that a heap allocates memory within the program. The difference is, of course, that if the OS runs out of memory, it can't turn to anyone for more memory, so it just fails instead. What is the most expensive about requesting memory from the OS kernel is the switch between user-mode / kernel-mode, inter-process contention, and the expansion of the virtual address space.
how could one compute that push_back uses amortized constant time from this, without prior knowledge of the set of instructions in one specific program?
Basically, the C++ standard requires push_back on std::vector to be achieved in amortized constant-time excluding the cost of the heap-allocation, as they specifically exclude the memory allocation time from the analysis, which they cannot really dictate or predict. But, you can test it empirically if you want, I have in the past, and I can say for sure (at least under Linux) that push_back on std::vector is indeed amortized constant-time.

The term "amortized" constant time means "if you perform a long sequence of operations, each operation will take, on average, a constant time." The way this works with a vector is with table doubling. When your vector runs out of space, instead of creating more space for a single item, it doubles the size of the current vector. This will take time equal to the size of the vector (O(n) for you algorithms people), but you only have to do it when you double the size of the table. Therefore, if you start with a vector of size n and add n more items, the first item will take O(n) time (to double the size of the table and copy the old values over), but the next n-1 items will take O(1) time. This gives O(n) + (n-1)O(1) time for n insertions, or a total of O(2n) = O(n) time. Averaging this gives out O(1) time for each operation.

Keep in mind that the STL will use new (or custom allocators) so it is not required that HeapAlloc will be called on Windows platforms.
That being said, the STL documentation cannot guaranteed any run-time consistency with lower-level API calls, it just defines what is implemented within its own walls.
Also keep in mind that the implementation of HeapAlloc is subject to change between OS or even Service Pack releases, so any answers here can easily be obsoleted by future releases.
Personally, I would be more concerned about the number and length of locks within the heap (required to ensure data-structure consistency due to simultaneous access via multiple threads).
Further reading: MSDN -- Heap: Pleasures and Pains

Related

Measuring Memory Consumption used by a concurrent code in C++

I'm implementing the FineList and the LazyList classes in C++. Both the above concurrent linked lists have been implemented in Java in the book "The Art of Multiprocessor Programming". I'd like to measure the amount of memory consumed by each algorithm during it's entire course of execution. I'm not sure how do I do that. I could just keep track of the number of times "new" is called by each thread, in a counter, in both algorithms and use that as a measuring criteria. Similarly, whenever a thread calls "delete", I'd decrement the counter. Is that a fair criteria to measure memory consumption? The problem is FineList algorithm allows me to immediately "delete" a node once it's removed from the linked-list, due to it's lock-based nature. But that's not the case in LazyList algorithm, since it has lock-free methods. Is there any other way to measure memory consumption or is the above method fair to both algorithms?
If you keep an ordered log of the new and delete calls (including the size requested) and you are sure the code you are interested in only uses new and delete and not other allocation routines, you can determine more or less the theoretical memory consumption at any point in time by keeping a running tally of the excess of memory allocated over memory freed. You could perhaps generate such a log by overloading global operator new(size_t) as well as delete.
The number will be only theoretical due to several factors:
Allocators add a certain amount of overhead, so the actual memory allocated will generally be larger than the sum of the sizes of the objects allocated. This overhead includes fragmentation, since some unallocated memory could be technically-free-but-not-really-available.
Allocators may not return any memory to the OS, or they may return it, but in an unpredictable way. So if you are measuring "memory allocated" as far as the OS is concerned (versus as far as the runtime allocator is concerned), you have a harder problem and this is highly allocator dependent.
Especially in a multi-threaded scenario, not all freed memory is necessarily used by future allocations. A specific case of this for thread-aware allocators is the use of thread-local allocation buffers: memory freed on one thread might not be immediately usable on another thread, until some threshold is reached whereby allocations can move across threads. This might be relevant for your scenario if there is a disparity of nodes allocated and freed among threads.
There is a whole layer of complexity when it comes to determining how much memory has been allocated - even to define that term. For example, for a large allocation, the OS may return you a chunk of memory which doesn't actually exist in RAM, and only page it is lazily as you access it. So if you don't access everything you allocate, the numbers reported by the allocator could actually be an over-estimate.
That's just scratching the surface.
C++ allows you to provide your own operator new and matching operator delete. This is useful for such measurement tasks. You can even use this to figure out the memory used as a function of the allocation strategy, e.g. see how much more memory your algorithm needs when rounding allocation up to a multiple of 16 bytes.
The large benefit of this approach is that you don't need to touch the code of your algorithm itself. You can use the bookkeeping of your own allocator.
Mind you, the idea of lock-free programming can be overly optimistic if you actually need new. There's no guarantee in C++ whatsoever that new is lock-free. And since C++ allows you to new memory in one thread and delete it in another, some cross-thread synchronization is needed.

How to allocate a large dynamic array in C++?

So I am currently trying to allocate dynamically a large array of elements in C++ (using "new"). Obviously, when "large" becomes too large (>4GB), my program crashes with a "bad_alloc" exception because it can't find such a large chunk of memory available.
I could allocate each element of my array separately and then store the pointers to these elements in a separate array. However, time is critical in my application so I would like to avoid as much cache misses as I can. I could also group some of these elements into blocks but what would be the best size for such a block?
My question is then: what is the best way (timewise) to allocate dynamically a large array of elements such that elements do not have to be stored contiguously but they must be accessible by index (using [])? This array is never going to be resized, no elements is going to be inserted or deleted of it.
I thought I could use std::deque for this purpose, knowing that the elements of an std::deque might or might not be stored contiguously in memory but I read there are concerns about the extra memory this container takes?
Thank you for your help on this!
If your problem is such that you actually run out of memory allocating fairly small blocks (as is done by deque) is not going to help, the overhead of tracking the allocations will only make the situation worse. You need to re-think your implementation such that you can deal with it in blocks that will still fit in memory. For such problems, if using x86 or x64 based hardware I would suggest blocks of at least 2 megabytes (the large page size).
Obviously, when "large" becomes too large (>4GB), my program crashes
with a "bad_alloc" exception because it can't find such a large chunk
of memory available.
You should be using 64-bit CPU and OS at this point, allocating huge contiguous chunk of memory should not be a problem, unless you are actually running out of memory. It is possible that you are building 32-bit program. In this case you won't be able to allocate more than 4 GB. You should build 64-bit application.
If you want something better than plain operator new, then your question is OS-specific. Look at API provided by your OS: on POSIX system you should look for mmap and for VirtualAlloc on Windows.
There are multiple problems with large allocations:
For security reasons OS kernel never gives you pages filled with garbage values, instead all new memory will be zero initialized. This means you don't have to initialize that memory as long as zeroes are exactly what you want.
OS gives you real memory lazily on first access. If you are processing large array, you might waste a lot of time taking page faults. To avoid this you can use MAP_POPULATE on Linux. On Windows you can try PrefetchVirtualMemory (but I am not sure if it can do the job). This should make init allocation slower, but should decrease total time spent in kernel.
Working with large chunks of memory wastes slots in Translation Lookaside Buffer (TLB). Depending on you memory access pattern, this can cause noticeable slowdown. To avoid this you can try using large pages (mmap with MAP_HUGETLB, MAP_HUGE_2MB, MAP_HUGE_1GB on Linux, VirtualAlloc and MEM_LARGE_PAGES). Using large pages is not easy, as they are usually not available by default. They also cannot be swapped out (always "locked in memory"), so using them requires privileges.
If you don't want to use OS-specific functions, the best you can find in C++ is std::calloc. Unlike std::malloc or operator new it returns zero initialized memory so you can probably avoid wasting time initializing that memory. Other than that, there is nothing special about that function. But this is the closest you can get while staying withing standard C++.
There are no standard containers designed to handle large allocations, moreover, all standard container are really really bad at handling those situations.
Some OSes (like Linux) overcommit memory, others (like Windows) do not. Windows might refuse to give you memory if it knows it won't be able to satisfy your request later. To avoid this you might want to increase your page file. Windows needs to reserve that space on disk beforehand, but it does not mean it will use it (start swapping). As actual memory is given to programs lazily, there are might be a lot of memory reserved for applications that will never be actually given to them.
If increasing page file is too inconvenient, you can try creating large file and map it into memory. That file will serve as a "page file" for your memory. See CreateFileMapping and MapViewOfFile.
The answer to this question is extremely application, and platform, dependent. These days if you just need a small integer factor greater than 4GB, you use a 64-bit machine, if possible. Sometimes reducing the size of the element in the array is possible as well. (E.g. using 16-bit fixed-point of half-float instead of 32-bit float.)
Beyond this, you are either looking at sparse arrays or out-of-core techniques. Sparse arrays are used when you are not actually storing elements at all locations in the array. There are many possible implementations and which is best depends on both the distribution of the data and the access pattern of the algorithm. See Eigen for example.
Out-of-core involves explicitly reading and writing parts of the array to/from disk. This used to be fairly common, but people work pretty hard to avoid doing this now. Applications that really require such are often built on top of a database or similar to handle the data management. In scientific computing, one ends up needing to distribute the compute as well as the data storage so there's a lot of complexity around that as well. For important problems the entire design may be driven by having good locality of reference.
Any sparse data structure will have overhead in how much space it takes. This can be fairly low, but it means you have to be careful if you actually have a dense array and are simply looking to avoid memory fragmentation.
If your problem can be broken into smaller pieces that only access part of the array at a time and the main issue is memory fragmentation making it hard to allocate one large block, then breaking the array in to pieces, effectively adding an outer vector of pointers, is a good bet. If you have random access to an array larger than 4 gigabytes and no way to localize the accesses, 64-bit is the way to go.
Depending on what you need the memory for and your speed concerns, and if you're using Linux, you can always try using mmap and simulate a sort of swap. It might be slower, but you can map very large sizes. See Mmap() an entire large file

Is there a downside to a significant overestimation in a reserve()?

Let's suppose we have a method that creates and uses possibly very big vector<foo>s.
The maximum number of elements is known to be maxElems.
Standard practice as of C++11 is to my best knowledge:
vector<foo> fooVec;
fooVec.reserve(maxElems);
//... fill fooVec using emplace_back() / push_back()
But what happens if we have a scenario where the number of elements is going to be significantly less in the majority of calls to our method?
Is there any disadvantage to the conservative reserve call other than the excess allocated memory (which supposably can be freed with shrink_to_fit() if necessary)?
Summary
There is likely to be some downside to using a too-large reserve, but how much is depends both on the size and context of your reserve() as well as your specific allocator, operating system and their configuration.
As you are probably aware, on platforms like Windows and Linux, large allocations are generally not allocating any physical memory or page table entries until it is first accessed, so you might imagine large, unused allocations to be "free". Sometimes this is called "reserving" memory without "committing" it, and I'll use those terms here.
Here are some reasons this might not be as free as you'd imagine:
Page Granularity
The lazy commit described above only happens at a page granularity. If you are using (typical) 4096 byte pages, it means that if you usually reserve 4,000 bytes for a vector that will usually contains elements taking up 100 bytes, the lazy commit buys you nothing! At least the whole page of 4096 bytes has to be committed and you don't save physical memory. So it isn't just the ratio between the expected and reserved size that matters, but the absolute size of the reserved size that determines how much waste you'll see.
Keep in mind that many systems are now using "huge pages" transparently, so in some cases the granularity will be on the order of 2 MB or more. In that case you need allocations on the order of 10s or 100s of MB to really take advantage of the lazy allocation strategy.
Worse Allocation Performance
Memory allocators for C++ generally try to allocate large chunks of memory (e.g., via sbrk or mmap on Unix-like platforms) and then efficiently carve that up into the small chunks the application is requesting. Getting these large chunks of memory via say a system call like mmap may be several orders of magnitude slower than the fast path allocation within the allocator which is often only a dozen instructions or so. When you ask for large chunks that you mostly won't use, you defeat that optimization and you'll often be going down the slow path.
As a concrete example, let's say your allocator asks mmap for chunks of 128 KB which it carves up to satisfy allocations. You are allocating about 2K of stuff in a typical vector, but reserve 64K. You'll now pay a mmap call for every other reserve call, but if you just asked for the 2K you ultimately needed, you'd have about 32 times fewer mmap calls.
Dependence on Overcommit Handling
When you ask for a lot of memory and don't use it, you can get into the situation where you've asked for more memory than your system supports (e.g., more than your RAM + swap). Whether this is even allowed depends on your OS and how it is configured, and no matter what you are up for some interesting behavior if you subsequently commit more memory simply by writing it. I means that arbitrary processes may be killed, or you might get unexpected errors on any memory write. What works on one system may fail on another due to different overcommit tunables.
Finally, it makes managing your process a bit harder since the "VM size" metric as reported by monitoring tools won't have much relationship to what your process may ultimately commit.
Worse Locality
Allocating more memory than you need makes it likely that your working set will be more sparsely spread out in the virtual address space. The overall effect is a reduction in locality of reference. For very small allocations (e.g., a few dozen bytes) this may reduce the within-same-cache-line locality, but for larger sizes the main effect is likely to be to spread your data onto more physical pages, increasing TLB pressure. The exact thresholds will depend a lot on details like whether hugepages are enabled.
What you cite as standard C++11 practice is hardly standard, and probably not even good practice.
These days I'd be inclined to discourage the use of reserve, and let your platform (i.e. the C++ standard library optimised to your platform) deal with the reallocation as it sees fit.
That said, calling reserve with an excessive amount may well also effectively be benign due to modern operating systems only giving you the memory if you actually use it (Linux is particularly good at that). But relying on this could cause you trouble if you port to a different operating system, whereas simply omitting reserve is less likely to.
You have 2 options:
You don't call reserve and let the default implementation of the vector figure out the size, which uses exponential growth.
Or
You call reserve(maxElems) and shrink_to_fit() afterwards.
The first option is less likely to give you a std::bad_alloc (even though modern OS's probably will never throw this if you don't touch the last block of the reserved memory)
The second option is less likely to invoke multiple calls to reserve, the first option will most likely have 2 calls : the reserve and the shrink_to_fit() (which might be a no-op depending on the implementation since it's non-binding) while option 2 might have significant more. Less calls = better performance.
If you are on linux reserve will call malloc which only allocates virtual memory, but not physical. Physical memory will be used when you actually insert elements to a vector. That's why you can considerably overestimate reserve size.
If you can estimate maximum vector size you can reserve it just once on start to avoid reallocations and no physical memory will be wasted.
But what happens if we have a scenario where the number of elements is going to be significantly less in the majority of calls to our method?
The allocated memory simply remains unused.
Is there a downside to a significant overestimation in a reserve()?
Yes, at least a potential downside: The memory that was allocated for the vector can not be used for other objects.
This is especially problematic in embedded systems that do not usually have virtual memory, and little physical memory to spare.
Concerning programs running inside an operating system, if the operating system does not "over commit" the memory, then this can still easily cause the virtual memory allocation of the program to reach the limit given to the process.
Even in over committing system, particularly gratuitous overestimation can in theory result in exhaustion of virtual address space. But you need pretty big numbers to achieve that on 64 bit architectures.
Is there any disadvantage to the conservative reserve call other than the excess allocated memory (which supposably can be be freed with shrink_to_fit() if necessary)?
Well, this is slower than initially allocating exactly correct amount of memory, but the difference might be marginal.

Most efficient way to grow array C++

Apologies if this has been asked before, I can't find a question that fully answers what I want to know. They mention ways to do this, but don't compare approaches.
I am writing a program in C++ to solve a PDE to steady state. I don't know how many time steps this will take. Therefore I don't know how long my time arrays will be. This will have a maximum time of 100,000s, but the time step could be as small as .001, so it could be as many as 1e8 doubles in length in the worst case (not necessarily a rare case either).
What is the most efficient way to implement this in terms of memory allocated and running time?
Options I've looked at:
Dynamically allocating an array with 1e8 elements, most of which won't ever be used.
Allocating a smaller array initially, creating a larger array when needed and copying elements over
Using std::vector and it's size increasing functionality
Are there any other options?
I'm primarily concerned with speed, but I want to know what memory considerations come into it as well
If you are concerned about speed just allocate 1e8 doubles and be done with it.
In most cases vector should work just fine. Remember that amortized it's O(1) for the append.
Unless you are running on something very weird the OS memory allocation should take care of most fragmentation issues and the fact that it's hard to find a 800MB free memory block.
As noted in the comments, if you are careful using vector, you can actually reserve the capacity to store the maximum input size in advance (1e8 doubles) without paging in any memory.
For this you want to avoid the fill constructor and methods like resize (which would end up accessing all the memory) and use reserve and push_back to fill it and only touch memory as needed. That will allow most operating systems to simply page in chunks of your accessed vector at a time instead of the entire contents all at once.
Yet I tend to avoid this solution for the most part at these kinds of input scales, but for simple reasons:
A possibly-paranoid portability fear that I may encounter an operating system which doesn't have this kind of page-on-demand behavior.
A possibly-paranoid fear that the allocation may fail to find a contiguous set of unused pages and face out of memory errors (this is a grey zone -- I tend to worry about this for arrays which span gigabytes, hundreds of megabytes is borderline).
Just a totally subjective and possibly dumb/old bias towards not leaning too heavily on the operating system's behavior for paging in allocated memory, and preferring to have a data structure which simply allocates on demand.
Debugging.
Among the four, the first two could simply be paranoia. The third might just be plain dumb. Yet at least on operating systems like Windows, when using a debug build, the memory is initialized in its entirety early, and we end up mapping the allocated pages to DRAM immediately on reserving capacity for such a vector. Then we might end up leading to a slight startup delay and a task manager showing 800 megabytes of memory usage for a debug build even before we've done anything.
While generally the efficiency of a debug build should be a minor concern, when the potential discrepancy between release and debug is enormous, it can start to render production code almost incapable of being effectively debugged. So when the differences are potentially vast like this, my preference is to "chunk it up".
The strategy I like to apply here is to allocate smaller chunks -- smaller arrays of N elements, where N might be, say, 512 doubles (just snug enough to fit a common denominator page size of 4 kilobytes -- possibly minus a couple of doubles for chunk metadata). We fill them up with elements, and when they get full, create another chunk.
With these chunks, we can aggregate them together by either linking them (forming an unrolled list) or storing a vector of pointers to them in a separate aggregate depending on whether random-access is needed or merely sequential access will suffice. For the random-access case, this incurs a slight overhead, yet one I've tended to find relatively small at these input scales which often have times dominated by the upper levels of the memory hierarchy rather than register and instruction level.
This might be overkill for your case and a careful use of vector may be the best bet. Yet if that doesn't suffice and you have similar concerns/needs as I do, this kind of chunky solution might help.
The only way to know which option is 'most efficient' on your machine is to try a few different options and profile. I'd probably start with the following:
std::vector constructed with the maximum possible size.
std::vector constructed with a conservative ballpark size and push_back.
std::deque and push_back.
The std::vector vs std::deque debate is ongoing. In my experience, when the number of elements is unknown and not too large, std::deque is almost never faster than std::vector (even if the std::vector needs multiple reallocations) but may end up using less memory. When the number of elements is unknown and very large, std::deque memory consumption seems to explode and std::vector is the clear winner.
If after profiling, none of these options offers satisfactory performance, then you may want to consider writing a custom allocator.

Dealing with fragmentation in a memory pool?

Suppose I have a memory pool object with a constructor that takes a pointer to a large chunk of memory ptr and size N. If I do many random allocations and deallocations of various sizes I can get the memory in such a state that I cannot allocate an M byte object contiguously in memory even though there may be a lot free! At the same time, I can't compact the memory because that would cause a dangling pointer on the consumers. How does one resolve fragmentation in this case?
I wanted to add my 2 cents only because no one else pointed out that from your description it sounds like you are implementing a standard heap allocator (i.e what all of us already use every time when we call malloc() or operator new).
A heap is exactly such an object, that goes to virtual memory manager and asks for large chunk of memory (what you call "a pool"). Then it has all kinds of different algorithms for dealing with most efficient way of allocating various size chunks and freeing them. Furthermore, many people have modified and optimized these algorithms over the years. For long time Windows came with an option called low-fragmentation heap (LFH) which you used to have to enable manually. Starting with Vista LFH is used for all heaps by default.
Heaps are not perfect and they can definitely bog down performance when not used properly. Since OS vendors can't possibly anticipate every scenario in which you will use a heap, their heap managers have to be optimized for the "average" use. But if you have a requirement which is similar to the requirements for a regular heap (i.e. many objects, different size....) you should consider just using a heap and not reinventing it because chances are your implementation will be inferior to what OS already provides for you.
With memory allocation, the only time you can gain performance by not simply using the heap is by giving up some other aspect (allocation overhead, allocation lifetime....) which is not important to your specific application.
For example, in our application we had a requirement for many allocations of less than 1KB but these allocations were used only for very short periods of time (milliseconds). To optimize the app, I used Boost Pool library but extended it so that my "allocator" actually contained a collection of boost pool objects, each responsible for allocating one specific size from 16 bytes up to 1024 (in steps of 4). This provided almost free (O(1) complexity) allocation/free of these objects but the catch is that a) memory usage is always large and never goes down even if we don't have a single object allocated, b) Boost Pool never frees the memory it uses (at least in the mode we are using it in) so we only use this for objects which don't stick around very long.
So which aspect(s) of normal memory allocation are you willing to give up in your app?
Depending on the system there are a couple of ways to do it.
Try to avoid fragmentation in the first place, if you allocate blocks in powers of 2 you have less a chance of causing this kind of fragmentation. There are a couple of other ways around it but if you ever reach this state then you just OOM at that point because there are no delicate ways of handling it other than killing the process that asked for memory, blocking until you can allocate memory, or returning NULL as your allocation area.
Another way is to pass pointers to pointers of your data(ex: int **). Then you can rearrange memory beneath the program (thread safe I hope) and compact the allocations so that you can allocate new blocks and still keep the data from old blocks (once the system gets to this state though that becomes a heavy overhead but should seldom be done).
There are also ways of "binning" memory so that you have contiguous pages for instance dedicate 1 page only to allocations of 512 and less, another for 1024 and less, etc... This makes it easier to make decisions about which bin to use and in the worst case you split from the next highest bin or merge from a lower bin which reduces the chance of fragmenting across multiple pages.
Implementing object pools for the objects that you frequently allocate will drive fragmentation down considerably without the need to change your memory allocator.
It would be helpful to know more exactly what you are actually trying to do, because there are many ways to deal with this.
But, the first question is: is this actually happening, or is it a theoretical concern?
One thing to keep in mind is you normally have a lot more virtual memory address space available than physical memory, so even when physical memory is fragmented, there is still plenty of contiguous virtual memory. (Of course, the physical memory is discontiguous underneath but your code doesn't see that.)
I think there is sometimes unwarranted fear of memory fragmentation, and as a result people write a custom memory allocator (or worse, they concoct a scheme with handles and moveable memory and compaction). I think these are rarely needed in practice, and it can sometimes improve performance to throw this out and go back to using malloc.
write the pool to operate as a list of allocations, you can then extended and destroyed as needed. this can reduce fragmentation.
and/or implement allocation transfer (or move) support so you can compact active allocations. the object/holder may need to assist you, since the pool may not necessarily know how to transfer types itself. if the pool is used with a collection type, then it is far easier to accomplish compacting/transfers.