Find next available chunk in a memory pool - c++

So, I've been spending some time implementing a memory pool class in C++. Except for some minor problems along the way, it's gone fairly well. However, when I tried testing it today by allocating 1000 chunks by first using the memory pool and then comparing it to using new, I was actually getting close to three times worse performance (in nanoseconds) when using the memory pool. My allocation method looks like this:
template <class T> T* MemPool<T>::allocate()
{
    Chunk<T>* tempChunk = _startChunk;
    while (tempChunk->_free == false)
    {
        if (tempChunk->_nextChunk == NULL)
            throw std::runtime_error("No available chunks");
        tempChunk = tempChunk->_nextChunk;
    }
    tempChunk->_free = false;
    return &tempChunk->object;
}
I am starting at the first chunk in the pool and doing a search through the pool's linked list until I find a free chunk, or reach the end of the pool. Now, the bigger the pool, the longer this will take as the search has an O(n) time complexity where n is the number of chunks in the pool.
So I was curious whether anyone has any thoughts on how to improve the allocation. My initial thought was to use two linked lists instead of just the one, where one contains free chunks and the other allocated chunks. When a new chunk is to be allocated, I would simply take the first element of the free list and move it to the allocated list. As far as I can see, this would eliminate the need to do any searching when allocating, and leave only deallocating requiring a search to find the correct chunk.
Any thoughts are appreciated as this is my first time working directly with memory in this way. Thanks!

Instead of using a hand-crafted linked list, it would probably be more effective to use a std::list (particularly if you use it with a custom allocator). Less error prone, and probably better optimised.
Using two lists allows simplifying a lot. There is no need to track, in the list itself, whether a chunk is free or not, since that is implied by which list the chunk is in (all that is needed is to ensure a chunk never appears in both lists).
Your current implementation means you are having to walk the linked list, both when allocating and deallocating.
If the chunks are fixed size, then allocation would simply be implemented by moving the first available chunk from the free to the allocated list - no need to search. To deallocate a chunk, you would still need to find it in the allocated list, which means you would need to map a T* to an entry in the list (e.g. perform a search), but then the act of deallocation is simply moving the entry from one list to the other.
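For the fixed-size case, here is a minimal sketch of that two-list idea using std::list and splice, which moves a node between lists in O(1). The Chunk and MemPool names mirror the question; the rest is an illustrative assumption (in particular, T is assumed default-constructible).

#include <cstddef>
#include <list>
#include <stdexcept>

template <class T>
struct Chunk
{
    T object;
};

template <class T>
class MemPool
{
public:
    explicit MemPool(std::size_t chunkCount) : _free(chunkCount) {}

    T* allocate()
    {
        if (_free.empty())
            throw std::runtime_error("No available chunks");
        // splice moves the node between lists in O(1), without copying it.
        _allocated.splice(_allocated.begin(), _free, _free.begin());
        return &_allocated.front().object;
    }

    void deallocate(T* p)
    {
        // Deallocation still has to find the chunk owning p (O(n) search).
        for (typename std::list<Chunk<T> >::iterator it = _allocated.begin();
             it != _allocated.end(); ++it)
        {
            if (&it->object == p)
            {
                _free.splice(_free.begin(), _allocated, it);
                return;
            }
        }
        throw std::runtime_error("Pointer does not belong to this pool");
    }

private:
    std::list<Chunk<T> > _free;      // chunks available for allocation
    std::list<Chunk<T> > _allocated; // chunks currently handed out
};

Because splice only relinks list nodes, the address returned by allocate() stays valid while the chunk moves between the two lists.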
If the chunks are variable size, you'll need to do a bit more work. Allocating would require finding a chunk that is at least the requested size. Overallocating (allocating a larger chunk than needed) would make allocation and deallocation more efficient in terms of performance, but also mean that fewer chunks can be allocated from the pool. Alternatively, break a large chunk (from the free list) in two, and place one entry on each list (one representing the allocated part, the other the part left unallocated). If you do this, when deallocating, it may be desirable to merge chunks that are adjacent in memory (effectively, implement defragmentation of the free memory in the pool).
You will need to decide whether the pool can be used from multiple threads, and use appropriate synchronisation.

Use a fixed number of size bins, and make each bin a linked list.
For instance, let's say your bins are simply the integer multiples of the system page size (usually 4KiB), and you use 1MiB chunks; then you have 1MiB/4KiB = 256 bins. If a free makes an n-page region available in a chunk, append it to bin n. When allocating an n-page region, walk through the bins from n to 256 and choose the first available chunk.
To maximize performance, associate the bins with a bitmap, then scan from bit n-1 to bit 255 to find the first set bit (count the leading or trailing zeroes using compiler intrinsics like __builtin_clz and _BitScanForward). That's still not quite O(1) due to the number of bins, but it's pretty close.
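A rough sketch of that scan, assuming the 256 bins are tracked in four 64-bit words (bit k set means bin k+1 has a free region) and using the GCC/Clang intrinsic __builtin_ctzll; _BitScanForward64 plays the same role on MSVC:

#include <cstdint>

int firstBinAtLeast(const std::uint64_t bins[4], int n)
{
    int startWord = (n - 1) / 64;
    for (int w = startWord; w < 4; ++w)
    {
        std::uint64_t bits = bins[w];
        if (w == startWord)
            bits &= ~std::uint64_t(0) << ((n - 1) % 64); // ignore bins below n
        if (bits != 0)
            return w * 64 + __builtin_ctzll(bits) + 1;   // 1-based bin number
    }
    return -1; // no bin with a region of at least n pages
}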
If you're worried about memory overhead, you could append each chunk only once for each bin. That is, even if a chunk has 128 1-page regions available (maximally fragmented), bin 1 will still only link to the chunk once and reuse it 128 times.
To do this you'd have to link these regions together inside each chunk, which means each chunk will also need to store a list of size bins - but this can be more memory efficient because there are only at most 256 valid offsets inside each chunk, whereas the list needs to store full pointers.
Note that either way, if you don't want the free space inside each chunk to get fragmented, you'll want a quick way to remove chunks from bins in your list - which means using doubly linked lists. Obviously that adds additional memory overhead, but it might still be preferable to doing periodic free space defragmentation on the whole list.

Related

Should I use boost fast pool allocator for the following?

I have a server that throughout the course of 24 hours keeps adding new items to a set. Elements are not deleted over the 24-hour period; new elements just keep getting inserted.
Then at end of period the set is cleared, and new elements start getting added again for another 24 hours.
Do you think a fast pool allocator would be useful here as to reuse the memory and possibly help with fragmentation?
The set grows to around 1 million elements. Each element is about 1k.
It's highly unlikely …but you are of course free to test it in your program.
For a collection of that size and allocation pattern (more! more! more! + grow! grow! grow!), you should use an array of vectors. Keep the data in contiguous blocks, reserve() them when they are created, and you never need to reallocate/resize or waste space and bandwidth traversing lists. vector gives the best memory layout for a collection that large. Not one big vector (which would take a long time to resize), but several vectors, each representing a chunk (the ideal chunk size can vary by platform -- I'd start with 5MB each and measure from there). If you follow that, you'll see there is no need to resize or reuse memory: just create an allocation every few minutes for the next N objects -- there is no need for high-frequency object allocation and recreation.
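A minimal sketch of that blocks-of-vectors idea (the element type, block size and class name are illustrative assumptions, not a drop-in design):

#include <cstddef>
#include <vector>

struct Element { char data[1024]; }; // ~1k per element, as in the question

class BlockedStore
{
public:
    void add(const Element& e)
    {
        // Start a new block whenever the current one reaches capacity;
        // existing blocks are never resized, so nothing is ever moved.
        if (blocks_.empty() || blocks_.back().size() == kBlockElements)
        {
            blocks_.emplace_back();
            blocks_.back().reserve(kBlockElements);
        }
        blocks_.back().push_back(e);
    }

    void clear() { blocks_.clear(); } // end of the 24-hour period

private:
    static constexpr std::size_t kBlockElements = 5 * 1024; // ~5MB per block
    std::vector<std::vector<Element>> blocks_;
};

Since each block is reserved once and never resized, clear() at the end of the period releases everything in a handful of deallocations.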
A pool allocator suggests you want a lot of objects with discontiguous allocations and lots of inserts and deletes, like a list of big allocations -- which is bad here for a few reasons. If you want an implementation which optimizes for contiguous allocation at this size, just aim for the blocks-with-vectors approach. Allocation and lookup will both be close to minimal. At that point, allocation times should be tiny relative to the other work you do. You will also have nothing unusual or surprising about your allocation patterns. However, the fast pool allocator suggests you treat this collection as a list, which will have terrible performance for this problem.
Once you implement that block+vector approach, a better performance comparison (at that point) would be to compare boost's pool_allocator vs std::allocator. Of course, you could test all three, but memory fragmentation is likely going to be reduced far more by that block of vectors approach, if you implement it correctly. Reference:
If you are seriously concerned about performance, use fast_pool_allocator when dealing with containers such as std::list, and use pool_allocator when dealing with containers such as std::vector.

Why is deque using so much more RAM than vector in C++?

I have a problem I am working on where I need to use some sort of 2 dimensional array. The array is fixed width (four columns), but I need to create extra rows on the fly.
To do this, I have been using vectors of vectors, and I have been using some nested loops that contain this:
array.push_back(vector<float>(4));
array[n][0] = a;
array[n][1] = b;
array[n][2] = c;
array[n][3] = d;
n++;
to add the rows and their contents. The trouble is that I appear to be running out of memory with the number of elements I was trying to create, so I reduced the number that I was using. But then I started reading about deque, and thought it would allow me to use more memory because it doesn't have to be contiguous. I changed all mentions of "vector" to "deque" in this loop, as well as all declarations. But then it appeared that I ran out of memory again, this time even with the reduced number of rows.
I looked at how much memory my code is using, and when I am using deque, the memory rises steadily to above 2GB, and the program closes soon after, even when using the smaller number of rows. I'm not sure exactly where in this loop it is when it runs out of memory.
When I use vectors, the memory usage (for the same number of rows) is still under 1GB, even when the loop exits. It then goes on to a similar loop where more rows are added, still only reaching about 1.4GB.
So my question is: is it normal for deque to use more than twice the memory of vector, or am I making an erroneous assumption in thinking I can just replace the word "vector" with "deque" in the declarations/initializations and the above code?
Thanks in advance.
I'm using:
MS Visual C++ 2010 (32-bit)
Windows 7 (64-bit)
The real answer here has little to do with the core data structure. The answer is that MSVC's implementation of std::deque is especially awful and degenerates to an array of pointers to individual elements, rather than the array of arrays it should be. Frankly, only twice the memory use of vector is surprising. If you had a better implementation of deque you'd get better results.
It all depends on the internal implementation of deque (I won't speak about vector since it is relatively straightforward).
Fact is, deque has completely different guarantees than vector (the most important one being that it supports O(1) insertion at both ends while vector only supports O(1) insertion at the back). This in turn means the internal structures managed by deque have to be more complex than vector.
To allow that, a typical deque implementation will split its memory into several non-contiguous blocks. But each individual memory block has a fixed overhead to allow the memory management to work (e.g. whatever the size of the block, the system may need another 16 or 32 bytes just for bookkeeping). Since, contrary to a vector, a deque requires many small, independent blocks, the overhead stacks up, which can explain the difference you see. Also note that those individual memory blocks need to be managed (maybe in separate structures?), which probably means some (or a lot of) additional overhead too.
As for a way to solve your problem, you could try what @BasileStarynkevitch suggested in the comments. This will indeed reduce your memory usage, but it will only get you so far, because at some point you'll still run out of memory. And what if you try to run your program on a machine that only has 256MB RAM? Any other solution whose goal is to reduce your memory footprint while still trying to keep all your data in memory will suffer from the same problems.
A proper solution when handling large datasets like yours would be to adapt your algorithms and data structures in order to be able to handle small partitions at a time of your whole dataset, and load/save those partitions as needed in order to make room for the other partitions. Unfortunately since it probably means disk access, it also means a big drop in performance but hey, you can't eat the cake and have it too.
Theory
There are two common ways to efficiently implement a deque: either with a modified dynamic array or with a doubly linked list.
The modified dynamic array approach is basically a dynamic array that can grow from both ends, sometimes called an array deque. These array deques have all the properties of a dynamic array, such as constant-time random access, good locality of reference, and inefficient insertion/removal in the middle, with the addition of amortized constant-time insertion/removal at both ends instead of just one end.
There are several implementations of the modified dynamic array:
1. Allocating deque contents from the center of the underlying array, and resizing the underlying array when either end is reached. This approach may require more frequent resizings and waste more space, particularly when elements are only inserted at one end.
2. Storing deque contents in a circular buffer, and only resizing when the buffer becomes full. This decreases the frequency of resizings.
3. Storing contents in multiple smaller arrays, allocating additional arrays at the beginning or end as needed. Indexing is implemented by keeping a dynamic array containing pointers to each of the smaller arrays.
Conclusion
Different libraries may implement deques in different ways, but generally as a modified dynamic array. Most likely your standard library uses approach #1 to implement std::deque, and since you append elements only from one end, you ultimately waste a lot of space. That is why it creates the illusion that std::deque takes up more space than the usual std::vector.
Furthermore, if std::deque were implemented as a doubly-linked list, that would result in a waste of space too, since each element would need to accommodate 2 pointers in addition to your custom data.
An implementation with approach #3 (also a modified dynamic array) would again result in a waste of space to accommodate additional metadata such as pointers to all those small arrays.
In any case, std::deque is less efficient in terms of storage than a plain old std::vector. Without knowing what you want to achieve, I cannot confidently suggest which data structure you need. However, it seems like you don't even know what deques are for; therefore, what you really want in your situation is std::vector. Deques, in general, have a different application.
Deque can have additional memory overhead over vector because it's made of a few blocks instead of one contiguous block.
From en.cppreference.com/w/cpp/container/deque:
As opposed to std::vector, the elements of a deque are not stored contiguously: typical implementations use a sequence of individually allocated fixed-size arrays.
The primary issue is running out of memory.
So, do you need all the data in memory at once?
You may never be able to accomplish this.
Partial Processing
You may want to consider processing the data into "chunks" or smaller sub-matrices. For example, using the standard rectangular grid:
Read data of first quadrant.
Process data of first quadrant.
Store results (in a file) of first quadrant.
Repeat for remaining quadrants.
Searching
If you are searching for a particular item or a set of data, you can do that without reading the entire data set into memory.
Allocate a block (array) of memory.
Read a portion of the data into this block of memory.
Search the block of data.
Repeat steps 2 and 3 until the data is found.
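As a concrete (and assumed) example of those steps, searching a large binary file of floats for a target value one fixed-size block at a time might look like this:

#include <cstddef>
#include <fstream>
#include <vector>

bool containsValue(const char* path, float target)
{
    const std::size_t kBlockCount = 1 << 20;      // floats per block (~4MB)
    std::vector<float> block(kBlockCount);        // 1. allocate a block
    std::ifstream in(path, std::ios::binary);

    while (in.read(reinterpret_cast<char*>(block.data()),
                   block.size() * sizeof(float)) || in.gcount() > 0)
    {
        std::size_t read = in.gcount() / sizeof(float); // 2. read a portion
        for (std::size_t i = 0; i < read; ++i)          // 3. search the block
            if (block[i] == target)
                return true;
        // 4. repeat until found or end of file
    }
    return false;
}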
Streaming Data
If your application is receiving the raw data from an input source (other than a file), you will want to store the data for later processing.
This will require more than one buffer and is more efficient using at least two threads of execution.
The Reading Thread will be reading data into a buffer until the buffer is full. When the buffer is full, it will read data into another empty one.
The Writing Thread will initially wait until either the first read buffer is full or the read operation is finished. Next, the Writing Thread takes data out of the read buffer and writes to a file. The Write Thread then starts writing from the next read buffer.
This technique is called Double Buffering or Multiple Buffering.
Sparse Data
If there is a lot of zero or unused data in the matrix, you should try using Sparse Matrices. Essentially, this is a list of structures that hold the data's coordinates and the value. This also works when most of the data is a common value other than zero. This saves a lot of memory space; but costs a little bit more execution time.
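A minimal sketch of that coordinate-list idea (names and types are assumptions):

#include <cstddef>
#include <vector>

// Keep only the entries whose value differs from the common/default value,
// as {row, column, value} records.
struct SparseEntry
{
    std::size_t row;
    std::size_t col;
    float       value;
};

std::vector<SparseEntry> sparseMatrix; // everything not listed is the default value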
Data Compression
You could also change your algorithms to use data compression. The idea here is to store the data location, the value, and the number of contiguous equal values (a.k.a. runs). So instead of storing 100 consecutive data points of the same value, you would store the starting position (of the run), the value, and 100 as the quantity. This saves a lot of space, but requires more processing time when accessing the data.
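A small sketch of that run encoding (again with assumed names and a float payload):

#include <cstddef>
#include <vector>

// Consecutive equal values collapse into one {start, value, count} record.
struct Run
{
    std::size_t start;
    float       value;
    std::size_t count;
};

std::vector<Run> encodeRuns(const std::vector<float>& data)
{
    std::vector<Run> runs;
    for (std::size_t i = 0; i < data.size(); ++i)
    {
        if (!runs.empty() && runs.back().value == data[i])
            ++runs.back().count;                 // extend the current run
        else
            runs.push_back(Run{i, data[i], 1});  // start a new run
    }
    return runs;
}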
Memory Mapped File
There are libraries that can treat a file as memory. Essentially, they read in a "page" of the file into memory. When the requests go out of the "page", they read in another page. All this is performed "behind the scenes". All you need to do is treat the file like memory.
Summary
Arrays and deques are not your primary issue, quantity of data is. Your primary issue can be resolved by processing small pieces of data at a time, compressing the data storage, or treating the data in the file as memory. If you are trying to process streaming data, don't. Ideally, streaming data should be placed into a file and then processed later.
A historical purpose of a file is to contain data that doesn't fit into memory.

Linked list vs dynamic array for implementing a stack using vector class

I was reading up on the two different ways of implementing a stack: linked list and dynamic arrays. The main advantage of a linked list over a dynamic array was that the linked list did not have to be resized, while a dynamic array had to be resized if too many elements were inserted, hence wasting a lot of time and memory.
That got me wondering if this is true for C++ (as there is a vector class which automatically resizes whenever new elements are inserted)?
It's difficult to compare the two, because the patterns of their memory usage are quite different.
Vector resizing
A vector resizes itself dynamically as needed. It does that by allocating a new chunk of memory, moving (or copying) data from the old chunk to the new chunk, then releasing the old one. In a typical case, the new chunk is 1.5x the size of the old (contrary to popular belief, 2x seems to be quite unusual in practice). That means for a short time while reallocating, it needs memory equal to roughly 2.5x as much as the data you're actually storing. The rest of the time, the "chunk" that's in use is a minimum of 2/3rds full, and a maximum of completely full. If all sizes are equally likely, we can expect it to average about 5/6ths full. Looking at it from the other direction, we can expect about 1/6th, or about 17%, of the space to be "wasted" at any given time.
When we do resize by a constant factor like that (rather than, for example, always adding a specific size of chunk, such as growing in 4Kb increments) we get what's called amortized constant time addition. In other words, as the array grows, resizing happens exponentially less often. The average number of times items in the array have been copied tends to a constant (usually around 3, but depends on the growth factor you use).
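You can watch this behaviour directly by printing the capacity whenever push_back triggers a reallocation; the exact growth factor you see depends on your standard library:

#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v;
    std::size_t lastCapacity = v.capacity();
    for (int i = 0; i < 1000; ++i)
    {
        v.push_back(i);
        if (v.capacity() != lastCapacity) // a reallocation just happened
        {
            std::cout << "size " << v.size()
                      << " -> capacity " << v.capacity() << '\n';
            lastCapacity = v.capacity();
        }
    }
    return 0;
}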
Linked list allocations
Using a linked list, the situation is rather different. We never see resizing, so we don't see extra time or memory usage for some insertions. At the same time, we do see extra time and memory used essentially all the time. In particular, each node in the linked list needs to contain a pointer to the next node. Depending on the size of the data in the node compared to the size of a pointer, this can lead to significant overhead. For example, let's assume you need a stack of ints. In a typical case where an int is the same size as a pointer, that's going to mean 50% overhead -- all the time. It's increasingly common for a pointer to be larger than an int; twice the size is fairly common (64-bit pointer, 32-bit int). In such a case, you have ~67% overhead -- i.e., obviously enough, each node devoting twice as much space to the pointer as the data being stored.
Unfortunately, that's often just the tip of the iceberg. In a typical linked list, each node is dynamically allocated individually. At least if you're storing small data items (such as int) the memory allocated for a node may be (usually will be) even larger than the amount you actually request. So -- you ask for 12 bytes of memory to hold an int and a pointer -- but the chunk of memory you get is likely to be rounded up to 16 or 32 bytes instead. Now you're looking at overhead of at least 75% and quite possibly ~88%.
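A quick way to see the per-node overhead is to print the sizes involved; the Node here is a hypothetical stand-in for a list node holding an int:

#include <cstdio>

struct Node
{
    int data;    // 4 bytes of payload
    Node* next;  // 8-byte pointer on a typical 64-bit platform
};

int main()
{
    // Typically prints sizeof(int) = 4, sizeof(Node) = 16: the pointer plus
    // alignment padding means 12 of every 16 bytes are overhead, and the heap
    // allocation holding the node may be rounded up further still.
    std::printf("sizeof(int) = %zu, sizeof(Node) = %zu\n",
                sizeof(int), sizeof(Node));
    return 0;
}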
As far as speed goes, the situation is rather similar: allocating and freeing memory dynamically is often quite slow. The heap manager typically has blocks of free memory, and has to spend time searching through them to find the block that's most suited to the size you're asking for. Then it (typically) has to split that block into two pieces, one to satisfy your allocation, and another of the remaining memory it can use to satisfy other allocations. Likewise, when you free memory, it typically goes back to that same list of free blocks and checks whether there's an adjoining block of memory already free, so it can join the two back together.
Allocating and managing lots of blocks of memory is expensive.
Cache usage
Finally, with recent processors we run into another important factor: cache usage. In the case of a vector, we have all the data right next to each other. Then, after the end of the part of the vector that's in use, we have some empty memory. This leads to excellent cache usage -- the data we're using gets cached; the data we're not using has little or no effect on the cache at all.
With a linked list, the pointers (and probable overhead in each node) are distributed throughout our list. I.e., each piece of data we care about has, right next to it, the overhead of the pointer, and the empty space allocated to the node that we're not using. In short, the effective size of the cache is reduced by about the same factor as the overall overhead of each node in the list -- i.e., we might easily see only 1/8th of the cache storing the data we care about, and 7/8ths devoted to storing pointers and/or pure garbage.
Summary
A linked list can work well when you have a relatively small number of nodes, each of which is individually quite large. If (as is more typical for a stack) you're dealing with a relatively large number of items, each of which is individually quite small, you're much less likely to see a savings in time or memory usage. Quite the contrary, for such cases, a linked list is much more likely to basically waste a great deal of both time and memory.
Yes, what you say is true for C++. For this reason, the default container inside std::stack, which is the standard stack class in C++, is neither a vector nor a linked list, but a double ended queue (a deque). This has nearly all the advantages of a vector, but it resizes much better.
Basically, an std::deque is a linked list of arrays of sorts internally. This way, when it needs to resize, it just adds another array.
First, the performance trade-offs between linked-lists and dynamic arrays are a lot more subtle than that.
The vector class in C++ is, by requirement, implemented as a "dynamic array", meaning that it must have an amortized-constant cost for inserting elements into it. How this is done is usually by increasing the "capacity" of the array in a geometric manner, that is, you double the capacity whenever you run out (or come close to running out). In the end, this means that a reallocation operation (allocating a new chunk of memory and copying the current content into it) is only going to happen on a few occasions. In practice, this means that the overhead for the reallocations only shows up on performance graphs as little spikes at logarithmic intervals. This is what it means to have "amortized-constant" cost, because once you neglect those little spikes, the cost of the insert operations is essentially constant (and trivial, in this case).
In a linked-list implementation, you don't have the overhead of reallocations; however, you do have the overhead of allocating each new element on the freestore (dynamic memory). So, the overhead is a bit more regular (no spikes, which can sometimes be exactly what you need), but could be more significant than using a dynamic array, especially if the elements are rather inexpensive to copy (small in size, and simple objects). In my opinion, linked-lists are only recommended for objects that are really expensive to copy (or move). But at the end of the day, this is something you need to test in any given situation.
Finally, it is important to point out that locality of reference is often the determining factor for any application that makes extensive use and traversal of the elements. When using a dynamic array, the elements are packed together in memory one after the other and doing an in-order traversal is very efficient as the CPU can preemptively cache the memory ahead of the reading / writing operations. In a vanilla linked-list implementation, the jumps from one element to the next generally involves a rather erratic jumps between wildly different memory locations, which effectively disables this "pre-fetching" behavior. So, unless the individual elements of the list are very big and operations on them are typically very long to execute, this lack of pre-fetching when using a linked-list will be the dominant performance problem.
As you can guess, I rarely use a linked-list (std::list), as the advantageous applications are few and far between. Very often, for large and expensive-to-copy objects, it is preferable to simply use a vector of pointers: you get basically the same performance advantages (and disadvantages) as a linked list, but with less memory usage (for the linking pointers), and you get random-access capabilities if you need them.
The main case that I can think of, where a linked-list wins over a dynamic array (or a segmented dynamic array like std::deque) is when you need to frequently insert elements in the middle (not at either ends). However, such situations usually arise when you are keeping a sorted (or ordered, in some way) set of elements, in which case, you would use a tree structure to store the elements (e.g., a binary search tree (BST)), not a linked-list. And often, such trees store their nodes (elements) using a semi-contiguous memory layout (e.g., a breadth-first layout) within a dynamic array or segmented dynamic array (e.g., a cache-oblivious dynamic array).
Yes, it's true for C++ or any other language. Dynamic array is a concept. The fact that C++ has vector doesn't change the theory. The vector in C++ actually does the resizing internally so this task isn't the developers' responsibility. The actual cost doesn't magically disappear when using vector, it's simply offloaded to the standard library implementation.
std::vector is implemented using a dynamic array, whereas std::list is implemented as a linked list. There are trade-offs for using both data structures. Pick the one that best suits your needs.
As you indicated, a dynamic array can take a larger amount of time adding an item if it gets full, as it has to expand itself. However, it is faster to access since all of its members are grouped together in memory. This tight grouping also usually makes it more cache-friendly.
Linked lists don't need to resize ever, but traversing them takes longer as the CPU must jump around in memory.
That got me wondering if this is true for c++ as there is a vector class which automatically resizes whenever new elements are inserted.
Yes, it still holds, because a vector resize is a potentially expensive operation. Internally, if the pre-allocated size for the vector is reached and you attempt to add new elements, a new allocation takes place and the old data is moved to the new memory location.
From the C++ documentation:
vector::push_back - Add element at the end
Adds a new element at the end of the vector, after its current last element. The content of val is copied (or moved) to the new element.
This effectively increases the container size by one, which causes an automatic reallocation of the allocated storage space if -and only if- the new vector size surpasses the current vector capacity.
http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style
Skip to 44:40. You should prefer std::vector over std::list whenever possible, as Bjarne himself explains in the video. Because std::vector stores all of its elements next to each other in memory, it gets the advantage of cache locality, and this is true for adding and removing elements as well as for searching. He states that std::list is 50-100x slower than a std::vector.
If you really want a stack, you should really use std::stack instead of making your own.

suggestions for improving an allocator algorithm implementation

I have a Visual Studio 2008 C++ application where I'm using a custom allocator for standard containers such that their memory comes from a Memory Mapped File rather than the heap. This allocator is used for 4 different use cases:
104-byte fixed size structure std::vector< SomeType, MyAllocator< SomeType > > foo;
200-byte fixed size structure
304-byte fixed size structure
n-byte strings std::basic_string< char, std::char_traits< char >, MyAllocator< char > > strn;
I need to be able to allocate roughly 32MB total for each of these.
The allocator tracks memory usage using a std::map of pointers to allocation size. typedef std::map< void*, size_t > SuperBlock; Each SuperBlock represents 4MB of memory.
There is a std::vector< SuperBlock > of these in case one SuperBlock isn't enough space.
The algorithm used for the allocator goes like this:
For each SuperBlock: Is there space at the end of the SuperBlock? put the allocation there. (fast)
If not, search within each SuperBlock for an empty space of sufficient size and put the allocation there. (slow)
Still nothing? allocate another SuperBlock and put the allocation at the start of the new SuperBlock.
Unfortunately, step 2 can become VERY slow after a while. As copies of objects are made and temporary variables destroyed, I get a lot of fragmentation. This causes a lot of deep searching within the memory structure. Fragmentation is an issue as I have a limited amount of memory to work with (see note below).
Can anybody suggest improvements to this algorithm that would speed up the process? Do I need two separate algorithms (1 for the fixed-size allocations and one for the string allocator)?
Note: For those that need a reason: I'm using this algorithm in Windows Mobile where there's a 32MB process slot limit to the Heap. So, the usual std::allocator won't cut it. I need to put the allocations in the 1GB Large Memory Area to have enough space and that's what this does.
Can you have a separate memory allocation pool for each different fixed-size type you are allocating? That way there won't be any fragmentation, because the allocated objects will always align on n-byte boundaries. That doesn't help for the variable-length strings, of course.
There's an example of small-object allocation in Alexandrescu's Modern C++ Design that illustrates this principle and may give you some ideas.
For the fixed sized objects, you can create a fixed sized allocator. Basically you allocate blocks, partition into subblocks of the appropriate size and create a linked list with the result. Allocating from such a block is O(1) if there is memory available (just remove the first element from the list and return a pointer to it) as is deallocation (add the block to the free list). During allocation, if the list is empty, grab a new superblock, partition and add all blocks into the list.
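A sketch of that fixed-size scheme -- here the blocks come from ::operator new for brevity, but in your case they would come from the memory-mapped file, and the names and sizes are assumptions. The cell size should be a multiple of the required alignment (your 104-, 200- and 304-byte structures all are, for 8-byte alignment).

#include <cstddef>
#include <new>
#include <vector>

class FixedSizeAllocator
{
public:
    FixedSizeAllocator(std::size_t cellSize, std::size_t cellsPerBlock)
        : _cellSize(cellSize < sizeof(void*) ? sizeof(void*) : cellSize),
          _cellsPerBlock(cellsPerBlock),
          _freeList(0)
    {}

    ~FixedSizeAllocator()
    {
        for (std::size_t i = 0; i < _blocks.size(); ++i)
            ::operator delete(_blocks[i]);
    }

    void* allocate()
    {
        if (_freeList == 0)
            addBlock();                          // free list empty: new block
        void* cell = _freeList;
        _freeList = *static_cast<void**>(cell);  // pop the first free cell
        return cell;
    }

    void deallocate(void* cell)
    {
        *static_cast<void**>(cell) = _freeList;  // push the cell back
        _freeList = cell;
    }

private:
    void addBlock()
    {
        char* block = static_cast<char*>(::operator new(_cellSize * _cellsPerBlock));
        _blocks.push_back(block);
        // Partition the block into cells and link them into the free list.
        for (std::size_t i = 0; i < _cellsPerBlock; ++i)
        {
            void* cell = block + i * _cellSize;
            *static_cast<void**>(cell) = _freeList;
            _freeList = cell;
        }
    }

    std::size_t _cellSize;
    std::size_t _cellsPerBlock;
    void* _freeList;
    std::vector<char*> _blocks; // kept so the memory can be released later
};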
For the variable sized items, you can reduce the problem to fixed size blocks by allocating only blocks of known sizes: 32 bytes, 64 bytes, 128 bytes, 512 bytes. You will have to analyze the memory usage to come up with the different buckets so that you don't waste too much memory. For large objects, you can go back to a dynamic size allocation pattern, which will be slow, but hopefully the number of large objects is limited.
Building on Tim's answer, I would personally use something akin to BiBOP.
The basic idea is simple: use fixed-size pools.
There are some refinements to that.
First, the size of the pools is generally fixed. It depends on your allocation routine: typically, if you know the OS you're working on maps at least 4KB at once when you use malloc, then you use that value. For a memory mapped file, you might be able to increase this.
The advantage of fixed-size pools is that it nicely fights off fragmentation. All pages being of the same size, you can recycle an empty 256-bytes page into a 128-bytes page easily.
There is still some fragmentation for large objects, which are typically allocated outside of this system. But it's low, especially if you fit large objects into a multiple of the page size, this way the memory will be easy to recycle.
Second, how do you handle the pools? Using linked lists.
The pages are typically untyped (by themselves), so you have a free-list of pages in which to prepare new pages and put "recycled" pages.
For each size category you then have a list of "occupied" pages, in which memory has been allocated. For each page you keep:
the allocation size (for this page)
the number of allocated objects (to check for emptiness)
a pointer to the first free cell
a pointer to the previous and next pages (might point to the "head" of the list)
Each free-cell is itself a pointer (or index, depending on the size you have) to the next free-cell.
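Putting that bookkeeping into a struct, a page header might look roughly like this (field names and widths are assumptions; the in-page free list uses offsets rather than full pointers):

struct PageHeader
{
    unsigned    allocationSize;  // size of the cells in this page
    unsigned    allocatedCount;  // live objects; 0 means the page is empty
    unsigned    firstFreeCell;   // offset of the first free cell in the page
    PageHeader* prev;            // doubly linked list of occupied pages
    PageHeader* next;            //   of this size category
};
// Each free cell stores, in its first bytes, the offset of the next free cell.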
The list of "occupied" pages of a given size is simply managed:
on deletion: if you empty the page, then remove it from the list and push it into the recycled pages, otherwise, update the free-cell list of this page (note: finding the beginning of the current page is usually a simple modulo operation on the address)
on insertion: search starting from head, as soon as you find a non-full page, move it in front of the list (if it's not already) and insert your item
This scheme is really performant memory-wise, with only a single page reserved for indexing.
For multi-threaded / multi-process applications, you'll need to add synchronization (typically a mutex per page), in which case you could get inspiration from Google's tcmalloc (try to find another page instead of blocking, and use a thread-local cache to remember which page you last used).
Having said that, have you tried Boost.Interprocess ? It provides allocators.
For the fixed sizes you can easily use a small memory allocator type of allocator where you allocate a large block that's split into fixed-size chunks. You then create a vector of pointers to available chunks and pop/push as you allocate/free. This is very fast.
For variable length items, it's harder: You either have to deal with searching for available contiguous space or use some other approach. You might consider maintaining another map of all the free nodes ordered by block size, so you can lower_bound the map and if the next available node is say only 5% too big return it instead of trying to find usable available space of the exact size.
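A sketch of that size-ordered free map and lower_bound lookup (names are illustrative): keys are free block sizes, values are their addresses.

#include <cstddef>
#include <map>

typedef std::multimap<std::size_t, void*> FreeBySize;

void* takeBestFit(FreeBySize& freeBlocks, std::size_t size)
{
    FreeBySize::iterator it = freeBlocks.lower_bound(size);
    if (it == freeBlocks.end())
        return 0;             // nothing big enough; caller grabs a new SuperBlock
    void* block = it->second;
    freeBlocks.erase(it);     // the block is no longer free
    // The "only 5% too big" rule above would also check it->first against
    // size + size / 20 before accepting this block.
    return block;
}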
My inclination for variable-sized items would be to, if practical, avoid holding direct pointers to data and instead keep handles. Each handle would be the index of a superblock plus an index to an item within the superblock. Each superblock would have an item-list allocated top-down and items allocated bottom-up. Each item's allocation would be preceded by its length and by the index of the item it represents; use one bit of the index to indicate whether an item is 'pinned'.
If an item fits after the last allocated item, simply allocate it. If it would hit a pinned item, move the next-allocation mark past the pinned item, find the next higher pinned item, and try the allocation again. If the item would collide with the item-list but there's enough free space somewhere, compactify the block's contents (if one or more items are pinned, it may be better to use another superblock if one is available). Depending upon usage patterns, it may be desirable to start by only compactifying the stuff that was added since the last collection; if that doesn't provide enough space, then compactify everything.
Of course, if you only have a few discrete sizes of items, you can use simple fixed-sized-chunk allocators.
I agree with Tim - use memory pools to avoid fragmentation.
However you may be able to avoid some of the churn by storing pointers rather than objects in your vectors, perhaps ptr_vector?

std::string and its automatic memory resizing

I'm pretty new to C++, but I know you can't just use memory willy nilly like the std::string class seems to let you do. For instance:
std::string f = "asdf";
f += "fdsa";
How does the string class handle getting larger and smaller? I assume it allocates a default amount of memory and if it needs more, it news a larger block of memory and copies itself over to that. But wouldn't that be pretty inefficient to have to copy the whole string every time it needed to resize? I can't really think of another way it could be done (but obviously somebody did).
And for that matter, how do all the stdlib classes like vector, queue, stack, etc handle growing and shrinking so transparently?
Your analysis is correct — it is inefficient to copy the string every time it needs to resize. That's why common advice discourages that use pattern. Use the string's reserve function to ask it to allocate enough memory for what you intend to store in it. Then further operations will fill that memory. (But if your hint turns out to be too small, the string will still grow automatically, too.)
Containers will also usually try to mitigate the effects of frequent re-allocation by allocating more memory than they need. A common algorithm is that when a string finds that it's out of space, it doubles its buffer size instead of just allocating the minimum required to hold the new value. If the string is being grown one character at a time, this doubling algorithm reduces the time complexity to amortized linear time (instead of quadratic time). It also reduces the program's susceptibility to memory fragmentation.
Usually, there's a doubling algorithm. In other words, when it fills the current buffer, it allocates a new buffer that's twice as big, and then copies the current data over. This results in fewer allocate/copy operations than the alternative of growing by a single allocation block.
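A small experiment showing both points -- the stepwise capacity growth and the effect of reserve (the exact capacities printed depend on your standard library):

#include <cstddef>
#include <iostream>
#include <string>

int main()
{
    std::string grown;
    std::string reserved;
    reserved.reserve(1000);               // one allocation, done in advance

    std::size_t last = grown.capacity();
    for (int i = 0; i < 1000; ++i)
    {
        grown += 'x';
        reserved += 'x';
        if (grown.capacity() != last)     // a reallocation just happened
        {
            std::cout << "capacity grew to " << grown.capacity() << '\n';
            last = grown.capacity();
        }
    }
    std::cout << "reserved string capacity: " << reserved.capacity() << '\n';
    return 0;
}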
Although I do not know the exact implementation of std::string, most data structures that need to handle dynamic memory growth do so by doing exactly what you say - allocate a default amount of memory, and if more is needed then create a bigger block and copy yourself over.
The way you get around the obvious inefficiency problem is to allocate more memory than you need. The ratio of used memory to total memory of a vector/string/list/etc. is often called the load factor (also used for hash tables in a slightly different meaning). Usually it's a 1:2 ratio - that is, you assign twice the memory you need. When you run out of space, you assign a new amount of memory twice your current amount and use that. This means that over time, if you continue to add things to the vector/string/etc., you need to copy items over less and less often (as the memory growth is exponential while your inserting of new items is of course linear), and so the time taken for this method of memory handling is not as large as you might think. By the principles of amortized analysis, you can then see that inserting m items into a vector/string/list using this method is only Big-Theta of m, not m².