I have a very large number of arrays (the count is fixed at runtime, around 10-30 million). Each array has between 0 and 128 elements, each 6 bytes.
I need to store all the arrays in mmap'ed memory (so I can't use malloc), and the arrays need to be able to grow dynamically (up to 128 elements, and the arrays never shrink).
I implemented a naive approach: an int array representing the state of each 6-byte block in the mmap'ed memory. A value of 0xffffffff at an offset means the corresponding offset in the mmap'ed memory is free; any other value is the id of the array occupying it (the id is needed for defragmentation in my current implementation, since blocks can't be moved without knowing their array's id to update other data structures). On allocation, and whenever an array outgrows its allocation, it simply iterates until it finds enough free blocks and inserts at the corresponding offset.
This is what the allocation array and mmap'ed memory look like, kind of:
| 0xffffffff | 0xffffffff |     1234     |     1234     | 0xffffffff | ...
---------------------------------------------------------------------------
|    free    |    free    | array1234[0] | array1234[1] |    free    | ...
This approach, though, has a memory overhead of 4 bytes (one int) per 6-byte block, out to the offset of the furthest used block in the mmap'ed memory.
What better approaches are there for this specific case?
My ideal requirements for this are:
Memory overhead (any allocation tables + unused space) <= 1.5 bits per element + 4*6 bytes per array
O(1) allocation and growing of arrays
Boost.Interprocess seems to have a neat implementation of managed memory-mapped files, with provisions similar to malloc/free but for mapped files (i.e. you have a handle to a suitably-large memory-mapped file and you can ask the library to sub-allocate an unused part of the file for something, like an array). From the documentation:
Boost.Interprocess offers some basic classes to create shared memory
objects and file mappings and map those mappable classes to the
process' address space.
However, managing those memory segments is not easy for
non-trivial tasks. A mapped region is a fixed-length memory buffer and
creating and destroying objects of any type dynamically, requires a
lot of work, since it would require programming a memory management
algorithm to allocate portions of that segment. Many times, we also
want to associate names to objects created in shared memory, so all
the processes can find the object using the name.
Boost.Interprocess offers 4 managed memory segment classes:
To manage a shared memory mapped region (basic_managed_shared_memory class).
To manage a memory mapped file (basic_managed_mapped_file).
To manage a heap allocated (operator new) memory buffer (basic_managed_heap_memory class).
To manage a user provided fixed size buffer (basic_managed_external_buffer class).
The most important services of a managed memory segment are:
Dynamic allocation of portions of the memory segment.
Construction of C++ objects in the memory segment. These objects can be anonymous or we can associate a name to them.
Searching capabilities for named objects.
Customization of many features: memory allocation algorithm, index types or character types.
Atomic constructions and destructions so that if the segment is shared between two processes it's impossible to create two objects
associated with the same name, simplifying synchronization.
How many mmap'ed areas can you afford? If 128 is OK, then I'd create 128 areas corresponding to all the possible sizes of your arrays, and ideally a linked list of free entries for each area. That way you get a fixed record size within each area. Growing an array from N to N + 1 elements is then just moving its data from area[N] to area[N + 1], either at the end (if the linked list of empty entries for N + 1 is empty) or into an empty slot if not. The slot vacated in area[N] is added to its list of free entries.
UPDATE: The linked list can be embedded in the main structures, so no extra allocations are needed: the first field (an int) inside every possible record (from size 1 to 128) can be an index to the next free entry. For allocated entries it is always void (0xffffffff), but when an entry is free this index becomes a member of the corresponding linked chain.
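For illustration, a minimal sketch of that scheme, assuming one mmap'ed region per size class; the names (Area, recordAt, and so on) are illustrative, not from the original post. memcpy is used for the embedded link because 6-byte records don't maintain 4-byte alignment:

#include <cstddef>
#include <cstdint>
#include <cstring>

// One area per array size class (1..128 elements). A freed record's first
// 4 bytes hold the index of the next free record, so the free list lives
// entirely inside the mmap'ed memory.
struct Area
{
    static const std::uint32_t NONE = 0xffffffff;

    std::uint8_t* base;        // start of this size class's mmap'ed region
    std::uint32_t recordSize;  // N * 6 bytes for size class N (always >= 4)
    std::uint32_t freeHead;    // index of the first free record, or NONE
    std::uint32_t highWater;   // records handed out at the end so far

    std::uint8_t* recordAt(std::uint32_t i)
    {
        return base + static_cast<std::size_t>(i) * recordSize;
    }

    std::uint32_t allocate()    // O(1)
    {
        if (freeHead != NONE) {
            std::uint32_t i = freeHead;
            std::memcpy(&freeHead, recordAt(i), 4);  // pop the embedded link
            return i;
        }
        return highWater++;     // no free entry: grow at the end
    }

    void release(std::uint32_t i)   // O(1)
    {
        std::memcpy(recordAt(i), &freeHead, 4);      // push the embedded link
        freeHead = i;
    }
};

Growing an array from N to N + 1 elements is then area[N + 1].allocate(), a memcpy of the N * 6 payload bytes, and area[N].release(oldIndex).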
I devised and ultimately went with a memory allocation algorithm that just about lives up to my requirements: O(1) amortised allocation, very little fragmentation and very little overhead. Feel free to comment and I'll detail it when I get a chance.
Related
I want to estimate memory consumption in C++ when I allocate objects in the heap. I can start my calculation with sizeof(object) and round it up to the nearest multiple of the heap block (typically 8 bytes). But if the whole allocated block goes to the allocated object, how can the heap manager tell the size of the object when I ask it to delete the pointer?
If the heap manager tracks the size of each object, does it mean that I should add ~4 bytes per allocated object in my calculation to the total memory consumption for heap manager internal expenses? Or does it store this information in much more compact form? What are extra costs (memory-wise) of heap memory allocation?
I understand that my question is very implementation specific, but I appreciate any hints about the heap metadata storage of major implementations such as gcc (or maybe it is about libc).
Heap allocators are not free. There is a cost per block of allocation (both in size and, if using a best-fit algorithm, in search time), the cost of joining blocks upon free, and any lost size per block when the size requested is smaller than the block size returned. There is also a cost in terms of memory fragmentation. Consider placing a small 1-byte allocation in the middle of your heap: at this point, you can no longer return a contiguous block larger than half the heap - half your heap is fragmented. Good allocators fight all of the above and try to maximize all benefits.
Consider the following allocation system (used in many real world applications over a decade on numerous handheld game devices.)
Create a primary heap where each allocation has a prev ptr, next ptr, size, and possibly other info. Round it to 16 bytes per entry. Store this info before or after the actual memory pointer being returned - your choice as there are advantages to each. Yes, you are allocating requested size + 16 bytes here.
Now just keep a pointer to a free list and possibly a used list.
Allocation is done by finding a block on the free list big enough for our use and splitting it into the size requested and the remainder (first fit), or by searching the whole list for as exact a match as possible (best fit). Simple enough.
Freeing is moving the current item back into the free list, joining areas adjacent to each other if possible. You can see how this can go to O(n).
For smaller allocations, get a single allocation (either from newly created heap, or from global memory) that will be your unit allocation zone. Split this area up into "block size" chunk addresses and push these addresses onto a free stack. Allocation is popping an address from this list. Freeing is just pushing the allocation back to the list - both O(1).
Then inside your malloc/new/etc., check the size: if it is within the unit size, allocate from the unit allocator; otherwise use the O(n) allocator. My studies have shown that you can get 90-95% of allocations to fit within the unit allocator's block size without too much issue.
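A hedged sketch of that unit allocator (illustrative names; assumes the chunk size is at least sizeof(void*)). Free chunks are threaded into an intrusive stack by storing the next free address in the chunk itself, so push and pop are O(1); anything larger than the unit size would fall through to the O(n) free-list allocator described above:

#include <cstddef>

class UnitAllocator
{
    void*       head;       // top of the free stack
    std::size_t chunkSize;  // the unit block size

public:
    UnitAllocator(void* zone, std::size_t zoneSize, std::size_t chunk)
        : head(0), chunkSize(chunk)
    {
        // Split the zone into chunks and push each address onto the stack.
        char* p = static_cast<char*>(zone);
        for (std::size_t off = 0; off + chunk <= zoneSize; off += chunk) {
            *reinterpret_cast<void**>(p + off) = head;
            head = p + off;
        }
    }

    bool fits(std::size_t n) const { return n <= chunkSize; }

    void* allocate()            // O(1): pop an address from the stack
    {
        void* p = head;
        if (p) head = *static_cast<void**>(p);
        return p;
    }

    void release(void* p)       // O(1): push the address back
    {
        *static_cast<void**>(p) = head;
        head = p;
    }
};

The dispatch inside malloc is then just: if unit.fits(n), try unit.allocate() first; otherwise call the general allocator.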
Also, you can allocate slabs of memory for memory pools and leave them allocated while using them over and over. A few larger allocations are much cheaper to manage (Unix systems use this a lot...)
Advantages:
All unit allocations have no external fragmentation.
Small allocations run constant time.
Larger allocations can be tailored to the exact size of the data requested.
Disadvantages:
You pay an internal fragmentation cost when you want memory smaller than the block requested.
Many, many allocations and frees outside your small block size will eventually fragment your memory. This can lead to all sorts of unpleasantness. This can be managed with OS help or with care on alloc/free order.
Once you get 1000s of large allocations going, there is a CPU price. Slabs / multiple heaps can be used to fix this as well.
There are many, many schemes out there, but this one is simple and has been used in commercial apps for ages, so it's a good place to start.
How would one go about creating a custom MemoryManager to manage a given, contiguous chunk of memory without the aid of other memory managers (such as Malloc/New) in C++?
Here's some more context:
MemManager::MemManager(void* memory, unsigned char totalsize)
{
    Memory = memory;
    MemSize = totalsize;
}
I need to be able to allocate and free up blocks of this contiguous memory using a MemManager. The constructor is given the total size of the chunk in bytes.
An Allocate function should take in the amount of memory required in bytes and return a pointer to the start of that block of memory. If no memory is remaining, a NULL pointer is returned.
A Deallocate function should take in the pointer to the block of memory that must be freed and give it back to the MemManager for future use.
Note the following constraints:
-Aside from the chunk of memory given to it, the MemManager cannot use ANY dynamic memory
-As originally specified, the MemManager CANNOT use other memory managers to perform its functions, including new/malloc and delete/free
I have received this question in several job interviews already, but even hours of researching online did not help me, and I have failed every time. I have found similar implementations, but they have all either used malloc/new or were general-purpose and requested memory from the OS, which I am not allowed to do.
Note that I am comfortable using malloc/new and free/delete and have little trouble working with them.
I have tried implementations that utilize node objects in a LinkedList fashion that point to the block of memory allocated and state how many bytes were used. However, with those implementations I was always forced to create new nodes on the stack and insert them into the list, but as soon as they went out of scope the entire program broke since the addresses and memory sizes were lost.
If anyone has some sort of idea of how to implement something like this, I would greatly appreciate it. Thanks in advance!
EDIT: I forgot to directly specify this in my original post, but the objects allocated with this MemManager can be different sizes.
EDIT 2: I ended up using homogeneous memory chunks, which was actually very simple to implement thanks to the information provided by the answers below. The exact rules regarding the implementation itself were not specified, so I separated the memory into 8-byte blocks. If the user requested more than 8 bytes, I would be unable to give it, but if the user requested fewer than 8 bytes (but > 0) then I would give out extra memory. If the amount of memory passed in was not divisible by 8 then there would be wasted memory at the end, which I suppose is much better than using more memory than you're given.
I have tried implementations that utilize node objects in a LinkedList
fashion that point to the block of memory allocated and state how many
bytes were used. However, with those implementations I was always
forced to create new nodes on the stack and insert them into the
list, but as soon as they went out of scope the entire program broke
since the addresses and memory sizes were lost.
You're on the right track. You can embed the LinkedList node in the block of memory you're given with reinterpret_cast<>. Since you're allowed to store variables in the memory manager as long as you don't dynamically allocate memory, you can track the head of the list with a member variable. You might need to pay special attention to object size (Are all objects the same size? Is the object size greater than the size of your linked list node?)
Assuming the answers to the previous questions are yes, you can then process the block of memory and split it into smaller, object-sized chunks using a helper linked list that tracks free nodes. Your free node struct will be something like
struct FreeListNode
{
    FreeListNode* Next;
};
When allocating, all you do is remove the head node from the free list and return it. Deallocating is just inserting the freed block of memory into the free list. Splitting the block of memory up is just a loop:
// static_cast only needed if the constructor takes a void pointer; we
// can't perform pointer arithmetic on void*
char* memoryStart = static_cast<char*>(memory);
char* memoryEnd = memoryStart + totalSize;
for (char* blockStart = memoryStart; blockStart < memoryEnd; blockStart += objectSize)
{
    FreeListNode* freeNode = reinterpret_cast<FreeListNode*>(blockStart);
    freeNode->Next = freeListHead;
    freeListHead = freeNode;
}
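With the free list built, allocation and deallocation reduce to the pop and push just described. A minimal sketch of the fixed-size case, assuming freeListHead is a member of the manager (no size parameter is needed here because every chunk is objectSize bytes):

void* MemManager::Allocate()
{
    if (freeListHead == nullptr)
        return nullptr;                // out of chunks
    FreeListNode* node = freeListHead;
    freeListHead = node->Next;         // pop the head: O(1)
    return node;
}

void MemManager::Deallocate(void* ptr)
{
    FreeListNode* node = reinterpret_cast<FreeListNode*>(ptr);
    node->Next = freeListHead;         // push onto the list: O(1)
    freeListHead = node;
}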
Since, as you mentioned, the Allocate function takes in the object size, the scheme above will need to be modified to store metadata. You can do this by including the size of the free block in the free list node data. This removes the need to split up the initial block, but introduces complexity in Allocate() and Deallocate(). You'll also need to worry about memory fragmentation, because if you don't have a free block with enough memory to store the requested amount, there's nothing you can do other than fail the allocation. A couple of Allocate() algorithms might be:
1) Just return the first available block large enough to hold the request, updating the free block as necessary. This is O(n) in terms of searching the free list, but might not need to search a lot of free blocks and could lead to fragmentation problems down the road.
2) Search the free list for the smallest block that can hold the request (best fit). This is still O(n) in terms of searching the free list, because you have to look at every node to find the least wasteful one, but it can help delay fragmentation problems.
Either way, with variable sizes, you have to store metadata for allocations somewhere as well. If you can't dynamically allocate at all, the best place is before or after the user-requested block; if you add padding initialized to a known value, you can also detect buffer overflows/underflows during Deallocate() by checking the padding for changes. You can also add a compaction step as mentioned in another answer if you want to handle that.
One final note: you'll have to be careful when adding metadata to the FreeListNode helper struct, as the smallest free block size allowed is sizeof(FreeListNode). This is because you are storing the metadata in the free memory block itself. The more metadata you find yourself needing to store for your internal purposes, the more wasteful your memory manager will be.
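To make the variable-size variant concrete, here is a hedged first-fit sketch under the assumptions above: the free node carries the block size, and each returned allocation is preceded by a size header so Deallocate() can rebuild a free node from it. The struct and function names are illustrative, not from this answer:

#include <cstddef>

struct SizedFreeNode
{
    std::size_t    Size;   // total bytes in this free block, node included
    SizedFreeNode* Next;
};

void* Allocate(SizedFreeNode*& head, std::size_t bytes)
{
    std::size_t needed = bytes + sizeof(std::size_t);    // room for the header
    for (SizedFreeNode** link = &head; *link; link = &(*link)->Next) {
        SizedFreeNode* node = *link;
        if (node->Size < needed)
            continue;                                    // first fit
        if (node->Size - needed >= sizeof(SizedFreeNode)) {
            // Split: carve the request off the front, keep the tail free.
            SizedFreeNode* rest = reinterpret_cast<SizedFreeNode*>(
                reinterpret_cast<char*>(node) + needed);
            rest->Size = node->Size - needed;
            rest->Next = node->Next;
            *link = rest;
        } else {
            needed = node->Size;                         // hand out the whole block
            *link = node->Next;
        }
        *reinterpret_cast<std::size_t*>(node) = needed;  // size header before block
        return reinterpret_cast<char*>(node) + sizeof(std::size_t);
    }
    return nullptr;                                      // no block large enough
}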
When you manage memory, you generally want to use the memory you manage to store any metadata you need. If you look at any of the implementations of malloc (ptmalloc, phkmalloc, tcmalloc, etc...), you'll see that this is how they're generally implemented (neglecting any static data of course). The algorithms and structures are very different, for different reasons, but I'll try to give a little insight into what goes into generic memory management.
Managing homogeneous chunks of memory is different than managing non-homogeneous chunks, and it can be a lot simpler. An example...
MemoryManager::MemoryManager() {
    this->mem = malloc(size * count);  // backing store: 'count' chunks of 'size' bytes
    this->map.set();                   // set every bit: all chunks start out free
}
Allocating is a matter of finding the next set bit in the std::bitset (the compiler might optimize this), marking the chunk as allocated and returning it. De-allocation just requires calculating the index, and marking it as unallocated. A free list is another way (what's described here), but it's a little less memory efficient, and might not use the CPU cache well.
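To sketch those two operations under the same assumptions (map, mem, size and count are the members/constants used in the constructor above, and a set bit means the chunk is free):

void* MemoryManager::allocate()
{
    for (std::size_t i = 0; i < count; i++) {
        if (this->map.test(i)) {          // chunk i is free
            this->map.reset(i);           // mark it allocated
            return static_cast<char*>(this->mem) + i * size;
        }
    }
    return nullptr;                       // every chunk is in use
}

void MemoryManager::deallocate(void* p)
{
    // Calculate the chunk index from the pointer, then mark it free.
    std::size_t i =
        (static_cast<char*>(p) - static_cast<char*>(this->mem)) / size;
    this->map.set(i);
}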
A free list can be the basis for managing non-homogeneous chunks of memory, though. With this, you need to store the size of the chunk, in addition to the next pointer, in the chunk of memory. The size lets you split larger chunks into smaller chunks. This generally leads to fragmentation, though, since merging chunks is non-trivial. This is why most allocators keep lists of same-sized chunks, and try to map requests as closely as possible.
So I had a strange experience this evening.
I was working on a program in C++ that required some way of reading a long list of simple data objects from file and storing them in the main memory, approximately 400,000 entries. The object itself is something like:
class Entry
{
public:
    Entry(int x, int y, int type);
    Entry();
    ~Entry();
    // some other basic functions
private:
    int m_X, m_Y;
    int m_Type;
};
Simple, right? Well, since I needed to read them from file, I had some loop like
Entry** globalEntries;
globalEntries = new Entry[totalEntries]; // totalEntries read from file, about 400,000
for (int i = 0; i < totalEntries; i++)
{
    globalEntries[i] = new Entry(.......);
}
That addition to the program added about 25 to 35 megabytes to the program when I tracked it on the task manager. A simple change to stack allocation:
Entry* globalEntries;
globalEntries = new Entry[totalEntries];
for (int i = 0; i < totalEntries; i++)
{
    globalEntries[i] = Entry(.......);
}
and suddenly it only required 3 megabytes. Why is that happening? I know pointer objects have a little bit of extra overhead to them (4 bytes for the pointer address), but it shouldn't be enough to make THAT much of a difference. Could it be because the program is allocating memory inefficiently, and ending up with chunks of unallocated memory in between allocated memory?
Your code is wrong, or else I don't see how this worked. With new Entry[count] you create a new array of Entry (of type Entry*), yet you assign it to an Entry**, so I presume you actually used new Entry*[count].
What you did next was to create another new Entry object on the heap, and store it in the globalEntries array. So you need memory for 400,000 pointers + 400,000 elements. 400,000 pointers take 3 MiB of memory on a 64-bit machine. Additionally, you have 400,000 single Entry allocations, which will each require sizeof(Entry) plus potentially some more memory (for the memory manager - it might have to store the size of the allocation, the associated pool, alignment/padding, etc.). This additional book-keeping memory can quickly add up.
If you change your second example to:
Entry* globalEntries;
globalEntries = new Entry[count];
for (...) {
    globalEntries[i] = Entry(...);
}
memory usage should be equal to the stack approach.
Of course, ideally you'll use a std::vector<Entry>.
First of all, without specifying exactly which column you were watching, the number in the task manager means nothing. On a modern operating system it's difficult even to define what you mean by "used memory" - are we talking about private pages? The working set? Only the stuff that stays in RAM? Does reserved but not committed memory count? Who pays for memory shared between processes? Are memory mapped files included?
If you are watching some meaningful metric, it's impossible to see 3 MB of memory used - your object is at least 12 bytes (assuming 32-bit integers and no padding), so 400,000 elements will need about 4.58 MiB. Also, I'd be surprised if it worked with stack allocation - the default stack size in VC++ is 1 MB, and you should already have had a stack overflow.
Anyhow, it is reasonable to expect a different memory usage:
- The stack is (mostly) allocated right from the beginning, so it's memory you nominally consume even without really using it for anything (actually, virtual memory and automatic stack expansion make this a bit more complicated, but it's "true enough").
- The CRT heap is opaque to the task manager: all it sees is the memory given by the operating system to the process, not what the C heap has "really" in use; the heap grows (requesting memory from the OS) more than strictly necessary, to be ready for further memory requests - so what you see is how much memory it is ready to give away without further syscalls.
- Your "separate allocations" method has a significant overhead. The all-contiguous array you'd get with new Entry[size] costs size*sizeof(Entry) bytes, plus the heap bookkeeping data (typically a few integer-sized fields); the separate-allocations method costs at least size*sizeof(Entry) (the size of all the "bare elements") plus size*sizeof(Entry *) (the size of the pointer array) plus size+1 times the cost of each allocation. If we assume a 32-bit architecture with a cost of 2 ints per allocation, you quickly see that this costs size*24+8 bytes of memory, instead of size*12+8 for the contiguous array in the heap.
- The heap normally hands out blocks that aren't really the size you asked for, because it manages blocks of fixed sizes; so, if you allocate single objects like that you are probably also paying for some extra padding - supposing it has 16-byte blocks, you are paying 4 extra bytes per element by allocating them separately; this moves our memory estimate to size*28+8, i.e. an overhead of 16 bytes per 12-byte element.
C++ has two main memory types: heap and stack. With stack everything is clear to me, but concerning heap a question remains:
How is heap memory organised? I read about the heap data structure, but that doesn't seem applicable to this heap, because in C++ I can access any data on the heap, not just a minimum or maximum value.
So my question is: how is heap memory organized in C++? When we read "is allocated on the heap" what memory organization do we have in mind?
A common (though certainly not the only) implementation of the free store is a linked list of free blocks of memory (i.e., blocks not currently in use). The heap manager will typically allocate large blocks of memory (e.g., 1 megabyte at a time) from the operating system, then split these blocks up into pieces for use by your code when you use new and such. When you use delete, the block of memory you quit using will be added as a node on that linked list.
There are various strategies for how you use those free blocks. Three common ones are best fit, worst fit and first fit.
In best fit, you try to find a free block that's closest in size to the required allocation. If it's larger than required (typically after rounding the allocation size) you split it into two pieces: one to return for the allocation, and a new (smaller) free block to put back on the list to meet other allocation requests. Although this can seem like a good strategy, it's often problematic. The problem is that when you find the closest fit, the left-over block is often too small to be of much use. After a short time, you end up with a huge number of tiny blocks of free space, none of which is good for much of anything.
Worst fit combats that by instead finding the worst fit among the free blocks -- IOW, the largest block of free space. When it splits that block, what's left over will be as large as possible, maximizing the chance that it'll be useful for other allocations.
First fit just walks through the list of free blocks, and (as you'd guess from the name) uses the first block that's large enough to meet the requirement.
Quite a few also start with a search for an exact fit, and use that in preference to splitting a block.
Quite a few also keep (for example) a number of separate linked lists for different allocation sizes to minimize searching for a block of the right size.
In quite a few cases, the manager also has some code to walk through the list of free blocks to find any that are right next to each other. If it finds them, it will join the two small blocks into one larger block. Sometimes this is done right when you free/delete a block. More often, it's done lazily, to avoid joining and then re-splitting blocks when/if you're using a lot of the same size of blocks (which is fairly common).
Another possibility that's common when dealing with a large number of identically-sized items (especially small ones) is an array of blocks, with a bitset to specify which are free or in use. In this case, you typically keep track of an index into the bitset where the last free block was found. When a block is needed, just search forward from the last index until you find the next one for which the bitset says the block is free.
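A small sketch of that rolling-index search (illustrative names; a set bit means the block is free). Scanning resumes where the previous search stopped and wraps around once:

#include <bitset>
#include <cstddef>

// Returns the index of a claimed free block, or N if none is free.
template <std::size_t N>
std::size_t claim_free_block(std::bitset<N>& freeMap, std::size_t& lastIndex)
{
    for (std::size_t step = 0; step < N; ++step) {
        std::size_t i = (lastIndex + step) % N;   // forward from last hit
        if (freeMap.test(i)) {
            freeMap.reset(i);                     // mark the block in use
            lastIndex = i;                        // next search starts here
            return i;
        }
    }
    return N;
}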
Heap has two main meanings, referring to two different concepts:
A data structure for arranging data: a tree.
A pool of available memory, managed by allocating/freeing available memory blocks.
An introduction about Heap in Memory Management:
The heap is the other dynamic memory area, allocated/freed by
malloc/free and their variants....
Here heap doesn't mean the heap data structure. It is the memory area used for dynamic allocation: when we allocate memory dynamically, it is allocated in heap memory.
When you read "allocated on the heap", that's usually an implementation-specific realization of C++'s "dynamic storage duration". It means you've used new to allocate memory, and have a pointer to it that you now have to keep track of until you delete (or delete[]) it.
As for how it's organized, there's no one set way of doing so. At one point, if memory serves, the "heap" used to actually be a min-heap (organized by block size). That's implementation specific, though, and doesn't have to be that way. Any data structure will do, some better than others for given circumstances.
I have a Visual Studio 2008 C++ application where I'm using a custom allocator for standard containers such that their memory comes from a Memory Mapped File rather than the heap. This allocator is used for 4 different use cases:
104-byte fixed size structure std::vector< SomeType, MyAllocator< SomeType > > foo;
200-byte fixed size structure
304-byte fixed size structure
n-byte strings std::basic_string< char, std::char_traits< char >, MyAllocator< char > > strn;
I need to be able to allocate roughly 32MB total for each of these.
The allocator tracks memory usage using a std::map of pointers to allocation sizes:
typedef std::map< void*, size_t > SuperBlock;
Each SuperBlock represents 4MB of memory.
There is a std::vector< SuperBlock > of these in case one SuperBlock isn't enough space.
The algorithm used for the allocator goes like this:
For each SuperBlock: is there space at the end of the SuperBlock? Put the allocation there. (fast)
If not, search within each SuperBlock for an empty space of sufficient size and put the allocation there. (slow)
Still nothing? Allocate another SuperBlock and put the allocation at the start of the new SuperBlock.
Unfortunately, step 2 can become VERY slow after a while. As copies of objects are made and temporary variables destroyed, I get a lot of fragmentation. This causes a lot of deep searching within the memory structure. Fragmentation is an issue, as I have a limited amount of memory to work with (see note below).
Can anybody suggest improvements to this algorithm that would speed up the process? Do I need two separate algorithms (1 for the fixed-size allocations and one for the string allocator)?
Note: For those that need a reason: I'm using this algorithm in Windows Mobile where there's a 32MB process slot limit to the Heap. So, the usual std::allocator won't cut it. I need to put the allocations in the 1GB Large Memory Area to have enough space and that's what this does.
Can you have a separate memory allocation pool for each different fixed-size type you are allocating? That way there won't be any fragmentation, because the allocated objects will always align on n-byte boundaries. That doesn't help for the variable-length strings, of course.
There's an example of small-object allocation in Alexandrescu's Modern C++ Design that illustrates this principle and may give you some ideas.
For the fixed sized objects, you can create a fixed sized allocator. Basically you allocate blocks, partition into subblocks of the appropriate size and create a linked list with the result. Allocating from such a block is O(1) if there is memory available (just remove the first element from the list and return a pointer to it) as is deallocation (add the block to the free list). During allocation, if the list is empty, grab a new superblock, partition and add all blocks into the list.
For the variable-sized allocations, you can simplify them to fixed-size blocks by allocating only blocks of known sizes: 32 bytes, 64 bytes, 128 bytes, 512 bytes. You will have to analyze the memory usage to come up with the different buckets so that you don't waste too much memory. For large objects, you can go back to a dynamic-size allocation pattern, which will be slow, but hopefully the number of large objects is limited.
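For instance, routing a request to the right bucket is only a few lines; a hedged sketch (the sizes are just the examples above, and a real allocator would pick them from measured usage):

#include <cstddef>

static const std::size_t kBucketSizes[] = { 32, 64, 128, 512 };
static const int kBucketCount =
    sizeof(kBucketSizes) / sizeof(kBucketSizes[0]);

// Returns the index of the smallest bucket that fits the request, or -1
// to signal a large object that takes the dynamic-size path instead.
int bucket_for(std::size_t bytes)
{
    for (int i = 0; i < kBucketCount; ++i)
        if (bytes <= kBucketSizes[i])
            return i;
    return -1;
}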
Building on Tim's answer, I would personally use something akin to BiBOP (Big Bag of Pages).
The basic idea is simple: use fixed-size pools.
There are some refinements to that.
First, the size of the pools is generally fixed. It depends on your allocation routine: if you know the OS you're working on maps at least 4KB at once when you use malloc, then you use that value. For a memory-mapped file, you might be able to increase this.
The advantage of fixed-size pools is that it nicely fights off fragmentation. All pages being the same size, you can easily recycle an empty page of 256-byte cells into a page of 128-byte cells.
There is still some fragmentation for large objects, which are typically allocated outside of this system. But it's low, especially if you fit large objects into a multiple of the page size; this way the memory will be easy to recycle.
Second, how do you handle the pools? Using linked lists.
The pages are typically untyped (by themselves), so you have a free-list of pages in which to prepare new pages and put "recycled" pages.
For each size category you then have a list of "occupied" pages, in which memory has been allocated. For each page you keep:
the allocation size (for this page)
the number of allocated objects (to check for emptiness)
a pointer to the first free cell
a pointer to the previous and next pages (might point to the "head" of the list)
Each free-cell is itself a pointer (or index, depending on the size you have) to the next free-cell.
The list of "occupied" pages of a given size is simply managed:
on deletion: if you empty the page, then remove it from the list and push it into the recycled pages, otherwise, update the free-cell list of this page (note: finding the beginning of the current page is usually a simple modulo operation on the address)
on insertion: search starting from head, as soon as you find a non-full page, move it in front of the list (if it's not already) and insert your item
This scheme is really performant memory-wise, with only a single page reserved for indexing.
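A hedged sketch of the per-page bookkeeping listed above (field names and widths are illustrative). Because the header sits at the start of its page, the page for any pointer is found by masking the address, which is the modulo trick mentioned in the deletion step:

#include <cstdint>

struct PageHeader
{
    std::uint32_t allocSize;   // allocation size used for this page's cells
    std::uint32_t usedCount;   // live objects; 0 means the page can be recycled
    std::uint32_t firstFree;   // index of the first free cell, or a sentinel
    PageHeader*   prev;        // doubly-linked list of "occupied" pages
    PageHeader*   next;        //   for this size category
};

static const std::uintptr_t kPageSize = 4096;  // assumed page size/alignment

inline PageHeader* page_of(void* p)
{
    return reinterpret_cast<PageHeader*>(
        reinterpret_cast<std::uintptr_t>(p) & ~(kPageSize - 1));
}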
For multi-threaded / multi-process applications, you'll need to add synchronization (typically a mutex per page); you could take inspiration from Google's tcmalloc here (try to find another page instead of blocking; use a thread-local cache to remember which page you last used).
Having said that, have you tried Boost.Interprocess ? It provides allocators.
For the fixed sizes you can easily use a small-object allocator, where you allocate a large block that's split into fixed-size chunks. You then create a vector of pointers to the available chunks and pop/push as you allocate/free. This is very fast.
For variable-length items, it's harder: you either have to deal with searching for available contiguous space or use some other approach. You might consider maintaining another map of all the free nodes ordered by block size, so you can lower_bound the map and, if the next available node is say only 5% too big, return it instead of trying to find usable available space of the exact size.
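A hedged sketch of that idea (illustrative names): the free blocks live in a std::multimap keyed by size, lower_bound finds the smallest block that fits in O(log n), and a block no more than ~5% oversized is handed out whole rather than split:

#include <cstddef>
#include <map>
#include <utility>

typedef std::multimap<std::size_t, char*> FreeBySize;  // size -> block start

char* take_block(FreeBySize& freeBySize, std::size_t bytes)
{
    FreeBySize::iterator it = freeBySize.lower_bound(bytes);
    if (it == freeBySize.end())
        return 0;                       // nothing large enough
    std::size_t blockSize = it->first;
    char*       block     = it->second;
    freeBySize.erase(it);
    if (blockSize > bytes + bytes / 20) {
        // More than ~5% oversized: split and return the tail to the map.
        freeBySize.insert(std::make_pair(blockSize - bytes, block + bytes));
    }
    return block;
}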
My inclination for variable-sized items would be, if practical, to avoid holding direct pointers to data and instead keep handles. Each handle would be the index of a superblock, plus an index to an item within the superblock. Each superblock would have an item list allocated top-down and items allocated bottom-up. Each item's allocation would be preceded by its length and by the index of the item it represents; one bit of the index indicates whether the item is 'pinned'.
If an item fits after the last allocated item, simply allocate it. If it would hit a pinned item, move the next-allocation mark past the pinned item, find the next higher pinned item, and try the allocation again. If the item would collide with the item-list but there's enough free space somewhere, compactify the block's contents (if one or more items are pinned, it may be better to use another superblock if one is available). Depending upon usage patterns, it may be desirable to start by only compactifying the stuff that was added since the last collection; if that doesn't provide enough space, then compactify everything.
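A hedged sketch of that handle layout (field widths are illustrative assumptions): data is reached through the superblock's item list rather than by raw pointer, so the compaction step can move an item's bytes and fix up a single list entry.

#include <cstdint>

struct Handle
{
    std::uint16_t superblock;  // which superblock holds the item
    std::uint16_t item;        // slot in that superblock's item list
};

struct ItemHeader              // precedes each item's bytes in a superblock
{
    std::uint32_t length;      // length of this item's allocation
    std::uint32_t indexAndPin; // item-list index; top bit set => pinned
};

inline bool is_pinned(const ItemHeader& h)
{
    return (h.indexAndPin & 0x80000000u) != 0;
}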
Of course, if you only have a few discrete sizes of items, you can use simple fixed-sized-chunk allocators.
I agree with Tim - use memory pools to avoid fragmentation.
However, you may be able to avoid some of the churn by storing pointers rather than objects in your vectors - perhaps a ptr_vector?