C++ heap organisation - which data structure?

C++ has two main memory types: heap and stack. With stack everything is clear to me, but concerning heap a question remains:
How is heap memory organised? I read about the heap data structure, but that does not seem applicable to the heap, because in C++ I can access any data from the heap, not just a minimum or maximum value.
So my question is: how is heap memory organized in C++? When we read "is allocated on the heap" what memory organization do we have in mind?

A common (though certainly not the only) implementation of the free store is a linked list of free blocks of memory (i.e., blocks not currently in use). The heap manager will typically allocate large blocks of memory (e.g., 1 megabyte at a time) from the operating system, then split these blocks up into pieces for use by your code when you use new and such. When you use delete, the block of memory you quit using will be added as a node on that linked list.
There are various strategies for how you use those free blocks. Three common ones are best fit, worst fit and first fit.
In best fit, you try to find a free block that's closest in size to the required allocation. If it's larger than required (typically after rounding the allocation size) you split it into two pieces: one to return for the allocation, and a new (smaller) free block to put back on the list to meet other allocation requests. Although this can seem like a good strategy, it's often problematic. The problem is that when you find the closest fit, the left-over block is often too small to be of much use. After a short time, you end up with a huge number of tiny blocks of free space, none of which is good for much of anything.
Worst fit combats that by instead finding the worst fit among the free blocks -- IOW, the largest block of free space. When it splits that block, what's left over will be as large as possible, maximizing the chance that it'll be useful for other allocations.
First fit just walks through the list of free blocks, and (as you'd guess from the name) uses the first block that's large enough to meet the requirement.
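The three strategies differ only in which free block they pick. Here is a minimal sketch, assuming the free list is represented as a plain vector of block sizes; the function names are made up for illustration, and a real heap manager would walk a linked list of block headers instead:

```cpp
#include <cstddef>
#include <vector>

// Each function returns the index of the chosen free block,
// or kNoBlock if nothing is large enough. Hypothetical helpers.
constexpr int kNoBlock = -1;

// First fit: stop at the first block that is big enough.
int first_fit(const std::vector<std::size_t>& free_sizes, std::size_t request) {
    for (std::size_t i = 0; i < free_sizes.size(); ++i)
        if (free_sizes[i] >= request) return static_cast<int>(i);
    return kNoBlock;
}

// Best fit: the smallest block that is still big enough.
int best_fit(const std::vector<std::size_t>& free_sizes, std::size_t request) {
    int best = kNoBlock;
    for (std::size_t i = 0; i < free_sizes.size(); ++i)
        if (free_sizes[i] >= request &&
            (best == kNoBlock || free_sizes[i] < free_sizes[best]))
            best = static_cast<int>(i);
    return best;
}

// Worst fit: the largest block, so the left-over piece is as big as possible.
int worst_fit(const std::vector<std::size_t>& free_sizes, std::size_t request) {
    int worst = kNoBlock;
    for (std::size_t i = 0; i < free_sizes.size(); ++i)
        if (free_sizes[i] >= request &&
            (worst == kNoBlock || free_sizes[i] > free_sizes[worst]))
            worst = static_cast<int>(i);
    return worst;
}
```

With free blocks of 64, 16, 256 and 32 bytes and a 24-byte request, first fit picks the 64-byte block, best fit the 32-byte block, and worst fit the 256-byte block.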
Quite a few also start with a search for an exact fit, and use that in preference to splitting a block.
Quite a few also keep (for example) a number of separate linked lists for different allocation sizes to minimize searching for a block of the right size.
In quite a few cases, the manager also has some code to walk through the list of free blocks, to find any that are right next to each other. If it finds them, it will join the two small blocks into one larger block. Sometimes this is done right when you free/delete a block. More often, it's done lazily, to avoid joining then re-splitting blocks when/if you're using a lot of the same size of blocks (which is fairly common).
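A rough sketch of that joining pass, assuming the free blocks are tracked as (start offset, size) pairs kept in address order. A std::map stands in for the linked list here, and the names are invented for the example:

```cpp
#include <cstddef>
#include <iterator>
#include <map>

// Free blocks keyed by start offset within the heap; the value is the size.
// (Assumption: a std::map replaces the address-ordered linked list.)
using FreeMap = std::map<std::size_t, std::size_t>;

// Merge every run of free blocks that sit right next to each other.
void coalesce(FreeMap& free_blocks) {
    auto it = free_blocks.begin();
    while (it != free_blocks.end()) {
        auto next = std::next(it);
        if (next != free_blocks.end() &&
            it->first + it->second == next->first) {
            it->second += next->second;  // absorb the neighbour
            free_blocks.erase(next);     // then retry with the grown block
        } else {
            ++it;
        }
    }
}
```

A lazy manager would call something like this only when an allocation fails, rather than on every free.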
Another possibility that's common when dealing with a large number of identically-sized items (especially small ones) is an array of blocks, with a bitset to specify which are free or in use. In this case, you typically keep track of an index into the bitset where the last free block was found. When a block is needed, just search forward from the last index until you find the next one for which the bitset says the block is free.
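The bitset approach could look roughly like this. It is only a sketch: the class name and template parameters are assumptions, and it does no checking that a freed pointer really belongs to the pool:

```cpp
#include <bitset>
#include <cstddef>

// Fixed-size blocks in one array, with one bit per block: set means "in use".
template <std::size_t BlockSize, std::size_t BlockCount>
class BitsetPool {
public:
    void* allocate() {
        // Search forward from where the last free block was found,
        // wrapping around once.
        for (std::size_t n = 0; n < BlockCount; ++n) {
            std::size_t i = (hint_ + n) % BlockCount;
            if (!used_[i]) {
                used_[i] = true;
                hint_ = i + 1;
                return storage_ + i * BlockSize;
            }
        }
        return nullptr;  // every block is in use
    }

    void deallocate(void* p) {
        auto offset = static_cast<std::size_t>(
            static_cast<unsigned char*>(p) - storage_);
        used_[offset / BlockSize] = false;
    }

private:
    alignas(std::max_align_t) unsigned char storage_[BlockSize * BlockCount];
    std::bitset<BlockCount> used_;
    std::size_t hint_ = 0;
};
```

The per-block overhead is a single bit, which is why this works well for large numbers of small, identically sized items.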

"Heap" has two main meanings, referring to two different concepts:
A data structure for arranging data (a kind of tree).
A pool of available memory, managed by allocating and freeing memory blocks.
An introduction about Heap in Memory Management:
The heap is the other dynamic memory area, allocated/freed by
malloc/free and their variants....

Here heap doesn't mean the heap data structure. "Heap memory" is the region from which memory is allocated dynamically (with new or malloc). Global and static variables are not stored there: they have static storage duration and typically live in the program's data segments.

When you read "allocated on the heap", that usually refers to an implementation's way of providing C++'s "dynamic storage duration". It means you've used new to allocate memory, and have a pointer to it that you now have to keep track of until you delete (or delete[]) it.
As for how it's organized, there's no one set way of doing so. At one point, if memory serves, the "heap" used to actually be a min-heap (organized by block size). That's implementation specific, though, and doesn't have to be that way. Any data structure will do, some better than others for given circumstances.

Related

How C++ heap managers track the size of allocated objects?

I want to estimate memory consumption in C++ when I allocate objects in the heap. I can start my calculation with sizeof(object) and round it up to the nearest multiple of the heap block (typically 8 bytes). But if the whole allocated block goes to the allocated object, how can the heap manager tell the size of the object when I ask it to delete the pointer?
If the heap manager tracks the size of each object, does it mean that I should add ~4 bytes per allocated object in my calculation to the total memory consumption for heap manager internal expenses? Or does it store this information in much more compact form? What are extra costs (memory-wise) of heap memory allocation?
I understand that my question is very implementation specific, but I appreciate any hints about the heap metadata storage of major implementations such as gcc (or maybe it is about libc).
Heap allocators are not free. There is the cost per block of allocation (both in size and perhaps in search time if using a best-fit algorithm), the cost of joining blocks upon free, and any lost size per block when the size requested is smaller than the block size returned. There is also a cost in terms of memory fragmentation. Consider placing a small 1 byte allocation in the middle of your heap. At this point, you can no longer return a contiguous block larger than 1/2 the heap - half your heap is fragmented. Good allocators fight all the above and try to maximize all benefits.
Consider the following allocation system (used in many real world applications over a decade on numerous handheld game devices.)
Create a primary heap where each allocation has a prev ptr, next ptr, size, and possibly other info. Round it to 16 bytes per entry. Store this info before or after the actual memory pointer being returned - your choice as there are advantages to each. Yes, you are allocating requested size + 16 bytes here.
Now just keep a pointer to a free list and possibly a used list.
Allocation is done by finding a block on the free list big enough for our use and splitting it into the size requested and the remainder (first fit), or by searching the whole list for as exact a match as possible (best fit). Simple enough.
Freeing is moving the current item back into the free list, joining areas adjacent to each other if possible. You can see how this can go to O(n).
For smaller allocations, get a single allocation (either from newly created heap, or from global memory) that will be your unit allocation zone. Split this area up into "block size" chunk addresses and push these addresses onto a free stack. Allocation is popping an address from this list. Freeing is just pushing the allocation back to the list - both O(1).
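A minimal sketch of such a unit allocator, assuming the zone comes from a std::vector and that block_size is a multiple of whatever alignment the stored objects need. The class and member names are invented for the example:

```cpp
#include <cstddef>
#include <vector>

// One zone split into equal blocks, with a stack of free block addresses.
class UnitAllocator {
public:
    UnitAllocator(std::size_t block_size, std::size_t block_count)
        : zone_(block_size * block_count) {
        free_stack_.reserve(block_count);
        for (std::size_t i = 0; i < block_count; ++i)
            free_stack_.push_back(zone_.data() + i * block_size);
    }
    // The stack holds pointers into zone_, so copying would be unsafe.
    UnitAllocator(const UnitAllocator&) = delete;
    UnitAllocator& operator=(const UnitAllocator&) = delete;

    // O(1): pop an address off the free stack.
    void* allocate() {
        if (free_stack_.empty()) return nullptr;
        void* p = free_stack_.back();
        free_stack_.pop_back();
        return p;
    }

    // O(1): push the address back.
    void deallocate(void* p) {
        free_stack_.push_back(static_cast<unsigned char*>(p));
    }

    std::size_t free_blocks() const { return free_stack_.size(); }

private:
    std::vector<unsigned char> zone_;
    std::vector<unsigned char*> free_stack_;
};
```

Because the free stack was reserved up front, pushing a block back never reallocates.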
Then inside your malloc/new/etc, check size if it is within the unit size, allocate from the unit allocator, otherwise use the O(n) allocator. My studies have shown that you can get 90-95% of allocations to fit within the unit allocator's block size without too much issue.
Also, you can allocate slabs of memory for memory pools and leave them allocated while using them over and over. A few larger allocations are much cheaper to manage (Unix systems use this a lot...)
Advantages:
All unit allocations have no external fragmentation.
Small allocations run constant time.
Larger allocation can be tailored to the exact size of the data requested.
Disadvantages:
You pay an internal fragmentation cost when you want memory smaller than the block size.
Many, many allocations and frees outside your small block size will eventually fragment your memory. This can lead to all sorts of unpleasantness. This can be managed with OS help or with care on alloc/free order.
Once you get 1000s of large allocations going, there is a CPU price. Slabs / multiple heaps can be used to fix this as well.
There are many, many schemes out there, but this one is simple and has been used in commercial apps for ages though, so I wanted to start here.

The problems of basing an allocator on static memory

I've recently come up with an idea of pooling memory in a C++ program with a statically initialized array (i.e. "static byte Memory[Size]"). The process of allocating blocks of memory would be similar to calling malloc up front for the pool. I then built this allocator and there seem to be no problems. Are there any issues with this allocator architecture other than limited memory space?
Note: The way I reserve blocks of memory is by creating a linked list of blocks that store size, pointer and neighbors. Not that it is important to the question.
TI's sysbios actually provides an implementation for their microcontrollers that looks similar to what you describe. They call this implementation heapbuf
The heapbuf is a single array that has been split up in n equally sized blocks. The heapbuf also holds the pointer to the first empty block, and every empty block holds the pointer to the next empty element.
This specific implementation is only able to allocate blocks of exactly that size, however has no fragmentation issues. If you have a lot of small objects of roughly the same size, it should offer great performance (as allocation is of constant cost and only requires you to update the pointer to the first empty block).
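This is not TI's code, just a sketch of the same idea under my own naming: one array cut into equal blocks, where each free block stores the pointer to the next free one, so the pool needs no memory beyond the array and one head pointer.

```cpp
#include <cstddef>
#include <new>

template <std::size_t BlockSize, std::size_t BlockCount>
class FixedBlockPool {
    static_assert(BlockSize >= sizeof(void*),
                  "a free block must be able to hold a pointer");
    static_assert(BlockSize % alignof(void*) == 0,
                  "blocks must stay pointer-aligned");

public:
    FixedBlockPool() {
        // Thread every block onto the free list, last block first,
        // so that the first block ends up at the head.
        for (std::size_t i = BlockCount; i > 0; --i)
            deallocate(buffer_ + (i - 1) * BlockSize);
    }

    // Constant time: unlink the first free block.
    void* allocate() {
        if (head_ == nullptr) return nullptr;
        Node* n = head_;
        head_ = n->next;
        return n;
    }

    // Constant time: the freed block becomes the new list head.
    void deallocate(void* p) {
        head_ = new (p) Node{head_};
    }

private:
    struct Node { Node* next; };
    alignas(std::max_align_t) unsigned char buffer_[BlockSize * BlockCount];
    Node* head_ = nullptr;
};
```

Declared as a static object (for example `static FixedBlockPool<64, 128> pool;`), this gives the statically allocated pool from the question, restricted to one block size.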
Your note however, is important to the question. And the issue with blocks of different sizes is as follows:
The first block might not fit, so you need to do a list traversal until you find a large enough open space.
After you have allocated all your memory in blocks of 1k, you decide to free them all, and now you have a linked list of many free 1k blocks. Have you decided how and when to merge them?
The developers of freeRTOS also ran into these issues and therefore provide multiple heaps that take different compromises (eg, fast, but no freeing at all, or a bit slower but no merging of freed blocks, or slow but merging of freed blocks). There is an overview of their implementations here: http://www.freertos.org/a00111.html

Is there any benefit to use multiple heaps for memory management purposes?

I am a student of a system software faculty. Now I'm developing a memory manager for Windows. Here's my simple implementation of malloc() and free():
#include <windows.h>

// Private, growable heap with default options (initial and maximum size 0).
HANDLE heap = HeapCreate(0, 0, 0);

void* hmalloc(size_t size)
{
    return HeapAlloc(heap, 0, size);  // returns NULL on failure
}

void hfree(void* memory)
{
    HeapFree(heap, 0, memory);
}

int main()
{
    int* ptr1 = (int*)hmalloc(100 * sizeof(int));
    int* ptr2 = (int*)hmalloc(100 * sizeof(int));
    int* ptr3 = (int*)hmalloc(100 * sizeof(int));
    hfree(ptr2);
    hfree(ptr3);
    hfree(ptr1);
    return 0;
}
It works fine. But I can't understand is there a reason to use multiple heaps? Well, I can allocate memory in the heap and get the address to an allocated memory chunk. But here I use ONE heap. Is there a reason to use multiple heaps? Maybe for multi-threaded/multi-process applications? Please explain.
The main reason for using multiple heaps or custom allocators is better memory control. After lots of new/delete calls, memory can become fragmented, and the application loses performance (it will also consume more memory). Managing memory in a more controlled environment can reduce heap fragmentation.
Another use is preventing memory leaks in the application: you can just free the entire heap you allocated, and you don't need to bother with freeing all the objects allocated there.
Another use is for tightly packed objects. If you have, for example, a list, then you could allocate all the nodes in a smaller dedicated heap, and the app will gain performance because there will be fewer cache misses when iterating over the nodes.
Edit: memory management is however a hard topic and in some cases it is not done right. Andrei Alexandrescu had a talk at one point and he said that for some application replacing the custom allocator with the default one increased the performance of the application.
This is a good link that elaborates on why you may need multiple heap:
https://caligari.dartmouth.edu/doc/ibmcxx/en_US/doc/libref/concepts/cumemmng.htm
"Why Use Multiple Heaps?
Using a single runtime heap is fine for most programs. However, using multiple
heaps can be more efficient and can help you improve your program's performance
and reduce wasted memory for a number of reasons:
1- When you allocate from a single heap, you may end up with memory blocks on
different pages of memory. For example, you might have a linked list that
allocates memory each time you add a node to the list. If you allocate memory for
other data in between adding nodes, the memory blocks for the nodes could end up
on many different pages. To access the data in the list, the system may have to
swap many pages, which can significantly slow your program.
With multiple heaps, you can specify which heap you allocate from. For example,
you might create a heap specifically for the linked list. The list's memory blocks
and the data they contain would remain close together on fewer pages, reducing the
amount of swapping required.
2- In multithread applications, only one thread can access the heap at a time to
ensure memory is safely allocated and freed. For example, say thread 1 is
allocating memory, and thread 2 has a call to free. Thread 2 must wait until
thread 1 has finished its allocation before it can access the heap. Again, this
can slow down performance, especially if your program does a lot of memory
operations.
If you create a separate heap for each thread, you can allocate from them
concurrently, eliminating both the waiting period and the overhead required to
serialize access to the heap.
3- With a single heap, you must explicitly free each block that you allocate. If you
have a linked list that allocates memory for each node, you have to traverse the
entire list and free each block individually, which can take some time.
If you create a separate heap for that linked list, you can destroy it with a
single call and free all the memory at once.
4- When you have only one heap, all components share it (including the IBM C and
C++ Compilers runtime library, vendor libraries, and your own code). If one
component corrupts the heap, another component might fail. You may have trouble
discovering the cause of the problem and where the heap was damaged.
With multiple heaps, you can create a separate heap for each component, so if
one damages the heap (for example, by using a freed pointer), the others can
continue unaffected. You also know where to look to correct the problem."
A reason would be the scenario that you need to execute a program internally e.g. running simulation code. By creating your own heap you could allow that heap to have execution rights which by default for security reasons is turned off. (Windows)
You have some good thoughts and this'd work for C but in C++ you have destructors, it is VERY important they run.
You can think of all types as having constructors/destructors, just that logically "do nothing".
This is about allocators. See "The buddy algorithm" which uses powers of two to align and re-use stuff.
If I allocate 4 bytes somewhere, my allocator might allocate a 4kb section just for 4 byte allocations. That way I can fit 1024 4 byte things in the block, if I need more add another block and so forth.
Ask it for 4kb and it won't allocate that in the 4-byte block; it might have a separate one for larger requests.
This means you can keep big things together. If I allocate 17 bytes, then 13 bytes, then 1 byte, and the 13-byte block gets freed, I can only stick something of <=13 bytes in there.
Hence the buddy system and powers of 2, easy to do using lshifts, if I want a 2.5kb block, I allocate it as the smallest power of 2 that'll fit (4kb in this case) that way I can use the slot afterwards for <=4kb items.
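The rounding step on its own is tiny. A sketch using left shifts (the function name is made up, and it assumes the request is small enough that the result still fits in a size_t):

```cpp
#include <cstddef>

// Smallest power of two that is >= n, found by left-shifting.
// Assumption: n is at most the largest power of two a size_t can hold,
// otherwise the shift would overflow.
inline std::size_t round_up_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}
```

A 2.5kb request (2560 bytes) comes back as 4096, and a request that is already a power of two is returned unchanged.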
This is not for garbage collection, this is just keeping things more compact and neat, using your own allocator can stop calls to the OS (depending on the default implementation of new and delete they might already do this for your compiler) and make new/delete very quick.
Heap compacting is very different: you need a list of every pointer that points into your heap, or some way to traverse the entire memory graph (as Java does), so that when you move things around and "compact" them you can update everything that pointed to an object to point to where it now is.
The only time I ever used more than one heap was when I wrote a program that would build a complicated data structure. It would have been non-trivial to free the data structure by walking through it and freeing the individual nodes, but luckily for me the program only needed the data structure temporarily (while it performed a particular operation), so I used a separate heap for the data structure so that when I no longer needed it, I could free it with one call to HeapDestroy.
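The same "free everything with one call" idea can be sketched portably, without the Win32 heap functions, as a simple arena. This is an illustration under my own naming, not what HeapDestroy does internally, and note that it never runs destructors, so it only suits objects that don't need them:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

class Arena {
public:
    explicit Arena(std::size_t chunk_size = 64 * 1024)
        : chunk_size_(chunk_size) {}

    // Bump-allocate from the current chunk; grab a new chunk when it is full.
    // Assumption: align is a power of two no larger than alignof(max_align_t).
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (chunks_.empty() || start + size > capacity_) {
            capacity_ = size > chunk_size_ ? size : chunk_size_;
            chunks_.push_back(std::make_unique<unsigned char[]>(capacity_));
            start = 0;
        }
        used_ = start + size;
        return chunks_.back().get() + start;
    }

    // Drop every allocation at once; no per-node walk is needed.
    void release() {
        chunks_.clear();
        used_ = 0;
        capacity_ = 0;
    }

    std::size_t chunk_count() const { return chunks_.size(); }

private:
    std::size_t chunk_size_;
    std::size_t used_ = 0;      // bytes used in the current chunk
    std::size_t capacity_ = 0;  // size of the current chunk
    std::vector<std::unique_ptr<unsigned char[]>> chunks_;
};
```

You would build the temporary data structure out of arena.allocate() calls and then call release() (or just let the Arena go out of scope) when the operation is finished.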

suggestions for improving an allocator algorithm implementation

I have a Visual Studio 2008 C++ application where I'm using a custom allocator for standard containers such that their memory comes from a Memory Mapped File rather than the heap. This allocator is used for 4 different use cases:
104-byte fixed size structure std::vector< SomeType, MyAllocator< SomeType > > foo;
200-byte fixed size structure
304-byte fixed size structure
n-byte strings std::basic_string< char, std::char_traits< char >, MyAllocator< char > > strn;
I need to be able to allocate roughly 32MB total for each of these.
The allocator tracks memory usage using a std::map of pointers to allocation size. typedef std::map< void*, size_t > SuperBlock; Each SuperBlock represents 4MB of memory.
There is a std::vector< SuperBlock > of these in case one SuperBlock isn't enough space.
The algorithm used for the allocator goes like this:
For each SuperBlock: Is there space at the end of the SuperBlock? put the allocation there. (fast)
If not, search within each SuperBlock for an empty space of sufficient size and put the allocation there. (slow)
Still nothing? allocate another SuperBlock and put the allocation at the start of the new SuperBlock.
Unfortunately, step 2 can become VERY slow after a while. As copies of objects are made and temporary variables destroyed I get a lot of fragmentation. This causes a lot of deep searching within the memory structure. Fragmentation is an issue as I have a limited amount of memory to work with (see note below).
Can anybody suggest improvements to this algorithm that would speed up the process? Do I need two separate algorithms (1 for the fixed-size allocations and one for the string allocator)?
Note: For those that need a reason: I'm using this algorithm in Windows Mobile where there's a 32MB process slot limit to the Heap. So, the usual std::allocator won't cut it. I need to put the allocations in the 1GB Large Memory Area to have enough space and that's what this does.
Can you have a separate memory allocation pool for each different fixed-size type you are allocating? That way there won't be any fragmentation, because the allocated objects will always align on n-byte boundaries. That doesn't help for the variable-length strings, of course.
There's an example of small-object allocation in Alexandrescu's Modern C++ design that illustrates this principle and may give you some ideas.
For the fixed sized objects, you can create a fixed sized allocator. Basically you allocate blocks, partition into subblocks of the appropriate size and create a linked list with the result. Allocating from such a block is O(1) if there is memory available (just remove the first element from the list and return a pointer to it) as is deallocation (add the block to the free list). During allocation, if the list is empty, grab a new superblock, partition and add all blocks into the list.
For the variable sized list, you can simplify it to the fixed size block by allocating only blocks of known sizes: 32 bytes, 64 bytes, 128 bytes, 512 bytes. You will have to analyze the memory usage to come up with the different buckets so that you don't waste too much memory. For large objects, you can go back to a dynamic size allocation pattern, that will be slow, but hopefully the amount of large objects is limited.
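Picking the bucket is a one-liner once the sizes are chosen. A sketch using the four sizes mentioned above (the names are invented, and the real bucket sizes should come from your own profiling):

```cpp
#include <array>
#include <cstddef>

// Bucket sizes taken from the example above; tune them to your workload.
constexpr std::array<std::size_t, 4> kBuckets{32, 64, 128, 512};

// Returns the block size to use for a request, or 0 meaning
// "too large for any bucket, fall back to the dynamic-size allocator".
inline std::size_t bucket_for(std::size_t request) {
    for (std::size_t bucket : kBuckets)
        if (request <= bucket) return bucket;
    return 0;
}
```

A 129-byte string would land in the 512-byte bucket here, which shows why the gap between 128 and 512 may waste memory if many strings fall in that range.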
Building on Tim's answer, I would personally use something akin to BiBOP.
The basic idea is simple: use fixed-size pools.
There are some refinements to that.
First, the size of the pools is generally fixed. It depends on your allocation routine: typically, if you know the OS you're working on maps at least 4KB at once when you use malloc, then you use that value. For a memory mapped file, you might be able to increase this.
The advantage of fixed-size pools is that it nicely fights off fragmentation. All pages being of the same size, you can recycle an empty 256-bytes page into a 128-bytes page easily.
There is still some fragmentation for large objects, which are typically allocated outside of this system. But it's low, especially if you fit large objects into a multiple of the page size, this way the memory will be easy to recycle.
Second, how to handle the pools ? Using linked-lists.
The pages are typically untyped (by themselves), so you have a free-list of pages in which to prepare new pages and put "recycled" pages.
For each size category you then have a list of "occupied" pages, in which memory has been allocated. For each page you keep:
the allocation size (for this page)
the number of allocated objects (to check for emptiness)
a pointer to the first free cell
a pointer to the previous and next pages (might point to the "head" of the list)
Each free-cell is itself a pointer (or index, depending on the size you have) to the next free-cell.
The list of "occupied" pages of a given size is simply managed:
on deletion: if you empty the page, then remove it from the list and push it into the recycled pages, otherwise, update the free-cell list of this page (note: finding the beginning of the current page is usually a simple modulo operation on the address)
on insertion: search starting from head, as soon as you find a non-full page, move it in front of the list (if it's not already) and insert your item
This scheme is really performant memory-wise, with only a single page reserved for indexing.
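The "simple modulo operation" mentioned for deletion can be sketched as follows, assuming 4096-byte pages that are aligned to their own size (both the constant and the function name are assumptions for the example):

```cpp
#include <cstdint>

// Assumed page size; pages must start at addresses that are multiples of it.
constexpr std::uintptr_t kPageSize = 4096;

// Given any address inside a page, return the address where the page begins.
inline void* page_start(const void* p) {
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<void*>(addr - addr % kPageSize);
}
```

The page's bookkeeping (allocation size, object count, free-cell pointer) can then live at the returned address, so no separate lookup table is needed on delete.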
For multi-threaded / multi-process applications, you'll need to add synchronization (typically a mutex per page), in which case you could take inspiration from Google's tcmalloc (try to find another page instead of blocking, use a thread-local cache to remember which page you last used).
Having said that, have you tried Boost.Interprocess ? It provides allocators.
For the fixed sizes you can easily use a small memory allocator type of allocator where you allocate a large block that's split into fixed-size chunks. You then create a vector of pointers to available chunks and pop/push as you allocate/free. This is very fast.
For variable length items, it's harder: You either have to deal with searching for available contiguous space or use some other approach. You might consider maintaining another map of all the free nodes ordered by block size, so you can lower_bound the map and if the next available node is say only 5% too big return it instead of trying to find usable available space of the exact size.
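A sketch of that size-ordered lookup, assuming free blocks are recorded as size -> offset in a std::multimap. The function name and the percentage parameter are made up for the example:

```cpp
#include <cstddef>
#include <map>

// Free blocks ordered by size: key is block size, value is its offset.
using FreeBySize = std::multimap<std::size_t, std::size_t>;

// Take the smallest free block that fits, but only if it is at most
// slack_percent larger than the request. On success the block is removed
// from the map and its offset is written to offset_out.
bool take_close_fit(FreeBySize& free_blocks, std::size_t request,
                    std::size_t slack_percent, std::size_t& offset_out) {
    auto it = free_blocks.lower_bound(request);  // first block >= request
    if (it == free_blocks.end()) return false;
    if (it->first > request + request * slack_percent / 100) return false;
    offset_out = it->second;
    free_blocks.erase(it);
    return true;
}
```

When this returns false you would fall back to the slower path (splitting a bigger block or taking a new superblock).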
My inclination for variable-sized items would be to, if practical, avoid holding direct pointers to data and instead keep handles. Each handle would be the index of a superblock, and an index to an item within the superblock. Each superblock would have an item-list allocated top-down and items allocated bottom-up. Each item's allocation would be preceded by its length, and by the index of the item it represents; use one bit of the index to indicate whether an item is 'pinned'.
If an item fits after the last allocated item, simply allocate it. If it would hit a pinned item, move the next-allocation mark past the pinned item, find the next higher pinned item, and try the allocation again. If the item would collide with the item-list but there's enough free space somewhere, compactify the block's contents (if one or more items are pinned, it may be better to use another superblock if one is available). Depending upon usage patterns, it may be desirable to start by only compactifying the stuff that was added since the last collection; if that doesn't provide enough space, then compactify everything.
Of course, if you only have a few discrete sizes of items, you can use simple fixed-sized-chunk allocators.
I agree with Tim - use memory pools to avoid fragmentation.
However you may be able to avoid some of the churn by storing pointers rather than objects in your vectors, perhaps ptr_vector?

Proper layout for a mempool/memory allocator? (which algorithm)

Hello, I'm thinking about trying to expand my skills by trying some things I've never done before. One thing that has always mystified me a bit is memory allocators and memory pools. What I want to do is take a block of memory and only allocate the memory from the system once. I have that currently set up, the memory is an array of bytes (or chars) that is 65535 for my testing purpose.
I have two algorithms that I've considered using.
First is an algorithm where the entire block of data is appended with the amount of memory left over, and a pointer (or rather an offset) to the first allocated block (or the header of the block), then each allocation is preceded by the size of the allocation, and the previous and next allocation, so I can release the allocation easily. I then can generate the largest and smallest blocks by looking at the space after the current allocations.
The other option I have is to add a second offset before the allocated memory and make that point to the first unallocated block, then each unallocated block also has the Previous and Next allocation, along with a size so that I can easily find a place where my next allocation can be placed.
The issue is I can't tell which is "proper". Assume we'll have variable size allocations (but most will be at a size that this isn't that much overhead.) The first will be slower to get the largest and smallest possible blocks, but I can store those and manipulate them if necessary to avoid regenerating them. However the second would take a longer time to deallocate (due to having to find which deallocator is next to which allocator) and not necessarily give any benefits to allocation. In fact it will require more specialized code for the case where I have less than 6 bytes left over (2 bytes for size, 2 for prev offset, 2 for next offset).
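For reference, the 6-byte header described here could be laid out as below. This is only a sketch with invented names, assuming 16-bit offsets are enough because the pool is 65535 bytes:

```cpp
#include <cstddef>
#include <cstdint>

// Per-block bookkeeping: 2 bytes each for size, previous and next offset.
struct BlockHeader {
    std::uint16_t size;  // payload size in bytes
    std::uint16_t prev;  // offset of the previous block in the pool
    std::uint16_t next;  // offset of the next block in the pool
};
static_assert(sizeof(BlockHeader) == 6, "header is expected to be 6 bytes");

// After carving `request` bytes out of a free block, is the remainder big
// enough to carry its own header? If not, the split needs special handling.
inline bool leftover_can_hold_header(std::uint16_t free_size,
                                     std::uint16_t request) {
    return free_size >= request &&
           static_cast<std::size_t>(free_size - request) >= sizeof(BlockHeader);
}
```

One common way around the special case is to hand out the whole block when the leftover is too small, accepting a few wasted bytes.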
My gut tells me the first would be far superior, but something about the second is enticing. Any opinions? Or is there a far easier solution?
The problem with this type of approach is handling fragmentation (and hence how to coalesce adjoining free chunks of memory). Look at this question: Designing and coding a non-fragmentizing static memory pool, which discusses a different approach...