std::map allocation node packing?

I have noticed that the std::map implementation of Visual Studio (2010) allocates a new single block of memory for each node in its red-black tree. That is, with the default allocation scheme of the Visual Studio STL's std::map, each element in the map causes a separate block of raw memory to be allocated via operator new (ultimately malloc).
This appears a bit wasteful to me: Wouldn't it make more sense to allocate the nodes in blocks of "(small) n", just as std::vector implementations over-allocate on growth?
So I'd like the following points clarified:
Is my assertion about the default allocation scheme actually correct?
Do "all" STL implementations of std::map work this way?
Is there anything in the std preventing a std::map implementation from putting its nodes into blocks of memory instead of allocating a new block of memory (via its allocator) for each node? (Complexity guarantees, etc.)?
Note: This is not about premature optimization. If it's about optimization at all, then it's about this: if an app has a problem with (std::)map memory fragmentation, are there alternatives to using a custom allocator that uses a memory pool? This question is not about custom allocators but about how the map implementation uses its allocator. (Or so I hope it is.)

Your assertion is correct for most implementations of std::map.
To my knowledge, there is nothing in the standard preventing a map from using an allocation scheme like the one you describe, and you can get that behavior today with a custom allocator. But forcing the scheme on all maps could be wasteful: because map has no a priori knowledge of how it will be used, certain usage patterns could prevent deallocation of mostly-unused blocks. For example, say blocks were allocated 4 nodes at a time, and a particular map is filled with 40 nodes and then 30 nodes are erased. The worst case leaves one live node per block, and the map cannot move that node, since doing so would invalidate pointers/references/iterators to it.
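You can check the one-allocation-per-node claim on your own implementation with a minimal counting allocator. This is only a hedged sketch: the allocator below is hypothetical instrumentation, not part of any library, and the count may differ slightly if your implementation also allocates sentinel nodes.

```cpp
#include <cstddef>
#include <cstdio>
#include <map>
#include <memory>

// Shared across all rebound instantiations of the allocator.
static std::size_t g_allocations = 0;

template <class T>
struct CountingAlloc {
    using value_type = T;
    CountingAlloc() = default;
    template <class U> CountingAlloc(const CountingAlloc<U>&) {}

    T* allocate(std::size_t n) {
        ++g_allocations;                        // one call per block requested
        return std::allocator<T>{}.allocate(n);
    }
    void deallocate(T* p, std::size_t n) {
        std::allocator<T>{}.deallocate(p, n);
    }
};
template <class T, class U>
bool operator==(const CountingAlloc<T>&, const CountingAlloc<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAlloc<T>&, const CountingAlloc<U>&) { return false; }

int main() {
    std::map<int, int, std::less<int>,
             CountingAlloc<std::pair<const int, int>>> m;
    for (int i = 0; i < 100; ++i) m[i] = i;
    // On typical node-based implementations this prints 100 (or 101 if a
    // sentinel node is also allocated): one allocation per node.
    std::printf("allocations: %zu\n", g_allocations);
}
```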

When you insert elements into a map, it's guaranteed that existing iterators won't be invalidated. Therefore, if you insert an element "B" between two nodes A and C that happen to be contiguous and inside the same heap allocated area, you can't shuffle them to make space, and B will have to be put elsewhere. I don't see any particular problem with that, except that managing such complexities will swell the implementation. If you erase elements then iterators can't be invalidated either, which implies any memory allocation has to hang around until all the nodes therein are erased. You'd probably need a freelist within each "swollen node"/vector/whatever-you-want-to-call-it - effectively duplicating at least some of the time-consuming operations that new/delete currently do for you.

I'm quite certain I've never seen an implementation of std::map that attempted to coalesce multiple nodes into a single allocation block. At least right offhand I can't think of a reason it couldn't work, but I think most implementors would see it as unnecessary, and leave optimization of memory allocation to the allocator instead of worrying about it much in the map itself.
Admittedly, most custom allocators are written to deal better with allocation of a large number of small blocks. You could probably render the vast majority of such optimization unnecessary by writing map (and, of course, set, multiset, and multimap) to use larger allocations instead. OTOH, given that allocators optimizing small-block allocation are easily and widely available, there's probably not much motivation to change the map implementation this way either.

I think the only thing you cannot do is invalidate iterators, which you might have to do if you have to reallocate your storage. Having said that, I've seen implementations using a single sorted array of objects wrapped in the std::map interface. That was done for a certain reason, of course.
Actually, what you can do is just instantiate your std::map with your custom allocator, which will find memory for new nodes in a special, non-wasteful way.

This appears a bit wasteful to me. Wouldn't it make more sense to allocate the nodes in blocks of "(small) n", just as std::vector implementations over-allocate on growth
Interestingly, I see it in a completely different way. I find it appropriate, and it doesn't waste any memory, at least with the default STL allocators on Windows (MS VS 2008), HP-UX (gcc with STLport), and Linux (gcc without STLport). What is important is that these allocators do care about memory fragmentation, and it seems they handle this issue pretty well. For example, look at the Low-fragmentation Heap on Windows or the SBA (small block allocator) on HP-UX. I mean that frequently allocating and deallocating memory for only one node at a time doesn't have to result in memory fragmentation. I tested std::map myself in one of my programs and it indeed didn't cause any memory fragmentation with these allocators.
Is my assertion about the default allocation scheme actually correct?
I have MS Visual Studio 2008 and its std::map behaves in the same way. On HP-UX I use gcc both with and without STLport, and it seems that their STL maps take the same approach to allocating memory for nodes in the std::map.
Is there anything in the std preventing a std::map implementation from putting its nodes into blocks of memory instead of allocating a new block of memory (via its allocator) for each node?
Start by tuning the default allocator on your platform, if that is possible. It is useful here to quote Doug Lea, the author of dlmalloc:
... first I wrote a number of special-purpose allocators in C++, normally by overloading operator new for various classes. ... However, I soon realized that building a special allocator for each new class that tended to be dynamically allocated and heavily used was not a good strategy when building kinds of general-purpose programming support classes I was writing at the time. (From 1986 to 1991, I was the primary author of libg++, the GNU C++ library.) A broader solution was needed -- to write an allocator that was good enough under normal C++ and C loads so that programmers would not be tempted to write special-purpose allocators except under very special conditions.
As a somewhat more involved idea, you can also try testing your application with the Hoard allocator. I mean just test your application and see whether there is any benefit in performance or fragmentation.

Related

What are the differences between Block, Stack and Scratch Allocators?

In his talk "Solving the Right Problems for Engine Developers", Mike Acton says that
the vast majority of the time, all you're going to need are these three types of allocator: there's the block allocator, the stack allocator and the scratch allocator
However, he doesn't go into detail about what the differences between these types of allocator are.
I would presume a 'stack allocator' is just a stack-based allocator, but all the other types I've heard of (including 'arena') just sound like fancy ways of doing the same thing, that is 'allocate a big block and chunk it up in a nice efficient way, then free it when you're done'
So, what are the differences between these allocators, what are the advantages of each, why do I only need these three 'the vast majority of the time'?
As was pointed out in the comments, the terminology used in the talk is not well established around the industry, so there is some doubt left as to what exact allocation strategies are being referred to here. Taking into account what is commonly mentioned in game programming literature, here is my educated guess what is behind the three mentioned allocators:
Block Allocator
Also known as a pool allocator. This is an allocator that only hands out fixed-size blocks of memory, regardless of how much memory the user actually requested.
Let's say you have a block allocator with a block size of 100 bytes. You want to allocate memory for a single 64 bit integer? It gives you a block of 100 bytes. You want to allocate memory for an array of 20 single precision floats? It gives you a block of 100 bytes. You want to allocate memory for an ASCII string with 101 characters? It gives you an error, as it can't fit your string into 100 bytes.
Block allocators have several advantages. They are relatively easy to implement and they don't suffer from external memory fragmentation. They also usually exhibit a very predictable runtime behavior, which is often essential for video games. They are well suited for problems where most allocations are of roughly the same size and obviously less well suited for when that is not the case.
Apart from the simplest version described here, where each allocator supports only a single block size, extensions exist that are more flexible, supporting multiple block sizes, without compromising too heavily on the aforementioned advantages.
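Below is a minimal sketch of such a single-size block allocator, with the freelist threaded through the free blocks themselves. Everything here (the class name, the fixed capacity, the lack of alignment handling beyond what malloc provides) is illustrative, not a production design.

```cpp
#include <cstddef>
#include <cstdlib>

// Minimal fixed-size block (pool) allocator. Free blocks are threaded
// through an intrusive freelist, so bookkeeping costs no extra memory.
// Assumes block_size is a multiple of the required alignment.
class BlockAllocator {
    struct Node { Node* next; };
public:
    BlockAllocator(std::size_t block_size, std::size_t block_count)
        : block_size_(block_size < sizeof(Node) ? sizeof(Node) : block_size) {
        buffer_ = static_cast<char*>(std::malloc(block_size_ * block_count));
        for (std::size_t i = 0; i < block_count; ++i) {   // thread the freelist
            Node* n = reinterpret_cast<Node*>(buffer_ + i * block_size_);
            n->next = free_;
            free_ = n;
        }
    }
    ~BlockAllocator() { std::free(buffer_); }

    void* allocate(std::size_t size) {
        if (size > block_size_ || !free_) return nullptr; // too big, or pool empty
        Node* n = free_;
        free_ = n->next;
        return n;
    }
    void deallocate(void* p) {
        Node* n = static_cast<Node*>(p);                  // O(1) push onto freelist
        n->next = free_;
        free_ = n;
    }

private:
    char*       buffer_ = nullptr;
    Node*       free_   = nullptr;
    std::size_t block_size_;
};
```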
Stack Allocator
A stack allocator works like a stack: You can only deallocate in the inverse order of allocation. If you subsequently allocate objects A and then B, you cannot reclaim the memory for A without also giving up B.
Stack allocators are very easy to implement, as you only need to keep track of a single pointer that marks the separation between the used and unused regions of memory. Allocation moves that pointer into one direction and deallocation moves it the opposite way.
Stack allocators make optimally efficient use of memory and have fully predictable runtime behavior. They obviously work well only for problems where the required order of deallocations is easy to achieve. It is usually not trivial to enforce the correct deallocation order statically, so debugging them can be a pain if they are being used carelessly.
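A minimal sketch of the idea, marker-based so a caller can roll back a batch of allocations in LIFO order (the names and the power-of-two alignment assumption are mine):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Minimal stack allocator: one bump pointer separates used from unused
// memory. rewind() releases everything allocated after the marker,
// which enforces LIFO order by construction.
class StackAllocator {
public:
    using Marker = char*;

    explicit StackAllocator(std::size_t capacity)
        : base_(static_cast<char*>(std::malloc(capacity))),
          top_(base_), end_(base_ + capacity) {}
    ~StackAllocator() { std::free(base_); }

    void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        char* p = align_up(top_, align);   // align must be a power of two
        if (p + size > end_) return nullptr;
        top_ = p + size;
        return p;
    }

    Marker mark() const { return top_; }   // remember the current top
    void rewind(Marker m) { top_ = m; }    // free everything allocated after m

private:
    static char* align_up(char* p, std::size_t a) {
        std::uintptr_t v = reinterpret_cast<std::uintptr_t>(p);
        return reinterpret_cast<char*>((v + a - 1) & ~(std::uintptr_t(a) - 1));
    }
    char *base_, *top_, *end_;
};
```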
Scratch Allocator
Also known as a monotonic allocator. A scratch allocator works similarly to a stack allocator. Allocation works exactly the same. Deallocation is a no-op. That is, once memory has been allocated it cannot be reclaimed.
If you want to get the memory back, you have to destroy the entire scratch allocator, thereby releasing all of its memory at once.
The advantages of the scratch allocator are the same as with the stack allocator. They are well suited for problems where you can naturally identify points at which all allocated objects are no longer needed. Similar to the stack allocator, when used carelessly they can lead to nasty runtime errors if the allocator is destroyed while objects allocated from it are still alive.
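A minimal sketch, deliberately close to the stack allocator above so the one difference stands out: there is no per-allocation deallocation, only a wholesale reset. (C++17 ships essentially this strategy as std::pmr::monotonic_buffer_resource.)

```cpp
#include <cstddef>
#include <cstdlib>

// Minimal scratch (monotonic) allocator: allocation bumps a pointer,
// individual deallocation is a no-op, and all memory is released at once
// when the allocator is reset or destroyed.
class ScratchAllocator {
public:
    explicit ScratchAllocator(std::size_t capacity)
        : base_(static_cast<char*>(std::malloc(capacity))),
          top_(base_), end_(base_ + capacity) {}
    ~ScratchAllocator() { std::free(base_); }  // releases everything at once

    void* allocate(std::size_t size) {
        if (top_ + size > end_) return nullptr;
        void* p = top_;
        top_ += size;
        return p;
    }
    void deallocate(void*) {}                  // intentionally a no-op
    void reset() { top_ = base_; }             // reclaim everything in O(1)

private:
    char *base_, *top_, *end_;
};
```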
Why do I only need those three?
Experience shows that in a lot of domains, fully dynamic memory management is not required. Allocations can be grouped either by common size (block allocator) or by common lifetime (scratch and stack allocators). If an engineer working in such a domain is willing to go through the trouble of classifying each allocation accordingly, they can probably make do with just these three allocation strategies for the majority of their dynamic memory needs, without introducing unreasonable additional development effort. As a reward for their efforts, they will benefit from the nice runtime properties of these algorithms, in particular very fast and predictable execution times and predictable memory consumption.
If you are in a domain where it is harder to classify allocations along those terms; or if you can not or are unwilling to spend the additional engineering effort; or if you are dealing with a special use case that doesn't map well to those three allocators - you will probably still want to use a general purpose allocator, i.e. good old malloc.
The point that was being made in the talk is more that if you do need to worry about custom memory allocation - and especially in the domain of video games with its specific requirements and trade offs - those three types of allocators are very good answers to the specific problems that you may otherwise encounter when naïvely relying on the general purpose allocator alone.
I gave a long talk about allocators in C++ a while back where I explain all this in more detail if you still want to know more.
Allocator
An allocator in C++ defines a set of functions for allocating and deallocating memory dynamically. Containers such as vector, list, and deque use an allocator internally; the std::allocator class template is the default allocator for most containers.
Several kinds of allocators are listed below:
std::allocator
std::pmr::polymorphic_allocator
pool allocator/block allocator
stack allocator
scratch allocator
Block Allocator
A block allocator is an allocator that manages a pool of memory and hands out blocks from that pool. Pool allocators are useful for applications where memory allocation and deallocation are performance-critical and where the size of the memory blocks being allocated is known ahead of time.
Stack Allocator
A stack allocator works like the stack data structure, which is LIFO: last in, first out. Data is stored one item after another at the top of the stack, and when data needs to be removed, it is removed from the top rather than from an arbitrary position.
The allocator follows the same discipline. Suppose an object x is allocated and then another object y; y sits on top of x, and when something needs to be freed, y must be freed first, before x can be.
Scratch Allocator
A scratch allocator is a type of memory allocator that can be used to manage temporary memory that is needed for the duration of a specific function or task.

Does std::unordered_map::erase actually perform dynamic deallocation?

It isn't difficult to find information on the big-O time behavior of stl container operations. However, we operate in a hard real-time environment, and I'm having a lot more trouble finding information on their heap memory usage behavior.
In particular I had a developer come to me asking about std::unordered_map. We're allowed to be non-realtime at startup, so he was hoping to perform a .reserve() at startup time. However, he's finding he gets overruns at runtime. The operations he uses are lookups, insertions, and deletions with .erase().
I'm a little worried about whether that .reserve() actually prevents later runtime memory allocations (I don't really understand the explanation of what it does with respect to heap usage), but for .erase() in particular I don't see any guarantee whatsoever that it won't be asking the heap for a dynamic deallocation when called.
So the question is what's the specified heap interactions (if any) for std::unordered_map::erase, and if it actually does deallocations, if there's some kind of trick that can be used to avoid them?
The standard doesn't specify container allocation patterns per se. These are effectively derived from the iterator/reference invalidation rules. For example, vector::insert only invalidates all references if the number of elements inserted causes the size of the container to exceed its capacity, which means a reallocation happened.
By contrast, the only operations on unordered_map which invalidate references are those which actually remove that particular element. Even a rehash (which likely allocates memory) does not invalidate references (this is why reserve changes nothing).
This means that each element must be stored separately from the hash table itself. They are individual nodes (which is why it has a node_type extraction interface), and must be able to be allocated and deallocated individually.
So it is reasonable to assume that each insertion or erasure represents at least one allocation/deallocation.
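One consequence worth knowing: since C++17 the node-based design is exposed directly through node handles, which let you unlink and relink nodes with no heap traffic in between. A small illustration (the container contents are made up):

```cpp
#include <string>
#include <unordered_map>

// C++17 node handles: detach a node from the table and re-insert it
// later without touching the heap in between.
int main() {
    std::unordered_map<int, std::string> m;
    m.emplace(1, "one");
    m.emplace(2, "two");

    auto nh = m.extract(1);   // unlinks the node; no deallocation happens
    nh.key() = 3;             // the node (and its string) are reused
    m.insert(std::move(nh));  // re-links the same node; no allocation
}
```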
If you're all right with nodes continuing to consume memory, even after they've been removed from the container, you could pretty easily write an Allocator class that basically made deallocation a NOP.
Quite a few real-time systems basically allocate all the memory they're going to use up-front, then once they've finished initialization they neither allocate nor release memory. This would allow you to do pretty much the same thing with an unordered_map.
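As a rough sketch of that suggestion, here is a minimal allocator whose deallocate() does nothing, so erase() never touches the heap. The class is hypothetical; in a real pre-allocate-everything system, allocate() would carve from a fixed arena rather than delegating to std::allocator as it does here for brevity.

```cpp
#include <cstddef>
#include <memory>
#include <unordered_map>

// Allocator whose deallocate() is a NOP: erased nodes keep their memory.
// Suitable only when you accept that memory is never returned (e.g. all
// allocation happens at startup in a real-time system).
template <class T>
struct NoFreeAlloc {
    using value_type = T;
    NoFreeAlloc() = default;
    template <class U> NoFreeAlloc(const NoFreeAlloc<U>&) {}

    T* allocate(std::size_t n) { return std::allocator<T>{}.allocate(n); }
    void deallocate(T*, std::size_t) {} // intentionally does nothing
};
template <class T, class U>
bool operator==(const NoFreeAlloc<T>&, const NoFreeAlloc<U>&) { return true; }
template <class T, class U>
bool operator!=(const NoFreeAlloc<T>&, const NoFreeAlloc<U>&) { return false; }

using Map = std::unordered_map<int, int, std::hash<int>, std::equal_to<int>,
                               NoFreeAlloc<std::pair<const int, int>>>;
```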
That said, I'm somewhat skeptical about the benefit in this case. The main strength of unordered_map is supporting insertion and deletion that are usually fast. If you're not going to be doing insertion at runtime, chances are pretty good that it's not a particularly great choice.
If it's a collection that's mostly filled during initialization, then used mostly as-is, with a few items being "removed", but no more being inserted after you finish initialization, you're likely to be better off with a simple sorted array and an interpolating search (or, if the data is distributed extremely unpredictably, maybe a binary search--but an interpolating search is usually better). In this case, I'd handle removal by simply adding a boolean to each item saying whether that item is valid or not. Erase by setting that value to false. If you find such a value during a search, you basically just ignore it.
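A sketch of that erase-by-flag idea, using binary search for brevity (an interpolating search would follow the same pattern; the Entry layout is invented for illustration):

```cpp
#include <algorithm>
#include <vector>

// Sorted array with tombstones: "erase" just flags an entry invalid,
// so no memory is ever freed at runtime.
struct Entry {
    int  key;
    int  value;
    bool valid;
};

const Entry* find(const std::vector<Entry>& v, int key) {
    auto it = std::lower_bound(v.begin(), v.end(), key,
        [](const Entry& e, int k) { return e.key < k; });
    if (it == v.end() || it->key != key || !it->valid)
        return nullptr;  // missing, or present but tombstoned
    return &*it;
}

void erase(std::vector<Entry>& v, int key) {
    auto it = std::lower_bound(v.begin(), v.end(), key,
        [](const Entry& e, int k) { return e.key < k; });
    if (it != v.end() && it->key == key)
        it->valid = false;  // no deallocation, no iterator invalidation
}
```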

Preallocate memory for dynamic data structure

I have a question/curiosity.
Let's say I want to implement a list; for example, I could basically use the Cormen book approach, where it is explained how to implement insert, delete, key search, etc.
However, nothing is said about memory use. For example, if I would like to insert an integer into a list of integers, I could first create a node (allocating memory there), store the integer, and then insert the node into the list. If I would like to delete an integer, once I know in which node it is stored, I have to free that memory.
I was now wondering whether it would instead be more convenient to preallocate memory to store, say, 10 nodes, keeping a pointer to a free node to be used. If the memory pool is full, I reallocate memory for 20 nodes; if the pool becomes largely unused, I halve its size (and so on and so forth). The pool is of course more complicated to manage, since I'd need, for example, to handle possible memory fragmentation, etc.
Does what I'm saying make any sense? Or is it no sense? I've read in a book, for game programming, that memory preallocation could improve performance, but I was wondering how.
This is both a simple and a complex question. If you operate within standard problems, you don't really need to worry about memory allocation. For example, preallocating memory for 10 nodes won't be noticeably more efficient at any scale, and your performance problems are likely to be elsewhere. However, if your program constantly allocates and deallocates hundreds or thousands of small objects per second, it can lead to memory fragmentation, and you might need to write your own custom allocator.
Almost none of the standard containers have methods to preallocate element storage; the main exception is the std::vector::reserve function. All of them, however, let you pass a custom allocator to their constructors. There is also the placement new operator.
You could try to experiment with such things, they're fun to write, just don't use them in production if you absolutely don't have to.
I was now wondering if instead it would be more convenient to preallocate memory to store, say, 10 nodes and keeping a pointer to a free node to be used.
You basically are describing what a pool allocator usually does (I assume you are talking about nodes of constant size). So, the short answer to your question is: yes you would improve performance by using a pool allocator with a list container.
Memory allocators shipped with common compilers are quite good for general-purpose allocation (i.e. for allocating objects of random size). However, when your need is to allocate objects of constant size, you should consider using a custom pool allocator; it's easy to see why a constant-size allocator can be faster than the general-purpose one.
You might write your own pool allocator, but it's not an easy task, and you would be better off using an existing one, such as boost pool_allocator or fast_pool_allocator.
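For illustration, plugging Boost's pool allocator into a list is roughly a one-line change (a sketch assuming Boost is installed; the header and template names are Boost.Pool's actual interface):

```cpp
#include <list>
#include <boost/pool/pool_alloc.hpp>

int main() {
    // Node-sized allocations come from a shared pool instead of
    // going to the general-purpose heap one node at a time.
    std::list<int, boost::fast_pool_allocator<int>> l;
    for (int i = 0; i < 1000; ++i)
        l.push_back(i);
    l.clear(); // nodes go back to the pool, not to the OS
}
```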

Preventing memory freeing in STL Container

I have an STL container (std::list) that I am constantly reusing. By this I mean I
push a number of elements into the container
remove the elements during processing
clear the container
rinse and repeat a large number of times
When profiling using callgrind I am seeing a large number of calls to new (malloc) and delete (free), which can be quite expensive. I am therefore looking for some way to preallocate a reasonably large number of elements. I would also like my allocation pool to continue to increase until a high-water mark is reached, and for the allocation pool to hang onto the memory until the container itself is deleted.
Unfortunately the standard allocator continually resizes the memory pool so I am looking for some allocator that will do the above without me having to write my own.
Does such an allocator exist and where can I find such an allocator?
I am working on both Linux using GCC and Android using the STLPort.
Edit: Placement new is ok; what I want to minimize is heap walking, which is expensive. I would also like all my objects to be as close to each other as possible to minimize cache misses.
It sounds like you may just be using the wrong kind of container: with a list, each element occupies a separate chunk of memory, to allow individual inserts/deletes - so every addition/deletion from the list will require a separate new()/delete().
If you can use a std::vector instead, then you can reserve the required size before adding the items.
Also for deletion, it's usually best not to remove the items individually. Just call clear() on the container to empty it.
Edit: You've now made it clear in the comments that your 'remove the elements during processing' step is removing elements from the middle of the list and must not invalidate iterators, so switching to a vector is not suitable. I'll leave this answer for now (for the sake of the comment thread!)
The allocator boost::fast_pool_allocator is designed for use with std::list.
The documentation claims that "If you are seriously concerned about performance, use boost::fast_pool_allocator when dealing with containers such as std::list, and use boost::pool_allocator when dealing with containers such as std::vector."
Note that boost::fast_pool_allocator is a singleton and by default it never frees allocated memory. However, it is implemented using boost::singleton_pool and you can make it free memory by calling the static functions boost::singleton_pool::release_memory() and boost::singleton_pool::purge_memory().
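A hedged sketch of what that cleanup call might look like. The tricky part is that the pool is keyed by the requested block size, which for a std::list is the implementation's internal node size, not sizeof(T), so you would have to determine that size for your own standard library:

```cpp
#include <boost/pool/pool_alloc.hpp>
#include <boost/pool/singleton_pool.hpp>

// fast_pool_allocator's storage is a singleton_pool keyed by a tag plus
// the requested size. NodeSize must match what the container actually
// allocated (the list's internal node size, not sizeof(T)).
template <unsigned NodeSize>
void drop_pool_memory() {
    using pool = boost::singleton_pool<boost::fast_pool_allocator_tag, NodeSize>;
    pool::release_memory();  // frees blocks that are entirely unused
    // pool::purge_memory(); // frees everything, invalidating live nodes
}
```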
You can try and benchmark your app with http://goog-perftools.sourceforge.net/doc/tcmalloc.html, I've seen some good improvements in some of my projects (no numbers at hand though, sorry)
EDIT: It seems the code/download has been moved here: http://code.google.com/p/gperftools/?redir=1
Comment was too short so I will post my thoughts as an answer.
IMO, new/delete can come only from two places in this context.
I believe std::list<T> is implemented with some kind of nodes as normally lists are, for various reasons. Therefore, each insertion and removal of an element will have to result in new/delete of a node. Moreover, if the object of type T has any allocations and deallocations in c'tor/d'tor, they will be called as well.
You can avoid recreation of standard stuff by reiterating over existing nodes instead of deleting them. You can use std::vector and std::vector::reserve or std::array if you want to squeeze it to c-level.
Nonetheless, for every object created a destructor must eventually be called. The only way I see to avoid repeated creation and destruction is to use T::operator= when reiterating over the container, or maybe some C++11 move semantics if that's suitable in your case.
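A sketch of that reuse pattern with a vector (the names and the batch framing are invented for illustration): the container keeps its capacity and its already-constructed elements across iterations, and assignment overwrites them in place.

```cpp
#include <vector>

struct Item { int payload; };

// Reuses the vector's storage (and constructed elements) across calls:
// no per-iteration new/delete once the high-water mark is reached.
void process_batch(std::vector<Item>& scratch, int count) {
    if (static_cast<int>(scratch.size()) < count)
        scratch.resize(count);        // grows rarely, never shrinks
    for (int i = 0; i < count; ++i)
        scratch[i] = Item{i};         // operator= reuses existing storage
    // ... process scratch[0..count) ...
}
```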

Designing and coding a non-fragmentizing static memory pool

I have heard the term before and I would like to know how to design and code one.
Should I use the STL allocator if available?
How can it be done on devices with no OS?
What are the tradeoffs between using it and using the regular compiler implemented malloc/new?
I would suggest that you should know that you need a non-fragmenting memory allocator before you put much effort into writing your own. The one provided by the std library is usually sufficient.
If you need one, the general idea for reducing fragmentation is to grab large blocks of memory at once and allocate from that pool, rather than asking the OS for heap memory sporadically, at highly varying places within the heap, interspersed with many other objects of varying sizes. Since the author of the specialized memory allocator knows more about the sizes of the objects allocated from the pool and how those allocations occur, the allocator can use the memory more efficiently than a general-purpose allocator such as the one provided by the STL.
You can look at memory allocators such as Hoard which while reducing memory fragmentation, also can increase performance by providing thread specific heaps which reduce contention. This can help your application scale more linearly, especially on multi-core platforms.
More info on multi-threaded allocators can be found here.
Will try to describe what is essentially a memory pool - I'm just typing this off the top of my head, been a while since I've implemented one, if something is obviously stupid, it's just a suggestion! :)
1.
To reduce fragmentation, you need to create a memory pool that is specific to the type of object you are allocating in it. Essentially, you then restrict the size of each allocation to the size of the object you are interested in. You could implement a templated class which has a list of dynamically allocated blocks (the reason for the list being that you can grow the amount of space available). Each dynamically allocated block would essentially be an array of T.
You would then have a "free" list, which is a singly linked list where the head points to the next available block. Allocation is then simply returning the head. You could overlay the linked list in the block itself, i.e. each "block" (which represents the aligned size of T) would essentially be a union of T and a node in the linked list: when allocated, it's a T; when freed, a node in the list. !!There are obvious dangers!! Alternatively, you could allocate a separate (and protected) block, which adds more overhead, to hold an array of addresses into the block.
Allocating is trivial: iterate through the list of blocks and allocate from the first available. Freeing is also trivial; the additional check you have to do is to find the block from which the object was allocated and then update the head pointer. (Note, you'll need to use either placement new or override operator new/delete in T - there are ways around this, google is your friend.)
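A minimal sketch of that union-overlay scheme (all names are invented, and the "obvious dangers" - strict-aliasing and alignment care among them - are glossed over here):

```cpp
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Templated pool: each slot is a union of storage-for-T and a freelist
// node, so free slots store the "next" pointer inside themselves.
template <class T>
class Pool {
    union Slot {
        Slot* next;                                    // when free
        alignas(T) unsigned char storage[sizeof(T)];   // when allocated
    };
public:
    ~Pool() { for (Slot* b : blocks_) delete[] b; }

    template <class... Args>
    T* create(Args&&... args) {
        if (!free_) grow();                            // add another block
        Slot* s = free_;
        free_ = s->next;
        return new (s->storage) T(std::forward<Args>(args)...); // placement new
    }
    void destroy(T* p) {
        p->~T();
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = free_;                               // slot rejoins the freelist
        free_ = s;
    }

private:
    void grow() {
        Slot* block = new Slot[kBlockSlots];
        blocks_.push_back(block);
        for (std::size_t i = 0; i < kBlockSlots; ++i) {
            block[i].next = free_;
            free_ = &block[i];
        }
    }
    static const std::size_t kBlockSlots = 64;
    std::vector<Slot*> blocks_;                        // grows the pool
    Slot* free_ = nullptr;
};
```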
The "static" I believe implies a singleton memory pool for all objects of type T. The downside is that for each T you have to have a separate memory pool. You could be smart, and have a single object that manages pools of different size (using an array of pointers to pool objects where the index is the size of the object for example).
The whole point of the previous paragraph is to outline exactly how complex this is, and like RC says above, be sure you need it before you do it - as it is likely to introduce more pain than may be necessary!
2.
If the STL allocator meets your needs, use it, it's designed by some very smart people who know what they are doing - however it is for the generic case and if you know how your objects are allocated, you could make the above perform faster.
3.
You need to be able to allocate memory somehow (hardware support or some sort of HAL - whatever) - else I'm not sure how your program would work?
4.
The regular malloc/new does a lot more stuff under the covers (google is your friend, my answer is already an essay!) The simple allocator I describe above isn't re-entrant, of course you could wrap it with a mutex to provide a bit of cover, and even then, I would hazard that the simple allocator would perform orders of magnitude faster than normal malloc/free.
But if you're at this stage of optimization - presumably you've exhausted the possibility of optimizing your algorithms and data structure usage?