std::map standard allocator performance versus block allocator - c++

I've read in a C++ optimization cookbook that the standard allocator for STL containers such as std::list, std::set, std::multiset, std::map, and std::multimap can be replaced by a more performant block allocator.
A block allocator has higher performance, low fragmentation and efficient data caching.
I've found on the web the FSBAllocator, which claims to be faster than the standard allocator.
http://warp.povusers.org/FSBAllocator/
I've tried it with std::map and it does indeed seem to be faster, but my question is: how can the STL implementation be so much slower than a specific allocator, and what are the drawbacks of an allocator other than the standard one, in terms of both portability and robustness? My code must compile on a variety of architectures (win32, osx, linux).
Does anyone have experience with that kind of fixed-size block allocator?

A block allocator makes one large allocation from the free store/heap, then internally splits this memory into chunks. One drawback here is that it allocates this chunk (which needs to be large, and is often user-specified on a per-use-case basis) up front, so even if you don't use all of it, that memory is tied up. Secondly, the standard memory allocator is built on top of new / delete, which in turn is often built on top of malloc / free. Although I don't recall whether malloc / free is guaranteed to be thread safe under all circumstances, it usually is.
But lastly, the reason block allocators work so well is that they have information that is not available to the standard allocators, and they don't need to cover a very wide set of use cases. For example, if you did std::map< int, int >() and it allocated 1 MB you'd probably be annoyed, but if you do std::map< int, int, std::less< int >, block_alloc< 1024 * 1024 > >() you'd be expecting it. The standard allocators don't allocate in blocks; they request new memory via new, and new in turn has no context at all. It gets a memory request of an arbitrary size and needs to find a contiguous run of bytes to return. What most implementations do is maintain a set of memory areas of different multiples (for example, a region for 4-byte requests is more or less guaranteed to be present, as requests for 4 bytes are very common). If a request isn't an even multiple, it gets harder to return a good chunk without wasting space and causing fragmentation. Basically, memory management of arbitrary sizes is very, very hard to do if you want it to be close to constant time, with low fragmentation, thread safety, etc.
The Boost pool allocator documentation has some good info on how a good block allocator can work.

Related

What are the differences between Block, Stack and Scratch Allocators?

In his talk "Solving the Right Problems for Engine Developers", Mike Acton says that
the vast majority of the time, all you're going to need are these three types of allocator: there's the block allocator, the stack allocator and the scratch allocator
However, he doesn't go into detail about what the differences between these types of allocator are.
I would presume a 'stack allocator' is just a stack-based allocator, but all the other types I've heard of (including 'arena') just sound like fancy ways of doing the same thing, that is, 'allocate a big block and chunk it up in a nice efficient way, then free it when you're done'.
So, what are the differences between these allocators, what are the advantages of each, why do I only need these three 'the vast majority of the time'?
As was pointed out in the comments, the terminology used in the talk is not well established around the industry, so there is some doubt left as to what exact allocation strategies are being referred to here. Taking into account what is commonly mentioned in game programming literature, here is my educated guess what is behind the three mentioned allocators:
Block Allocator
Also known as a pool allocator. This is an allocator that only hands out fixed-sized blocks of memory, regardless of how much memory the user actually requested.
Let's say you have a block allocator with a block size of 100 bytes. You want to allocate memory for a single 64 bit integer? It gives you a block of 100 bytes. You want to allocate memory for an array of 20 single precision floats? It gives you a block of 100 bytes. You want to allocate memory for an ASCII string with 101 characters? It gives you an error, as it can't fit your string into 100 bytes.
Block allocators have several advantages. They are relatively easy to implement and they don't suffer from external memory fragmentation. They also usually exhibit a very predictable runtime behavior, which is often essential for video games. They are well suited for problems where most allocations are of roughly the same size and obviously less well suited for when that is not the case.
Apart from the simplest version described here, where each allocator supports only a single block size, extensions exist that are more flexible, supporting multiple block sizes, without compromising too heavily on the aforementioned advantages.
Stack Allocator
A stack allocator works like a stack: You can only deallocate in the inverse order of allocation. If you subsequently allocate objects A and then B, you cannot reclaim the memory for A without also giving up B.
Stack allocators are very easy to implement, as you only need to keep track of a single pointer that marks the separation between the used and unused regions of memory. Allocation moves that pointer into one direction and deallocation moves it the opposite way.
Stack allocators make optimally efficient use of memory and have fully predictable runtime behavior. They obviously work well only for problems where the required order of deallocations is easy to achieve. It is usually not trivial to enforce the correct deallocation order statically, so debugging them can be a pain if they are being used carelessly.
Scratch Allocator
Also known as a monotonic allocator. A scratch allocator works similarly to a stack allocator: allocation works exactly the same, but deallocation is a no-op. That is, once memory has been allocated, it cannot be individually reclaimed.
If you want to get the memory back, you have to destroy the entire scratch allocator, thereby releasing all of its memory at once.
The advantages of the scratch allocator are the same as with the stack allocator. They are well suited for problems where you can naturally identify points at which all allocated objects are no longer needed. Similar to the stack allocator, when used carelessly, they can lead to nasty runtime errors if an allocator is destroyed while there are still active objects alive.
Why do I only need those three?
Experience shows that in a lot of domains, fully dynamic memory management is not required. Allocations can be grouped either by common size (block allocator) or by common lifetime (scratch and stack allocator). If an engineer working in such a domain is willing to go through the trouble of classifying each allocation accordingly, they can probably make do with just these three allocation strategies for the majority of their dynamic memory needs, without introducing unreasonable additional development effort. As a reward for their efforts, they will benefit from the nice runtime properties of these algorithms, in particular very fast and predictable execution times, and predictable memory consumption.
If you are in a domain where it is harder to classify allocations along those terms; or if you can not or are unwilling to spend the additional engineering effort; or if you are dealing with a special use case that doesn't map well to those three allocators - you will probably still want to use a general purpose allocator, i.e. good old malloc.
The point that was being made in the talk is more that if you do need to worry about custom memory allocation - and especially in the domain of video games with its specific requirements and trade offs - those three types of allocators are very good answers to the specific problems that you may otherwise encounter when naïvely relying on the general purpose allocator alone.
I gave a long talk about allocators in C++ a while back where I explain all this in more detail if you still want to know more.
Allocator
An allocator in C++ defines a set of functions for allocating and deallocating memory dynamically. Containers such as std::vector, std::list and std::deque use an allocator to obtain their storage. The std::allocator class template is used as the default allocator for most containers.
Several allocators are given below:
std::allocator
std::pmr::polymorphic_allocator
pool allocator/block allocator
stack allocator
scratch allocator
Block Allocator
A block allocator is an allocator that manages a pool of memory and allocates fixed-size blocks from that pool. Pool allocators are useful for applications where memory allocation and deallocation are performance-critical and where the size of the memory blocks being allocated is known ahead of time.
Stack Allocator
A stack allocator works just like the stack data structure, which follows LIFO (last in, first out) order: data is pushed onto the top of the stack one item after another, and when data needs to be removed, it is removed from the top rather than from anywhere else.
A stack allocator hands out memory in the same way. Suppose an object x is allocated and then another object y is needed; y is placed immediately after x. Similarly, when memory needs to be reclaimed, y must be released before x.
Scratch Allocator
A scratch allocator is a type of memory allocator that can be used to manage temporary memory that is needed for the duration of a specific function or task.

Adapting a fixed-sized chunk pool allocator to certain STL containers

I have a pool allocator, written as an exercise, that implements the C++11 std::allocator requirements and works OK, but the policy I used as a reference (based on the following paper):
https://pdfs.semanticscholar.org/4321/a91d635d023ab25a743c698be219edcdb1a3.pdf
is only really good for allocating a single object at a time into one block of memory of sufficient size.
I notice that the std::allocator member function allocate has a parameter through which STL containers can request a count of objects to allocate as one contiguous block. For example, once std::basic_string exceeds the small string buffer it keeps on the stack, it moves the whole thing to the heap at once by requesting from the allocator a contiguous block of memory large enough to hold a char array with the entire string. std::vector's dynamic expansion seems to work in a similar way.
Is there any way to adapt an allocator designed to return fixed-size chunks the size of the type it's templated on to this type of STL container?
You could go down this route:
On the other hand, multiple instances of numerous fixed-sized
pools can be used to produce a general overall flexible
general solution to work in place of the current system
memory manager.
And treat each different-sized request as a request for a new pool, i.e. your "object size" is actually object*count.
You will burn a lot of RAM.
You could put an upper bound on the array size, and fall back to default generic allocation above that.

How can heap allocation hurt hardware cache hit ratio?

I have done some tests to investigate the relation between heap allocations and hardware cache behaviour. Empirical results are enlightening but also likely to be misleading, especially between different platforms and complex/indeterministic use cases.
There are two scenarios I am interested in: bulk allocation (to implement a custom memory pool) and consecutive individual allocations (trusting the OS).
Below are two example allocation tests in C++
//Consecutive allocations
for(auto i = 1000000000; i > 0; i--)
{
    int *ptr = new int(0);
    store_ptr_in_some_container(ptr);
}
//////////////////////////////////////
//Bulk allocation
int *ptr = new int[1000000000];
distribute_indices_to_owners(ptr, 1000000000);
My questions are these:
When I iterate over all of them for a read-only operation, how is the CPU cache likely to partition itself?
Despite the empirical results (a visible performance boost with the bulk solution), what happens when some other, relatively very small bulk allocation evicts the cached data from previous allocations?
Is it reasonable to mix the two in order to avoid code bloat and maintain code readability?
Where does std::vector, std::list, std::map, std::set stand in these concepts?
A general purpose heap allocator has a difficult set of problems to solve. It needs to ensure that released memory can be recycled, must support arbitrarily sized allocations and strongly avoid heap fragmentation.
This will always include extra overhead for each allocation, book-keeping that the allocator needs. At a minimum it must store the size of the block so it can properly reclaim it when the allocation is released. And almost always an offset or pointer to the next block in a heap segment, allocation sizes are typically larger than requested to avoid fragmentation problems.
This overhead of course affects cache efficiency, you can't help getting it into the L1 cache when the element is small, even though you never use it. You have zero overhead for each array element when you allocate the array in one big gulp. And you have a hard guarantee that each element is adjacent in memory so iterating the array sequentially is going to be as fast as the memory sub-system can support.
Not the case for the general purpose allocator: with such very small allocations the overhead is likely to be 100 to 200%. And there is no guarantee of sequential access either once the program has been running for a while and array elements have been reallocated. That reallocation is notably an operation your big array cannot support, so be careful not to automatically assume that allocating giant arrays that cannot be released for a long time is necessarily better.
So yes, in this artificial scenario it is very likely you'll be ahead with the big array.
Scratch std::list from that quoted list of collection classes, it has very poor cache efficiency as the next element is typically at an entirely random place in memory. std::vector is best, just an array under the hood. std::map is usually done with a red-black tree, as good as can reasonably be done but the access pattern you use matters of course. Same for std::set.

Linux Memory Usage in top when using std::vector versus std::list

I have noticed some interesting behavior in Linux with regard to the Memory Usage (RES) reported by top. I have attached the following program which allocates a couple million objects on the heap, each of which has a buffer that is around 1 kilobyte. The pointers to those objects are tracked by either a std::list, or a std::vector. The interesting behavior I have noticed is that if I use a std::list, the Memory Usage reported by top never changes during the sleep periods. However if I use std::vector, the memory usage will drop to near 0 during those sleeps.
My test configuration is:
Fedora Core 16
Kernel 3.6.7-4
g++ version 4.6.3
What I already know:
1. std::vector will re-allocate (doubling its size) as needed.
2. std::list (I believe) is allocating its elements 1 at a time
3. both std::vector and std::list are using std::allocator by default to get their actual memory
4. The program is not leaking; valgrind has declared that no leaks are possible.
What I'm confused by:
1. Both std::vector and std::list are using std::allocator. Even if std::vector is doing batch re-allocations, wouldn't std::allocator be handing out memory in almost the same arrangement to std::list and std::vector? This program is single threaded after all.
2. Where can I learn about the behavior of Linux's memory allocation. I have heard statements about Linux keeping RAM assigned to a process even after it frees it, but I don't know if that behavior is guaranteed. Why does using std::vector impact that behavior so much?
Many thanks for reading this; I know this is a pretty fuzzy problem. The 'answer' I'm looking for here is if this behavior is 'defined' and where I can find its documentation.
#include <string.h>
#include <unistd.h>
#include <iostream>
#include <vector>
#include <list>
#include <iostream>
#include <memory>
class Foo{
public:
Foo()
{
data = new char[999];
memset(data, 'x', 999);
}
~Foo()
{
delete[] data;
}
private:
char* data;
};
int main(int argc, char** argv)
{
for(int x=0; x<10; ++x)
{
sleep(1);
//std::auto_ptr<std::list<Foo*> > foos(new std::list<Foo*>);
std::auto_ptr<std::vector<Foo*> > foos(new std::vector<Foo*>);
for(int i=0; i<2000000; ++i)
{
foos->push_back(new Foo());
}
std::cout << "Sleeping before de-alloc\n";
sleep(5);
while(false == foos->empty())
{
delete foos->back();
foos->pop_back();
}
}
std::cout << "Sleeping after final de-alloc\n";
sleep(5);
}
The freeing of memory is done on a "chunk" basis. It's quite possible that when you use a list, the memory gets fragmented into little tiny bits.
When you allocate using a vector, all elements are stored in one big chunk, so it's easy for the memory-freeing code to say "Golly, I've got a very large free region here, I'm going to release it back to the OS". It's also entirely possible that when growing the vector, the memory allocator goes into "large chunk mode", which uses a different allocation method than "small chunk mode". Say, for example, once you allocate more than 1 MB, the memory allocation code may see that as a good time to start using a different strategy and just ask the OS for a "perfect fit" piece of memory. This large block is very easy to release back to the OS when it's freed.
On the other hand, if you are adding to a list, you are constantly asking for little bits, so the allocator uses a different strategy of asking for a large block and then giving out small portions. It's both difficult and time-consuming to ensure that ALL blocks within a chunk have been freed, so the allocator may well "not bother", because chances are that there are some regions in there "still in use", and then it can't be freed at all anyway.
I would also add that using "top" as a memory measure isn't a particularly accurate method, and is very unreliable, as it very much depends on what the OS and the runtime library do. Memory belonging to a process may not be "resident" even though the process hasn't freed it; it's just not "present in actual memory" (it's out in the swap partition instead).
And to your question "is this defined somewhere": I think it is, in the sense that the C/C++ library source code defines it. But it's not defined in the sense that it's written somewhere that "this is how it's meant to work, and we promise never to change it". Libraries such as glibc and libstdc++ are not going to say that; they will change the internals of malloc, free, new and delete as new technologies and ideas are invented. Some may make things better, others may make things worse, for a given scenario.
As has been pointed out in the comments, the memory is not locked to the process. If the kernel feels that the memory is better used for something else [and the kernel is omnipotent here], then it will "steal" the memory from one running process and give it to another. Particularly memory that hasn't been "touched" for a long time.
1. Both std::vector and std::list are using std::allocator. Even if std::vector is doing batch re-allocations, wouldn't std::allocator be handing out memory in almost the same arrangement to std::list and std::vector? This program is single threaded after all.
Well, what are the differences?
std::list allocates nodes one-by-one (each node needs two pointers in addition to your Foo *). Also, it never re-allocates these nodes (this is guaranteed by the iterator invalidation requirements for list). So, the std::allocator will request a sequence of fixed-size chunks from the underlying mechanism (probably malloc which will in turn use the sbrk or mmap system calls). These fixed-size chunks may well be larger than a list node, but if so they'll all be the same default chunk size used by std::allocator.
std::vector allocates a contiguous block of pointers with no book-keeping overhead (that's all in the vector parent object). Every time a push_back would overflow the current allocation, the vector will allocate a new, larger chunk, move everything across to the new chunk, and release the old one. Now, the new chunk will be something like double (or 1.6 times, or whatever) the size of the old one, as is required to keep the amortized constant time guarantee for push_back. So, pretty quickly, I'd expect the sizes it requests to exceed any sensible default chunk size for std::allocator.
So, the interesting interactions are different: one between std::vector and the allocator's underlying mechanism, and one between std::allocator itself and that underlying mechanism.
2. Where can I learn about the behavior of Linux's memory allocation. I have heard statements about Linux keeping RAM assigned to a process even after it frees it, but I don't know if that behavior is guaranteed. Why does using std::vector impact that behavior so much?
There are several levels you might care about:
The container's own allocation pattern: which is hopefully described above
note that in real-world applications, the way a container is used is just as important
std::allocator itself, which may provide a layer of buffering for small allocations
I don't think this is required by the standard, so it's specific to your implementation
The underlying allocator, which depends on your std::allocator implementation (it could for example be malloc, however that is implemented by your libc)
The VM scheme used by the kernel, and its interactions with whatever syscall (3) ultimately uses
In your particular case, I can think of a possible explanation for the vector apparently releasing more memory than the list.
Consider that the vector ends up with a single contiguous allocation, and lots of the Foos will also be allocated contiguously. This means that when you release all this memory, it's pretty easy to figure out that most of the underlying pages are genuinely free.
Now consider that the list node allocations are interleaved 1:1 with the Foo instances. Even if the allocator did some batching, it seems likely that the heap is much more fragmented than in the std::vector case. Therefore, when you release the allocated records, some work would be required to figure out whether an underlying page is now free, and there's no particular reason to expect this will happen (unless a subsequent large allocation encouraged coalescing of heap records).
The answer is the malloc "fastbins" optimization.
std::list creates tiny allocations (less than 64 bytes), and when it frees them they are not actually released; they go into the fastbin pool instead.
This behavior means that the heap stays fragmented even AFTER the list is cleared, and therefore the memory does not return to the system.
You can either call malloc_trim(128*1024) to forcibly release them,
or use mallopt(M_MXFAST, 0) to disable fastbins altogether.
I find the first solution to be more correct, if you call it when you really don't need the memory anymore.
Smaller chunks go through brk, adjusting the data segment with constant splitting and coalescing, while bigger chunks use mmap, so the process is a little less disturbed. more info (PDF)
also ptmalloc source code.

std::map allocation node packing?

I have noticed that the std::map implementation of Visual Studio (2010) allocates a new single block of memory for each node in its red-black tree. That is, for each element in the map a single new block of raw memory will be allocated via operator new ... malloc with the default allocation scheme of the std::map of the Visual Studio STL implementation.
This appears a bit wasteful to me: Wouldn't it make more sense to allocate the nodes in blocks of "(small) n", just as std::vector implementations over-allocate on growth?
So I'd like the following points clarified:
Is my assertion about the default allocation scheme actually correct?
Do "all" STL implementations of std::map work this way?
Is there anything in the std preventing a std::map implementation from putting its nodes into blocks of memory instead of allocating a new block of memory (via its allocator) for each node? (Complexity guarantees, etc.)?
Note: This is not about premature optimization. If it's about optimization, then it's about whether, when an app has a problem with (std::)map memory fragmentation, there are alternatives to using a custom allocator that uses a memory pool. This question is not about custom allocators but about how the map implementation uses its allocator. (Or so I hope it is.)
Your assertion is correct for most implementations of std::map.
To my knowledge, there is nothing in the standard preventing a map from using an allocation scheme such as you describe. However, you can get what you describe with a custom allocator — but forcing that scheme on all maps could be wasteful. Because map has no a priori knowledge of how it will be used, certain use patterns could prevent deallocations of mostly-unused blocks. For example, say blocks were allocated for 4 nodes at a time, but a particular map is filled with 40 nodes, then 30 nodes erased, leaving a worst case of one node left per block as map cannot invalidate pointers/references/iterators to that last node.
When you insert elements into a map, it's guaranteed that existing iterators won't be invalidated. Therefore, if you insert an element "B" between two nodes A and C that happen to be contiguous and inside the same heap allocated area, you can't shuffle them to make space, and B will have to be put elsewhere. I don't see any particular problem with that, except that managing such complexities will swell the implementation. If you erase elements then iterators can't be invalidated either, which implies any memory allocation has to hang around until all the nodes therein are erased. You'd probably need a freelist within each "swollen node"/vector/whatever-you-want-to-call-it - effectively duplicating at least some of the time-consuming operations that new/delete currently do for you.
I'm quite certain I've never seen an implementation of std::map that attempted to coalesce multiple nodes into a single allocation block. At least right offhand I can't think of a reason it couldn't work, but I think most implementors would see it as unnecessary, and leave optimization of memory allocation to the allocator instead of worrying about it much in the map itself.
Admittedly, most custom allocators are written to deal better with allocation of a large number of small blocks. You could probably render the vast majority of such optimization unnecessary by writing map (and, of course, set, multiset, and multimap) to just use larger allocations instead. OTOH, given that allocators optimized for small blocks are easily and widely available, there's probably not a lot of motivation to change the map implementation this way either.
I think the only thing you cannot do is to invalidate iterators, which you might have to do if you have to reallocate your storage. Having said that, I've seen implementations using single sorted array of objects wrapped in the std::map interface. That was done for a certain reason, of course.
Actually, what you can do is just instantiate your std::map with your custom allocator, which will find memory for new nodes in a special, non-wasteful way.
This appears a bit wasteful to me. Wouldn't it make more sense to allocate the nodes in blocks of "(small) n", just as std::vector implementations over-allocate on growth
Interestingly, I see it in a completely different way. I find it appropriate, and it doesn't waste any memory. At least with the default STL allocators on Windows (MS VS 2008), HP-UX (gcc with STLport) and Linux (gcc without STLport). What is important is that these allocators do care about memory fragmentation, and it seems they can handle this issue pretty well. For example, look for the Low-fragmentation Heap on Windows or SBA (Small Block Allocator) on HP-UX. I mean that frequently allocating and deallocating memory for only one node at a time doesn't have to result in memory fragmentation. I tested std::map myself in one of my programs and it indeed didn't cause any memory fragmentation with these allocators.
Is my assertion about the default allocation scheme actually correct?
I have MS VisualStudio 2008 and its std::map behaves in the same way. On HP-UX I use gcc with and without STLport, and it seems that their STL maps have the same approach to allocating memory for nodes in the std::map.
Is there anything in the std preventing a std::map implementation from putting its nodes into blocks of memory instead of allocating a new block of memory (via its allocator) for each node?
Start by tuning the default allocator on your platform if that is possible. It is useful here to quote Douglas Lea, the author of dlmalloc:
... first I wrote a number of
special-purpose allocators in C++,
normally by overloading operator new
for various classes. ...
However, I soon realized that building
a special allocator for each new class
that tended to be dynamically
allocated and heavily used was not a
good strategy when building kinds of
general-purpose programming support
classes I was writing at the time.
(From 1986 to 1991, I was the
primary author of libg++, the GNU C++
library.) A broader solution was
needed -- to write an allocator that
was good enough under normal C++ and C
loads so that programmers would not be
tempted to write special-purpose
allocators except under very special
conditions.
Or, as a slightly more involved idea, you can try testing your application with the Hoard allocator. I mean, just test your application and see if there is any benefit in terms of performance or fragmentation.