What are the differences between Block, Stack and Scratch Allocators? - c++

In his talk "Solving the Right Problems for Engine Developers", Mike Acton says that
the vast majority of the time, all you're going to need are these three types of allocator: there's the block allocator, the stack allocator and the scratch allocator
However, he doesn't go into detail about what the differences between these types of allocator are.
I would presume a 'stack allocator' is just a stack-based allocator, but all the other types I've heard of (including 'arena') just sound like fancy ways of doing the same thing, that is, 'allocate a big block and chunk it up in a nice, efficient way, then free it when you're done'.
So, what are the differences between these allocators, what are the advantages of each, why do I only need these three 'the vast majority of the time'?

As was pointed out in the comments, the terminology used in the talk is not well established around the industry, so there is some doubt left as to which exact allocation strategies are being referred to here. Taking into account what is commonly mentioned in game programming literature, here is my educated guess as to what is behind the three mentioned allocators:
Block Allocator
Also known as a pool allocator. This is an allocator that only hands out fixed-size blocks of memory, regardless of how much memory the user actually requested.
Let's say you have a block allocator with a block size of 100 bytes. You want to allocate memory for a single 64 bit integer? It gives you a block of 100 bytes. You want to allocate memory for an array of 20 single precision floats? It gives you a block of 100 bytes. You want to allocate memory for an ASCII string with 101 characters? It gives you an error, as it can't fit your string into 100 bytes.
Block allocators have several advantages. They are relatively easy to implement and they don't suffer from external memory fragmentation. They also usually exhibit a very predictable runtime behavior, which is often essential for video games. They are well suited for problems where most allocations are of roughly the same size and obviously less well suited for when that is not the case.
Apart from the simplest version described here, where each allocator supports only a single block size, extensions exist that are more flexible, supporting multiple block sizes, without compromising too heavily on the aforementioned advantages.
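To make this concrete, here is a minimal sketch of a fixed-size block allocator with an intrusive free list over an embedded buffer. All names are made up for the example, it is single-threaded, and a real implementation would add diagnostics and tune the sizes:

#include <cstddef>

// Minimal fixed-size block allocator sketch (not thread safe).
class BlockAllocator {
public:
    static constexpr std::size_t kBlockSize  = 100;   // usable bytes per block
    static constexpr std::size_t kBlockCount = 1024;

    BlockAllocator() {
        // Thread every block onto the intrusive free list.
        for (std::size_t i = 0; i < kBlockCount; ++i)
            push(reinterpret_cast<Node*>(&pool_[i * kStride]));
    }

    void* allocate(std::size_t size) {
        if (size > kBlockSize || free_ == nullptr)
            return nullptr;        // request doesn't fit, or pool exhausted
        Node* n = free_;           // O(1): pop the first free block
        free_ = n->next;
        return n;
    }

    void deallocate(void* p) {
        if (p != nullptr)
            push(static_cast<Node*>(p));  // O(1): push the block back
    }

private:
    struct Node { Node* next; };

    // Round the per-block stride up so every block is suitably aligned.
    static constexpr std::size_t kAlign  = alignof(std::max_align_t);
    static constexpr std::size_t kStride =
        (kBlockSize + kAlign - 1) / kAlign * kAlign;

    void push(Node* n) { n->next = free_; free_ = n; }

    alignas(std::max_align_t) unsigned char pool_[kStride * kBlockCount];
    Node* free_ = nullptr;
};

Note that both operations are a couple of pointer updates regardless of the requested size; this is where the predictable runtime behavior mentioned above comes from.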
Stack Allocator
A stack allocator works like a stack: You can only deallocate in the inverse order of allocation. If you subsequently allocate objects A and then B, you cannot reclaim the memory for A without also giving up B.
Stack allocators are very easy to implement, as you only need to keep track of a single pointer that marks the separation between the used and unused regions of memory. Allocation moves that pointer into one direction and deallocation moves it the opposite way.
Stack allocators make optimally efficient use of memory and have fully predictable runtime behavior. They obviously work well only for problems where the required order of deallocations is easy to achieve. It is usually not trivial to enforce the correct deallocation order statically, so debugging them can be a pain if they are being used carelessly.
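A minimal sketch of the idea, assuming a caller-provided backing buffer; the Marker/mark/rewind names are illustrative, not an established API:

#include <cstddef>

// Minimal stack allocator sketch: a single index separates used from unused.
class StackAllocator {
public:
    using Marker = std::size_t;

    StackAllocator(unsigned char* buffer, std::size_t capacity)
        : buffer_(buffer), capacity_(capacity) {}

    void* allocate(std::size_t size,
                   std::size_t alignment = alignof(std::max_align_t)) {
        // Round the top up to the requested (power-of-two) alignment.
        std::size_t aligned = (top_ + alignment - 1) & ~(alignment - 1);
        if (aligned + size > capacity_)
            return nullptr;         // out of memory
        top_ = aligned + size;      // bump the pointer forward
        return buffer_ + aligned;
    }

    Marker mark() const { return top_; }   // remember the current top
    void rewind(Marker m) { top_ = m; }    // release everything allocated after m

private:
    unsigned char* buffer_;
    std::size_t capacity_;
    std::size_t top_ = 0;
};

Usage follows the stack discipline: take a marker, allocate A and then B, and rewinding to the marker releases B and A together. There is no way to release A alone while keeping B.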
Scratch Allocator
Also known as a monotonic allocator. A scratch allocator works similarly to a stack allocator. Allocation works exactly the same. Deallocation is a no-op. That is, once memory has been allocated, it cannot be reclaimed.
If you want to get the memory back, you have to destroy the entire scratch allocator, thereby releasing all of its memory at once.
The advantages of the scratch allocator are the same as with the stack allocator. They are well suited for problems where you can naturally identify points at which all allocated objects are no longer needed. Similar to the stack allocator, when used carelessly, they can lead to nasty runtime errors if an allocator is destroyed while there are still active objects alive.
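Incidentally, C++17 ships this strategy under the 'monotonic' name: std::pmr::monotonic_buffer_resource treats deallocate() as a no-op and hands all memory back at once when the resource is destroyed (or release() is called). A small usage sketch:

#include <memory_resource>
#include <vector>

void do_frame() {
    unsigned char buffer[4096];   // scratch space, e.g. for one frame

    // deallocate() on this resource is a no-op; the memory only comes
    // back when the resource is destroyed or release() is called.
    std::pmr::monotonic_buffer_resource scratch(buffer, sizeof(buffer));

    std::pmr::vector<int> temp{&scratch};  // all allocations hit `scratch`
    for (int i = 0; i < 100; ++i)
        temp.push_back(i);

    // ... use temp ...
}   // `scratch` is destroyed here: everything is reclaimed at once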
Why do I only need those three?
Experience shows that in a lot of domains, fully dynamic memory management is not required. Allocations can be grouped either by common size (block allocator) or by common lifetime (scratch and stack allocator). If an engineer working in such a domain is willing to go through the trouble of classifying each allocation accordingly, they can probably make do with just these three allocation strategies for the majority of their dynamic memory needs, without introducing unreasonable additional development effort. As a reward for their efforts, they will benefit from the nice runtime properties of these algorithms, in particular very fast and predictable execution times and predictable memory consumption.
If you are in a domain where it is harder to classify allocations along those lines; or if you cannot or are unwilling to spend the additional engineering effort; or if you are dealing with a special use case that doesn't map well to those three allocators - you will probably still want to use a general-purpose allocator, i.e. good old malloc.
The point that was being made in the talk is more that if you do need to worry about custom memory allocation - and especially in the domain of video games, with its specific requirements and trade-offs - those three types of allocators are very good answers to the specific problems that you may otherwise encounter when naïvely relying on the general-purpose allocator alone.
I gave a long talk about allocators in C++ a while back where I explain all this in more detail if you still want to know more.

Allocator
An allocator in C++ defines a set of functions for allocating and deallocating memory dynamically. Containers such as vector, list, and deque use an allocator to manage their storage. The std::allocator class template is used as the default allocator for most containers.
Several allocators are listed below:
std::allocator
std::pmr::polymorphic_allocator
pool allocator/block allocator
stack allocator
scratch allocator
Block Allocator
A block allocator is an allocator that manages a pool of memory and hands out memory blocks from that pool. Pool allocators are useful for applications where memory allocation and deallocation are performance-critical and where the size of the memory blocks being allocated is known ahead of time.
Stack Allocator
A stack allocator works just like the stack data structure. A stack follows LIFO order, meaning last in, first out: we push data onto the end of the stack one item after another, and when we need to remove data, we always take it from the end rather than from some other position.
A stack allocator follows the same discipline. Suppose a block x is allocated and then another block y is needed; y is placed right after x. Likewise, when memory needs to be reclaimed, y must be released first, before x.
Scratch Allocator
A scratch allocator is a type of memory allocator that can be used to manage temporary memory that is needed for the duration of a specific function or task.

Related

Performance and security in C++ when avoiding use of pointer

I'm trying to create a class in C++ with an idea of absolute encapsulation and efficiency for the sake of practice. In my case this means every data member is supposed to be inside the class with no pointers pointing outside (e.g. to dynamically allocated storage).
For example, I'm using
char name [10];
instead of
std::string name;
char* name;
My idea is that objects of the class are created as completely enclosed blocks on the stack, and that performance is increased, since, if I remember correctly, access to the stack is considerably faster than access to the heap.
Am I correct in those assumptions?
And is this idea of absolute encapsulation sensible outside practice? (For example to ensure safety, since there seems to be no risk of memory mismanagement or buffer overflow)
access to the stack is considerably faster than to the heap
This is false: an access to memory is an access to memory. Two things might have confused you here.
First, it is true that different types of memory can be accessed at different speeds. For example, the disk is usually the slowest (without talking about networking, which complicates things even further), while registers are usually the fastest. In between is the main memory, or RAM, where both the stack and the heap live. And then you can have caches, different types of disks, and so on.
Second, stack allocation is indeed faster than heap allocation, just because the allocation scheme is simpler. With the stack, as the name implies, you can only allocate and deallocate at the end, meaning you need to follow a specific order. With the heap, you can allocate pretty much anywhere, meaning that you can deallocate at any point and in any order. This implies some kind of management of the memory that comes with its own problems, for example fragmentation.
is this idea of absolute encapsulation sensible outside practice?
First of all, only using the stack is impossible in practice simply because of its limited size. While this size varies between platforms, it's unlikely to be more than 8 MB currently. As soon as you need to load a file larger than that, you cannot do it on the stack.
However, even if stack size were practically unlimited, you would still need to deallocate things in the reverse order that you allocated them, otherwise it no longer is a stack. Many things are infeasible that way. For example, as soon as you want interactivity, you need some sort of event processing (to respond to user input), and this is usually done with a queue, which is like the opposite of a stack. Sure, you could allocate an insanely large queue, but that's infeasible in practice. Another example that comes to mind is networking. If you want to deal with multiple connections at once (like a web browser, for example), you need to deal with the memory associated with each one independently. Again, you could allocate an insane amount of memory to each connection, but again, that's infeasible in practice.
Also, note that encapsulation does not mean "no pointers to dynamically allocated memory". Instead, "hidden memory management" would be closer to the meaning of this concept.
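To illustrate with a hedged sketch (not a recommendation over std::string): the class below owns dynamically allocated memory, yet from the outside it is just as encapsulated as the fixed-array version, because the pointer never escapes.

#include <cstddef>
#include <cstring>

class Name {
public:
    explicit Name(const char* s)
        : size_(std::strlen(s)), data_(new char[size_ + 1]) {
        std::memcpy(data_, s, size_ + 1);
    }
    ~Name() { delete[] data_; }
    Name(const Name&) = delete;             // keep the sketch short:
    Name& operator=(const Name&) = delete;  // no copies, no surprises

    const char* c_str() const { return data_; }

private:
    std::size_t size_;
    char* data_;  // dynamically allocated, but an invisible implementation detail
};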

Preallocate memory for dynamic data structure

I have a question/curiosity.
Let's say I want to implement a list; for example, I could basically use the Cormen book approach, where it is explained how to implement insert, delete, key search, etc.
However, nothing is said as far as memory use is concerned. For example, if I wanted to insert an integer into a list of integers, I could first create a node (allocating memory there), store the integer, and then insert the node into the list. If I wanted to delete an integer, once I know which node it is stored in, I would have to free that memory.
I was now wondering whether it would instead be more convenient to preallocate memory to store, say, 10 nodes, keeping a pointer to a free node to be used. If the memory pool is full, I reallocate memory for 20 nodes; if the pool is too large, I halve its size (and so on and so forth). The pool is of course more complicated to manage, since I'd need, for example, to handle possible memory fragmentation, etc.
Does what I'm saying make any sense, or is it nonsense? I've read in a book on game programming that memory preallocation can improve performance, but I was wondering how.
This is both a simple and a complex question. If you operate within standard problems, you don't really need to worry about memory allocation. For example, preallocating memory for 10 nodes won't be efficient at any scale, and your performance problems might lie elsewhere. However, if your program constantly allocates and deallocates hundreds or thousands of small objects per second, it could lead to memory fragmentation, and you might need to write a custom allocator.
Almost no standard container has a method to preallocate element storage; the exception is the std::vector::reserve function. All of them, however, allow custom allocators to be passed in their constructors. Also, there's the placement new operator.
You could try to experiment with such things; they're fun to write. Just don't use them in production unless you absolutely have to.
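For example, std::vector::reserve lets you do the preallocation in one step:

#include <vector>

int main() {
    std::vector<int> v;
    v.reserve(1000);        // one allocation up front
    for (int i = 0; i < 1000; ++i)
        v.push_back(i);     // no reallocation inside the loop
}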
I was now wondering if instead it would be more convenient to preallocate memory to store, say, 10 nodes and keeping a pointer to a free node to be used.
You are basically describing what a pool allocator usually does (I assume you are talking about nodes of constant size). So, the short answer to your question is: yes, you would improve performance by using a pool allocator with a list container.
Memory allocators shipped with common compilers are quite good for general-purpose allocation (i.e. for allocation of objects of random size). However, when your need is to allocate objects of constant size, you should consider using a custom pool allocator. You can easily see why a constant-size object allocator performs faster than the standard one.
You might write your own pool allocator; however, it's not an easy task, and you'd better consider using an existing one, such as Boost's pool_allocator or fast_pool_allocator.
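For instance, assuming Boost is available, plugging fast_pool_allocator into a std::list looks like this:

#include <list>
#include <boost/pool/pool_alloc.hpp>

int main() {
    // Every list node now comes from a pool of same-sized chunks
    // instead of going through malloc/free individually.
    std::list<int, boost::fast_pool_allocator<int>> numbers;

    for (int i = 0; i < 100000; ++i)
        numbers.push_back(i);

    numbers.clear();  // nodes return to the pool, not to the system heap
}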

std::map standard allocator performance versus block allocator

I've read in a C++ optimization cookbook that the standard allocator for STL containers such as std::list, std::set, std::multiset, std::map, and std::multimap can be replaced by a more performant block allocator.
A block allocator has higher performance, low fragmentation and efficient data caching.
I've found on the web the FSBAllocator which claims to be faster than the standard.
http://warp.povusers.org/FSBAllocator/
I've tried it with std::map and found that it does indeed seem to be faster, but my question is: how can the STL implementation be so much slower than a specific allocator, and what are the drawbacks of an allocator other than the standard one, in terms of both portability and robustness? My code must compile on a variety of architectures (win32, osx, linux).
Does anyone have experience with that kind of fixed-size block allocator?
A block allocator does one large allocation from the free store/heap and then internally splits this memory into chunks. One drawback here is that it allocates this chunk (which needs to be large, and is often user-specified on a per-use-case basis) straight up, so even if you don't use all of it, that memory is tied up. Secondly, the standard memory allocator is built on top of new/delete, which in turn is often built on top of malloc/free. Although I don't recall whether malloc/free is guaranteed to be thread safe under all circumstances, it usually is.
But lastly, the reason why block allocators work so well is that they have information that is not available to the standard allocators, and they don't need to cover a very wide set of use cases. For example, if you did std::map< int, int >() and it allocated 1 MB you'd probably be pissed, but if you do std::map< int, int, std::less< int >, block_alloc< 1024 * 1024 > >() you'd be expecting it. The standard allocators don't allocate in blocks; they request new memory via new, and new in turn has no context at all. It gets a memory request of an arbitrary size and needs to find a contiguous run of bytes to return. What most implementations do is maintain a set of memory areas for different multiples (for example, a region for 4-byte chunks is more or less guaranteed to be present, as requests for 4 bytes are very common). If a request isn't an even multiple, it gets harder to return a good chunk without wasting space and causing fragmentation. Basically, memory management of arbitrary sizes is very, very hard to do if you want it to be close to constant time, low fragmentation, thread safe, etc.
The Boost pool allocator documentation has some good info on how a good block allocator can work.
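If a third-party library is not an option, C++17's std::pmr::unsynchronized_pool_resource provides a standard block-style allocator that can be plugged into std::map; a hedged sketch:

#include <map>
#include <memory_resource>

int main() {
    // Pools of fixed-size chunks: node allocations of equal size are
    // packed together instead of hitting the global heap one by one.
    std::pmr::unsynchronized_pool_resource pool;

    std::pmr::map<int, int> m{&pool};
    for (int i = 0; i < 1000; ++i)
        m[i] = i * i;
}   // pool destroyed: all node memory is released in a few large frees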

std::map allocation node packing?

I have noticed that the std::map implementation of Visual Studio (2010) allocates a new single block of memory for each node in its red-black tree. That is, for each element in the map a single new block of raw memory will be allocated via operator new ... malloc with the default allocation scheme of the std::map of the Visual Studio STL implementation.
This appears a bit wasteful to me: Wouldn't it make more sense to allocate the nodes in blocks of "(small) n", just as std::vector implementations over-allocate on growth?
So I'd like the following points clarified:
Is my assertion about the default allocation scheme actually correct?
Do "all" STL implementations of std::map work this way?
Is there anything in the std preventing a std::map implementation from putting its nodes into blocks of memory instead of allocating a new block of memory (via its allocator) for each node? (Complexity guarantees, etc.)?
Note: This is not about premature optimization. If it's about optimization at all, then it's about this: if an app has a problem with (std::)map memory fragmentation, are there alternatives to using a custom allocator that uses a memory pool? This question is not about custom allocators but about how the map implementation uses its allocator. (Or so I hope it is.)
Your assertion is correct for most implementations of std::map.
To my knowledge, there is nothing in the standard preventing a map from using an allocation scheme such as you describe. However, you can get what you describe with a custom allocator — but forcing that scheme on all maps could be wasteful. Because map has no a priori knowledge of how it will be used, certain use patterns could prevent deallocations of mostly-unused blocks. For example, say blocks were allocated for 4 nodes at a time, but a particular map is filled with 40 nodes, then 30 nodes erased, leaving a worst case of one node left per block as map cannot invalidate pointers/references/iterators to that last node.
When you insert elements into a map, it's guaranteed that existing iterators won't be invalidated. Therefore, if you insert an element "B" between two nodes A and C that happen to be contiguous and inside the same heap allocated area, you can't shuffle them to make space, and B will have to be put elsewhere. I don't see any particular problem with that, except that managing such complexities will swell the implementation. If you erase elements then iterators can't be invalidated either, which implies any memory allocation has to hang around until all the nodes therein are erased. You'd probably need a freelist within each "swollen node"/vector/whatever-you-want-to-call-it - effectively duplicating at least some of the time-consuming operations that new/delete currently do for you.
I'm quite certain I've never seen an implementation of std::map that attempted to coalesce multiple nodes into a single allocation block. At least right offhand I can't think of a reason it couldn't work, but I think most implementors would see it as unnecessary, and leave optimization of memory allocation to the allocator instead of worrying about it much in the map itself.
Admittedly, most custom allocators are written to deal better with allocation of a large number of small blocks. You could probably render the vast majority of such optimizations unnecessary by writing map (and, of course, set, multiset, and multimap) to just use larger allocations instead. OTOH, given that allocators that optimize small-block allocation are easily and widely available, there's probably not a lot of motivation to change the map implementation this way either.
I think the only thing you cannot do is invalidate iterators, which you might have to do if you have to reallocate your storage. Having said that, I've seen implementations using a single sorted array of objects wrapped in the std::map interface. That was done for a certain reason, of course.
Actually, what you can do is just instantiate your std::map with your custom allocator, which will find memory for new nodes in a special, non-wasteful way.
This appears a bit wasteful to me. Wouldn't it make more sense to allocate the nodes in blocks of "(small) n", just as std::vector implementations over-allocate on growth
Interestingly, I see it in a completely different way. I find it appropriate, and it doesn't waste any memory. At least with the default STL allocators on Windows (MS VS 2008), HP-UX (gcc with STLport), and Linux (gcc without STLport). What is important is that these allocators do care about memory fragmentation, and it seems they can handle this issue pretty well. For example, look for the Low-fragmentation Heap on Windows or the SBA (Small Block Allocator) on HP-UX. I mean that frequently allocating and deallocating memory for only one node at a time doesn't have to result in memory fragmentation. I tested std::map myself in one of my programs, and it indeed didn't cause any memory fragmentation with these allocators.
Is my assertion about the default allocation scheme actually correct?
I have MS Visual Studio 2008, and its std::map behaves in the same way. On HP-UX I use gcc both with and without STLport, and it seems that their STL maps take the same approach to allocating memory for nodes in the std::map.
Is there anything in the std preventing a std::map implementation from putting its nodes into blocks of memory instead of allocating a new block of memory (via its allocator) for each node?
Start with tuning the default allocator on your platform, if that is possible. It is useful here to quote Douglas Lea, the author of dlmalloc:
... first I wrote a number of special-purpose allocators in C++, normally by overloading operator new for various classes. ...
However, I soon realized that building a special allocator for each new class that tended to be dynamically allocated and heavily used was not a good strategy when building kinds of general-purpose programming support classes I was writing at the time. (From 1986 to 1991, I was the primary author of libg++, the GNU C++ library.) A broader solution was needed -- to write an allocator that was good enough under normal C++ and C loads so that programmers would not be tempted to write special-purpose allocators except under very special conditions.
Or, as a slightly more involved idea, you can try testing your application with the Hoard allocator. I mean, just test your application and see if there is any benefit in terms of performance or fragmentation.

Designing and coding a non-fragmentizing static memory pool

I have heard the term before and I would like to know how to design and code one.
Should I use the STL allocator if available?
How can it be done on devices with no OS?
What are the tradeoffs between using it and using the regular compiler implemented malloc/new?
I would suggest that you should know that you need a non-fragmenting memory allocator before you put much effort into writing your own. The one provided by the std library is usually sufficient.
If you need one, the general idea of reducing fragmentation is to grab large blocks of memory at once and allocate from the pool rather than asking the OS to provide you with heap memory sporadically and at highly varying places within the heap and interspersed with many other objects with varying sizes. Since the author of the specialized memory allocator has more knowledge on the size of the objects allocated from the pool and how those allocations occur, the allocator can use the memory more efficiently than a general purpose allocator such as the one provided by the STL.
You can look at memory allocators such as Hoard, which, while reducing memory fragmentation, can also increase performance by providing thread-specific heaps that reduce contention. This can help your application scale more linearly, especially on multi-core platforms.
More info on multi-threaded allocators can be found here.
I'll try to describe what is essentially a memory pool. I'm just typing this off the top of my head; it's been a while since I've implemented one, so if something is obviously stupid, it's just a suggestion! :)
1.
To reduce fragmentation, you need to create a memory pool that is specific to the type of object you are allocating in it. Essentially, you then restrict the size of each allocation to the size of the object you are interested in. You could implement a templated class which has a list of dynamically allocated blocks (the reason for the list being that you can grow the amount of space available). Each dynamically allocated block would essentially be an array of T.
You would then have a "free" list, which is a singly linked list whose head points to the next available block. Allocation is then simply returning the head. You could overlay the linked list on the blocks themselves, i.e. each "block" (which represents the aligned size of T) would essentially be a union of T and a node in the linked list: when allocated, it's a T; when freed, a node in the list. !!There are obvious dangers!! Alternatively, you could allocate a separate (and protected) block to hold an array of addresses into the block, which adds more overhead.
Allocating is trivial: iterate through the list of blocks and allocate from the first available. Freeing is also trivial; the additional check you have to do is to find the block from which the pointer was allocated and then update the head pointer. (Note: you'll need to use either placement new or override operator new/delete in T - there are ways around this, google is your friend.)
The "static" I believe implies a singleton memory pool for all objects of type T. The downside is that for each T you have to have a separate memory pool. You could be smart, and have a single object that manages pools of different size (using an array of pointers to pool objects where the index is the size of the object for example).
The whole point of the previous paragraph is to outline exactly how complex this is, and like RC says above, be sure you need it before you do it - as it is likely to introduce more pain than may be necessary!
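To make the above concrete, here is a compressed sketch of that design, with the free list overlaid on the slots themselves via a union (dangers included); all names are illustrative and it is not thread safe:

#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

template <typename T, std::size_t ChunkSize = 256>
class ObjectPool {
    // Each slot is either a live T or a link in the free list.
    union Slot {
        Slot* next;
        alignas(T) unsigned char storage[sizeof(T)];
    };

public:
    template <typename... Args>
    T* create(Args&&... args) {
        if (free_ == nullptr)
            grow();
        Slot* s = free_;
        free_ = s->next;
        // Construct the T in place (placement new).
        return ::new (static_cast<void*>(s->storage)) T(std::forward<Args>(args)...);
    }

    void destroy(T* p) {
        p->~T();
        Slot* s = reinterpret_cast<Slot*>(p);  // storage sits at offset 0 of the slot
        s->next = free_;                       // the slot becomes a free-list node again
        free_ = s;
    }

private:
    void grow() {
        // Add another dynamically allocated block of ChunkSize slots
        // and thread all of them onto the free list.
        chunks_.push_back(std::make_unique<Slot[]>(ChunkSize));
        Slot* chunk = chunks_.back().get();
        for (std::size_t i = 0; i < ChunkSize; ++i) {
            chunk[i].next = free_;
            free_ = &chunk[i];
        }
    }

    std::vector<std::unique_ptr<Slot[]>> chunks_;
    Slot* free_ = nullptr;
};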
2.
If the STL allocator meets your needs, use it, it's designed by some very smart people who know what they are doing - however it is for the generic case and if you know how your objects are allocated, you could make the above perform faster.
3.
You need to be able to allocate memory somehow (hardware support, or some sort of HAL - whatever), else I'm not sure how your program would work.
4.
The regular malloc/new does a lot more stuff under the covers (google is your friend, my answer is already an essay!). The simple allocator I describe above isn't re-entrant; of course, you could wrap it with a mutex to provide a bit of cover, and even then, I would hazard that the simple allocator would perform orders of magnitude faster than normal malloc/free.
But if you're at this stage of optimization - presumably you've exhausted the possibility of optimizing your algorithms and data structure usage?