How can heap allocation hurt hardware cache hit ratio? - c++

I have done some tests to investigate the relation between heap allocations and hardware cache behaviour. Empirical results are enlightening but also likely to be misleading, especially across different platforms and in complex/non-deterministic use cases.
There are two scenarios I am interested in: bulk allocation (to implement a custom memory pool) or consecutive allocations (trusting the OS).
Below are two example allocation tests in C++:
// Consecutive allocations
for (auto i = 1000000000; i > 0; i--) {
    int *ptr = new int(0);
    store_ptr_in_some_container(ptr);
}
//////////////////////////////////////
// Bulk allocation
int *ptr = new int[1000000000];
distribute_indices_to_owners(ptr, 1000000000);
My questions are these:
When I iterate over all of them for a read-only operation, how is the CPU cache likely to partition itself?
Despite the empirical results (a visible performance boost for the bulk solution), what happens when some other, relatively small bulk allocation evicts what the previous allocations left in the cache?
Is it reasonable to mix the two in order to avoid code bloat and maintain code readability?
Where do std::vector, std::list, std::map and std::set stand with respect to these concepts?

A general purpose heap allocator has a difficult set of problems to solve. It needs to ensure that released memory can be recycled, must support arbitrarily sized allocations and strongly avoid heap fragmentation.
This always adds extra overhead to each allocation: book-keeping that the allocator needs. At a minimum it must store the size of the block so it can properly reclaim it when the allocation is released, and almost always an offset or pointer to the next block in a heap segment. Allocation sizes are also typically rounded up beyond what was requested to avoid fragmentation problems.
This overhead of course hurts cache efficiency: when the element is small you can't help dragging the book-keeping into the L1 cache, even though you never use it. When you allocate the array in one big gulp there is zero overhead per array element, and you have a hard guarantee that the elements are adjacent in memory, so iterating the array sequentially is going to be as fast as the memory subsystem can support.
Not so for the general-purpose allocator: with such very small allocations the overhead is likely to be 100 to 200%. Nor is there any guarantee of sequential access once the program has been running for a while and array elements have been freed and reallocated. Notably, that is an operation your big array cannot support at all, so be careful not to automatically assume that allocating giant arrays that cannot be released for a long time is necessarily better.
So yes, in this artificial scenario it is very likely you'll be ahead with the big array.
Scratch std::list from that list of collection classes; it has very poor cache efficiency, as the next element is typically at an entirely random place in memory. std::vector is best, just an array under the hood. std::map is usually implemented with a red-black tree, about as good as can reasonably be done, but the access pattern you use matters of course. The same goes for std::set.
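To make the contrast concrete, here is a minimal bump-pointer arena sketch; the Arena class, the sizes and the use of placement new are my own illustration of a custom memory pool, not code from the question. One upfront allocation serves many adjacent objects with no per-allocation book-keeping:

#include <cstddef>
#include <memory>
#include <new>

// Illustrative bump-pointer arena: one big upfront allocation, per-object
// handout with no per-allocation header or free-list maintenance.
class Arena {
    std::unique_ptr<std::byte[]> buffer_;
    std::size_t capacity_;
    std::size_t offset_ = 0;
public:
    explicit Arena(std::size_t bytes)
        : buffer_(std::make_unique<std::byte[]>(bytes)), capacity_(bytes) {}

    // Hands out adjacent, aligned chunks; nothing is freed individually.
    void* allocate(std::size_t size, std::size_t align) {
        std::size_t aligned = (offset_ + align - 1) & ~(align - 1);
        if (aligned + size > capacity_) throw std::bad_alloc{};
        offset_ = aligned + size;
        return buffer_.get() + aligned;
    }
};

int main() {
    Arena arena(1'000'000 * sizeof(int));
    for (int i = 0; i < 1'000'000; ++i) {
        // placement new: construct the int inside the pool, no heap header
        int* p = new (arena.allocate(sizeof(int), alignof(int))) int(i);
        (void)p;
    }
}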

Related

Dynamic allocation store data in random location in the heap?

I know that local variables are stored on the stack in order,
but when I dynamically allocate variables on the heap in C++ like this:
int * a = new int{1};
int * a2 = new int{2};
int * a3 = new int{3};
int * a4 = new int{4};
Question 1: Are these variables stored in contiguous memory locations?
Question 2: If not, is it because dynamic allocation stores variables at random locations in the heap?
Question 3: So does dynamic allocation increase the possibility of cache misses and have low spatial locality?
Part 1: Are separate allocations contiguous?
The answer is probably not. How dynamic allocation occurs is implementation dependent. If you allocate memory like in the above example, two separate allocations might be contiguous, but there is no guarantee of this happening (and it should never be relied on to occur).
Different implementations of C++ use different algorithms for deciding how memory is allocated.
Part 2: Is allocation random?
Somewhat; but not entirely. Memory doesn’t get allocated in an intentionally random fashion. Oftentimes memory allocators will try to allocate blocks of memory near each other in order to minimize page faults and cache misses, but it’s not always possible to do so.
Allocation happens in two stages:
The allocator asks for a large chunk of memory from the OS
The allocator then takes pieces of that large chunk and returns them whenever you call new, until you ask for more memory than it has left to give, in which case it asks the OS for another large chunk.
This second stage is where an implementation can try to give you memory that's near other recent allocations; however, it has little control over the first stage (and the OS usually just provides whatever memory is available, without any knowledge of other allocations by your program).
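A quick way to observe this on your own platform is to print the addresses returned by a few separate allocations; the output is entirely implementation-dependent, so treat it as an experiment rather than a guarantee:

#include <iostream>

int main() {
    // Whether these four allocations end up adjacent (and how big the gap
    // between them is) depends on the allocator implementation.
    int* a  = new int{1};
    int* a2 = new int{2};
    int* a3 = new int{3};
    int* a4 = new int{4};
    std::cout << a << '\n' << a2 << '\n' << a3 << '\n' << a4 << '\n';
    delete a; delete a2; delete a3; delete a4;
}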
Part 3: avoiding cache misses
If cache misses are a bottleneck in your code,
Try to reduce the amount of indirection (by having arrays store objects by value, rather than by pointer);
Ensure that the memory you’re operating on is as contiguous as the design permits (so use a std::array or std::vector, instead of a linked list, and prefer a few big allocations to lots of small ones); and
Try to design the algorithm so that it has to jump around in memory as little as possible.
A good general principle is to just use a std::vector of objects unless you have a good reason to use something fancier. Because of its better cache locality, std::vector is faster at inserting and deleting elements than std::list, even for collections of dozens or hundreds of elements.
Finally: try to take advantage of the stack. Unless there's a good reason for something to be a pointer, just declare it as a variable that lives on the stack. When possible,
Prefer to use MyClass x{}; instead of MyClass* x = new MyClass{};, and
Prefer std::vector<MyClass> instead of std::vector<MyClass*>.
By extension, if you can use static polymorphism (i.e., templates), use that instead of dynamic polymorphism.
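For illustration, here is a small sketch contrasting the two storage strategies; MyClass is a hypothetical payload type, not something from the question:

#include <memory>
#include <vector>

struct MyClass { int x = 0; };  // hypothetical payload type

int main() {
    // Storage by value: elements are contiguous, iteration is cache-friendly.
    std::vector<MyClass> by_value(1000);
    long long sum = 0;
    for (const auto& m : by_value) sum += m.x;

    // Storage by pointer: each element is a separate heap allocation, so
    // iteration chases pointers to potentially scattered locations.
    std::vector<std::unique_ptr<MyClass>> by_pointer;
    for (int i = 0; i < 1000; ++i) by_pointer.push_back(std::make_unique<MyClass>());
    for (const auto& p : by_pointer) sum += p->x;
    return static_cast<int>(sum);
}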
IMHO this is operating-system specific and depends on the C++ standard library implementation.
new ultimately uses lower-level virtual-memory allocation services, allocating several pages at once with system calls like mmap and munmap. The implementation of new could reuse previously freed memory space when relevant.
The implementation of new could use various and different strategies for "large" and "small" allocations.
In the example you gave, the first new results in a system call for memory allocation (usually several pages); the allocated memory could be large enough that subsequent new calls result in contiguous allocations. But this depends on the implementation.
In short:
not at all (there is padding due to alignment, heap housekeeping data, allocated chunks may be reused, etc.),
not at all (AFAIK, heap algorithms are deterministic without any randomness),
generally yes (e.g., memory pooling might help here).

Why is malloc() based on linked-list?

In the worst case, on a section (is this the right term?) of memory of size n, a linked list needs O(n) time to allocate a block of memory of a suitable size.
However, if malloc were tree-based, say using an interval tree, only O(log n) time would be needed. Furthermore, a tree can satisfy requirements such as "find the smallest free block whose size is larger than x", "always allocate at the borders of free memory" and "free only a part of the allocated memory" without extra cost in terms of time complexity. A drawback may be that freeing memory takes O(log n) time.
Thanks
ps. I've seen the question Data structures for traversable memory pool, but the author doesn't seem to have figured it out.
I don't KNOW the answer, but here's some thoughts:
There is absolutely no REQUIREMENT that malloc is implemented in a particular way. However, an unbalanced tree is just as bad as a linked list in the worst case, and a balanced tree takes a lot more maintenance. A tree, where you have two links per node, also takes up more memory than a singly linked list. Removing nodes from a linked list is easy, as is inserting at the end.
And in most systems, there is (almost) exactly one free for every malloc - so if you make one faster by making the other slower, you gain very little.
It is also relatively common that the "next allocation is the same size as the previous one", meaning that if the most recently freed block is first in the list, it's an O(1) operation.
In real-time systems, buckets are often used for allocations: there are a number of fixed sizes, and each time something is allocated out of the main heap the size is rounded up to the nearest larger bucket size; when freed, the block goes into the bucket of that size (which is a linked list). If there is already a free element of that size, that allocation is reused. Besides allocation and free being O(1), this has the benefit of reducing fragmentation. It's not completely impossible to "rip all the heap into small pieces, and then not have any large lumps left", but it is at least not possible to take up most of the memory by simply allocating one byte more each time until you have half the heap size in one allocation.
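A minimal sketch of that bucket idea follows; the size classes and the intrusive FreeNode layout are assumptions for illustration, not taken from any particular real-time allocator:

#include <cstddef>
#include <cstdlib>

// Illustrative size-class allocator: each bucket keeps a singly linked
// free list of blocks of one fixed size, so allocate and free are O(1).
struct FreeNode { FreeNode* next; };

class BucketAllocator {
    static constexpr std::size_t kSizes[] = {16, 32, 64, 128, 256};
    FreeNode* free_lists_[5] = {nullptr, nullptr, nullptr, nullptr, nullptr};

    static int bucket_for(std::size_t n) {
        for (int i = 0; i < 5; ++i)
            if (n <= kSizes[i]) return i;
        return -1;  // too large: fall back to the general heap
    }
public:
    void* allocate(std::size_t n) {
        int b = bucket_for(n);
        if (b < 0) return std::malloc(n);
        if (FreeNode* node = free_lists_[b]) {        // reuse a freed block
            free_lists_[b] = node->next;
            return node;
        }
        return std::malloc(kSizes[b]);                // round up to the bucket size
    }
    void deallocate(void* p, std::size_t n) {
        int b = bucket_for(n);
        if (b < 0) { std::free(p); return; }
        auto* node = static_cast<FreeNode*>(p);       // block is big enough to hold a pointer
        node->next = free_lists_[b];
        free_lists_[b] = node;
    }
};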
(Also, in Linux's GLIBC, allocations over a certain size do not end up in the linked list at all - they are allocated directly via mmap and released back with munmap when free is called)
Finally, algorithmic complexity is not everything - in real life it is the actual time spent that matters. Even if an algorithm is O(n), if each operation is fast it can beat an O(log n) one. Similarly, particularly in C++, small allocations are highly dominant, which means that the overhead of more memory per node is an important factor.
There is no specification which says that malloc needs to be based on a linked list. Between platforms, the implementation may change. On one platform, speed may be crucial and a tree may be implemented; on another, memory is more expensive and a linked list (or similar) is used to save as much memory as possible.

Should I use boost fast pool allocator for following?

I have a server that throughout the course of 24 hours keeps adding new items to a set. Elements are not deleted over the 24 period, just new elements keep getting inserted.
Then at end of period the set is cleared, and new elements start getting added again for another 24 hours.
Do you think a fast pool allocator would be useful here, so as to reuse the memory and possibly help with fragmentation?
The set grows to around 1 million elements. Each element is about 1k.
It's highly unlikely …but you are of course free to test it in your program.
For a collection of that size and allocation pattern (more! more! more! + grow! grow! grow!), you should use an array of vectors. Keep the data in contiguous blocks and reserve() each block when it is created, and you never need to reallocate/resize or waste space and bandwidth traversing lists. A vector is going to be best for your memory layout with a collection that large. Not one big vector (which would take a long time to resize), but several vectors, each representing a chunk (the ideal chunk size can vary by platform -- I'd start with 5MB each and measure from there). If you follow, you see there is no need to resize or reuse memory; just create an allocation every few minutes for the next N objects -- there is no need for high-frequency object allocation and recreation.
A pool allocator would suggest you want a lot of objects with discontiguous allocations and lots of inserts and deletes, like a list of big allocations -- and that is bad here for a few reasons. If you want an implementation that optimizes for contiguous allocation at this size, just aim for the blocks-of-vectors approach. Allocation and lookup will both be close to minimal. At that point, allocation times should be tiny (relative to the other work you do), and there will be nothing unusual or surprising about your allocation patterns. However, the fast pool allocator suggests you treat this collection as a list, which will have terrible performance for this problem.
Once you implement that block+vector approach, a better performance comparison (at that point) would be to compare boost's pool_allocator vs std::allocator. Of course, you could test all three, but memory fragmentation is likely going to be reduced far more by that block of vectors approach, if you implement it correctly. Reference:
If you are seriously concerned about performance, use fast_pool_allocator when dealing with containers such as std::list, and use pool_allocator when dealing with containers such as std::vector.
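A rough sketch of the chunked-vector approach described above; the chunk size and the Element type are placeholders to be tuned and measured on your own platform:

#include <cstddef>
#include <vector>

struct Element { char data[1024]; };  // placeholder for the ~1 KB elements

// Grows in fixed-size chunks; each chunk is a contiguous, pre-reserved
// vector, so adding an element never moves previously stored elements.
class ChunkedStore {
    static constexpr std::size_t kChunkElements = 5 * 1024 * 1024 / sizeof(Element); // ~5 MB chunks
    std::vector<std::vector<Element>> chunks_;
public:
    void add(const Element& e) {
        if (chunks_.empty() || chunks_.back().size() == kChunkElements) {
            chunks_.emplace_back();
            chunks_.back().reserve(kChunkElements);  // one allocation per chunk
        }
        chunks_.back().push_back(e);
    }
    void clear_all() { chunks_.clear(); }  // end of the 24-hour period
};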

Is a std::vector more likely to fail than a std::list (STL containers)

I understand the difference between vector and list regarding the complexity of common operations on them. But supposing we need to deal with very large lists or vectors (i.e. millions of elements), can we say that memory allocation is more likely to fail when working with vectors?
AFAIK, when working with vectors, the elements are stored contiguously. This means that a big block of memory needs to be allocated in order to store all elements, which seems more likely to fail in case of a fragmented heap.
On the other hand, no big blocks of memory are allocated when working with a list; elements aren't stored contiguously.
There are definitely scenarios under which a std::list is more likely to keep working than a std::vector. The most straightforward one is when an application has a high amount of memory fragmentation, essentially when allocations are spread out across the virtual address space. This leads to a large gap between the amount of memory available and the largest contiguous block of memory. In this scenario it's more likely that a std::vector with a large number of elements will fail than a std::list with the same number of elements.
However this is not something I would base my decision on blindly. If the application in question suffered from memory fragmentation and it was affecting my ability to use an std::vector then I would consider jumping to std::list. But at the start I would choose whatever collection I felt best suited my needs.
The comparison is incomplete without mentioning that the two containers carry different per-element memory requirements:
On average, a vector of size N requires N*sizeof(T) bytes (you may need to trim the memory to get the size to exactly N)
A list of size N requires N*(sizeof(T)+2*sizeof(void*)) bytes.
Since each element of a list carries two additional pointers, a list of very small objects (say, ints) may require three to five times the amount of memory required by a vector of identical size. As the size of a list element increases, the additional memory requirements become less pronounced, but for small objects the implications are huge. That is why it is not fair to say that vectors are more likely to fail on memory allocation: although a vector requires a contiguous block of memory, the total amount may be a lot smaller, letting you have vectors of significantly larger sizes before entering the territory of potential failures.
For objects of sizes significantly (say, ten or more times) larger than two pointers you can safely neglect the additional memory requirements: due to the need of vectors to allocate contiguous space, the possibility of not finding a suitable chunk of memory is significantly higher.
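A back-of-the-envelope illustration of those formulas; this is only a sketch, since real std::list nodes also carry per-allocation overhead and padding, so the actual numbers are implementation-specific:

#include <cstddef>
#include <iostream>

int main() {
    // Rough footprint for N ints on a typical 64-bit platform.
    constexpr std::size_t N = 1'000'000;
    constexpr std::size_t vector_bytes = N * sizeof(int);                        // contiguous payload only
    constexpr std::size_t list_bytes   = N * (sizeof(int) + 2 * sizeof(void*));  // payload + two node pointers
    std::cout << "vector ~" << vector_bytes << " bytes, list ~" << list_bytes
              << " bytes (plus per-node allocator overhead)\n";
}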

Using realloc in c++

std::realloc is dangerous in C++ if the malloc'd memory contains non-POD types. It seems the only problem is that std::realloc won't call the types' destructors if it cannot grow the memory in situ.
A trivial workaround would be a try_realloc function. Instead of malloc'ing new memory if the block cannot be grown in situ, it would simply return false; in that case new memory could be allocated, the objects copied (or moved) to the new memory, and finally the old memory freed.
This seems supremely useful. std::vector could make great use of this, possibly avoiding all copies/reallocations.
Preemptive flame retardant: technically that is the same big-O performance, but if vector growth is a bottleneck in your application, a 2x speedup is nice even if the big-O remains unchanged.
BUT, I cannot find any C API that works like a try_realloc.
Am I missing something? Is try_realloc not as useful as I imagine? Is there some hidden bug that makes try_realloc unusable?
Better yet, is there some less-documented API that performs like try_realloc?
NOTE: I'm obviously in library/platform-specific territory here. I'm not worried, as try_realloc is inherently an optimization.
Update:
Following Steve Jessop's comments on whether vector would be more efficient using realloc, I wrote up a proof of concept to test it. The realloc-vector simulates a vector's growth pattern but has the option to realloc instead. I ran the program up to a million elements in the vector.
For comparison a vector must allocate 19 times while growing to a million elements.
The results: if the realloc-vector is the only thing using the heap, the results are awesome - 3-4 allocations while growing to a size of a million bytes.
If the realloc-vector is used alongside a vector that grows at 66% of the speed of the realloc-vector, the results are less promising: it allocates 8-10 times during growth.
Finally, if the realloc-vector is used alongside a vector that grows at the same rate, the realloc-vector allocates 17-18 times, barely saving one allocation over the standard vector behavior.
I don't doubt that a hacker could game allocation sizes to improve the savings, but I agree with Steve that the tremendous effort to write and maintain such an allocator isn't worth the gain.
vector generally grows in large increments. You can't do that repeatedly without relocating, unless you carefully arrange things so that there's a large extent of free addresses just above the internal buffer of the vector (which in effect requires assigning whole pages, because obviously you can't have other allocations later on the same page).
So I think that in order to get a really good optimization here, you need more than a "trivial workaround" that does a cheap reallocation if possible - you have to somehow do some preparation to make it possible, and that preparation costs you address space. If you only do it for certain vectors, ones that indicate they're going to become big, then it's fairly pointless, because they can indicate with reserve() that they're going to become big. You can only do it automatically for all vectors if you have a vast address space, so that you can "waste" a big chunk of it on every vector.
As I understand it, the reason that the Allocator concept has no reallocation function is to keep it simple. If std::allocator had a try_realloc function, then either every Allocator would have to have one (which in most cases couldn't be implemented, and would just have to return false always), or else every standard container would have to be specialized for std::allocator to take advantage of it. Neither option is a great Allocator interface, although I suppose it wouldn't be a huge effort for implementers of almost all Allocator classes just to add a do-nothing try_realloc function.
If vector is slow due to re-allocation, deque might be a good replacement.
You could implement something like the try_realloc you proposed, using mmap with MAP_ANONYMOUS and MAP_FIXED and mremap with MREMAP_FIXED.
Edit: just noticed that the man page for mremap even says:
mremap() uses the Linux page table scheme. mremap() changes the mapping between virtual addresses and memory pages. This can be used to implement a very efficient realloc(3).
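Along those lines, here is a minimal, Linux-only sketch of what a try_realloc built on mremap might look like. It assumes the block was obtained from mmap; calling mremap with flags set to 0 (no MREMAP_MAYMOVE) means the kernel will only extend the mapping in place and will never move the data:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // exposes mremap in <sys/mman.h>
#endif
#include <sys/mman.h>
#include <cstddef>

// Linux-only sketch: try to grow a previously mmap'd block in place.
// On failure the caller falls back to allocate + move + free.
bool try_realloc(void* block, std::size_t old_size, std::size_t new_size) {
    // flags == 0 (no MREMAP_MAYMOVE): mremap succeeds only if the mapping
    // can be extended where it already is; it never relocates the data.
    void* result = mremap(block, old_size, new_size, 0);
    return result != MAP_FAILED;   // true: block now spans new_size bytes
}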
realloc in C is hardly more than a convenience function; it has very little benefit for performance/reducing copies. The main exception I can think of is code that allocates a big array then reduces the size once the size needed is known - but even this might require moving data on some malloc implementations (ones which segregate blocks strictly by size) so I consider this usage of realloc really bad practice.
As long as you don't reallocate your array every time you add an element, but instead grow the array exponentially (e.g. by 25%, 50%, or 100%) whenever you run out of space, manually allocating new memory, copying, and freeing the old will yield roughly the same performance as using realloc (and identical performance in the case of memory fragmentation). This is surely the approach that C++ STL implementations use, so I think your whole concern is unfounded.
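A stripped-down sketch of that manual grow-and-copy strategy; the growth factor, the int element type and the IntBuffer name are arbitrary, and a real vector also has to handle exceptions and non-trivial element types:

#include <algorithm>
#include <cstddef>

// Grow a buffer by 50% when full: allocate, copy, free -- the same dance
// realloc performs when it cannot extend the block in place.
struct IntBuffer {
    int*        data     = nullptr;
    std::size_t size     = 0;
    std::size_t capacity = 0;

    IntBuffer() = default;
    IntBuffer(const IntBuffer&) = delete;             // keep the sketch simple:
    IntBuffer& operator=(const IntBuffer&) = delete;  // no copying
    ~IntBuffer() { delete[] data; }

    void push_back(int value) {
        if (size == capacity) {
            std::size_t new_capacity = capacity ? capacity + capacity / 2 : 8;
            int* new_data = new int[new_capacity];
            std::copy(data, data + size, new_data);   // carry over the old elements
            delete[] data;
            data = new_data;
            capacity = new_capacity;
        }
        data[size++] = value;
    }
};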
Edit: The one (rare but not unheard-of) case where realloc is actually useful is for giant blocks on systems with virtual memory, where the C library interacts with the kernel to relocate whole pages to new addresses. The reason I say this is rare is because you need to be dealing with very big blocks (at least several hundred kB) before most implementations will even enter the realm of dealing with page-granularity allocation, and probably much larger (several MB maybe) before entering and exiting kernelspace to rearrange virtual memory is cheaper than simply doing the copy. Of course try_realloc would not be useful here, since the whole benefit comes from actually doing the move inexpensively.