Should I use boost fast pool allocator for following? - c++

I have a server that throughout the course of 24 hours keeps adding new items to a set. Elements are not deleted over the 24 period, just new elements keep getting inserted.
Then at end of period the set is cleared, and new elements start getting added again for another 24 hours.
Do you think a fast pool allocator would be useful here as to reuse the memory and possibly help with fragmentation?
The set grows to around 1 million elements. Each element is about 1k.

It's highly unlikely …but you are of course free to test it in your program.
For a collection of that size and allocation pattern (more! more! more! + grow! grow! grow!), you should use an array of vectors. Just keep it in contiguous blocks and reserve() when they are created and you never need to reallocate/resize or waste space and bandwidth traversing lists. vector is going to be best for your memory layout with a collection that large. Not one big vector (which would take a long time to resize), but several vectors, each which represent chunks (ideal chunk size can vary by platform -- I'd start with 5MB each and measure from there). If you follow, you see there is no need to resize or reuse memory; just create an allocation every few minutes for the next N objects -- there is no need for high frequency/speed object allocation and recreation.
The thing about a pool allocator would suggest you want a lot of objects which have discontiguous allocations, lots of inserts and deletes like a list of big allocations -- this is bad for a few reasons. If you want to create an implementation which optimizes for contiguous allocation at this size, just aim for the blocks with vectors approach. Allocation and lookup will both be close to minimal. At that point, allocation times should be tiny (relative to the other work you do). Then you will also have nothing unusual or surprising about your allocation patterns. However, the fast pool allocator suggests you treat this collection as a list, which will have terrible performance for this problem.
Once you implement that block+vector approach, a better performance comparison (at that point) would be to compare boost's pool_allocator vs std::allocator. Of course, you could test all three, but memory fragmentation is likely going to be reduced far more by that block of vectors approach, if you implement it correctly. Reference:
If you are seriously concerned about performance, use fast_pool_allocator when dealing with containers such as std::list, and use pool_allocator when dealing with containers such as std::vector.

Related

How can heap allocation hurt hardware cache hit ratio?

I have done some tests to investigate the relation between heap allocations and hardware cache behaviour. Empirical results are enlightening but also likely to be misleading, especially between different platforms and complex/indeterministic use cases.
There are two scenarios I am interested in: bulk allocation (to implement a custom memory pool) or consequent allocations (trusting the os).
Below are two example allocation tests in C++
//Consequent allocations
for(auto i = 1000000000; i > 0; i--)
int *ptr = new int(0);
store_ptr_in_some_container(ptr);
//////////////////////////////////////
//Bulk allocation
int *ptr = new int[1000000000];
distribute_indices_to_owners(ptr, 1000000000);
My questions are these:
When I iterate over all of them for a read only operation, how will cache
memory in CPU will likely to partition itself?
Despite empirical results (visible performance boost by bulk
solution), what does happen when some other, relatively very small
bulk allocation overrides cache from previous allocations?
Is it reasonable to mix the two in order two avoid code bloat and maintain code readability?
Where does std::vector, std::list, std::map, std::set stand in these concepts?
A general purpose heap allocator has a difficult set of problems to solve. It needs to ensure that released memory can be recycled, must support arbitrarily sized allocations and strongly avoid heap fragmentation.
This will always include extra overhead for each allocation, book-keeping that the allocator needs. At a minimum it must store the size of the block so it can properly reclaim it when the allocation is released. And almost always an offset or pointer to the next block in a heap segment, allocation sizes are typically larger than requested to avoid fragmentation problems.
This overhead of course affects cache efficiency, you can't help getting it into the L1 cache when the element is small, even though you never use it. You have zero overhead for each array element when you allocate the array in one big gulp. And you have a hard guarantee that each element is adjacent in memory so iterating the array sequentially is going to be as fast as the memory sub-system can support.
Not the case for the general purpose allocator, with such very small allocations the overhead is likely to be 100 to 200%. And no guarantee for sequential access either when the program has been running for a while and array elements were reallocated. Notably an operation that your big array cannot support so be careful that you don't automatically assume that allocating giant arrays that cannot be released for a long time is necessarily better.
So yes, in this artificial scenario is very likely you'll be ahead with the big array.
Scratch std::list from that quoted list of collection classes, it has very poor cache efficiency as the next element is typically at an entirely random place in memory. std::vector is best, just an array under the hood. std::map is usually done with a red-black tree, as good as can reasonably be done but the access pattern you use matters of course. Same for std::set.

c++ unordered_map is there a way to pre-allocate memory for elements if max size known in advance

Looks like reserve/rehash functions only pre-allocate the number of buckets, not memory for the elements(key,vlaue) pairs to be inserted.
Is there a way we can pre-allocate memory for elements also, so low-latency apps dont need to waste time on dynamic memory allocation.
One possibility would be to write your own allocator. This can be especially effective if you have at least a fair idea of how many items are likely to go in the table (so you can pre-allocate space for all of them) and don't care about re-using space for items if they're removed from the table (so your bookkeeping is simple).
In such a case, you can basically pre-allocate space for N objects, and simply keep track of the position of the next item to be allocated. Allocating the object consists of simply returning the address and incrementing the pointer, as in return *next++;
Of course, this doesn't truly eliminate the dynamic allocation--it just makes it cheap enough that you probably don't care about it any more (and since it's supplied as a template parameter, there's a good chance of its being expanded inline, so you don't even get the overhead of a function call in the process.
Even if you can't put up with quite that restrictive of an allocator, a general-purpose allocator for fixed-size objects will still usually be (at least somewhat) faster than one for variable sizes of objects. It still won't eliminate the dynamic allocation, but it may give enough improvement in speed to work quite a bit better for your purpose.

Linked list vs dynamic array for implementing a stack using vector class

I was reading up on the two different ways of implementing a stack: linked list and dynamic arrays. The main advantage of a linked list over a dynamic array was that the linked list did not have to be resized while a dynamic array had to be resized if too many elements were inserted hence wasting alot of time and memory.
That got me wondering if this is true for C++ (as there is a vector class which automatically resizes whenever new elements are inserted)?
It's difficult to compare the two, because the patterns of their memory usage are quite different.
Vector resizing
A vector resizes itself dynamically as needed. It does that by allocating a new chunk of memory, moving (or copying) data from the old chunk to the new chunk, the releasing the old one. In a typical case, the new chunk is 1.5x the size of the old (contrary to popular belief, 2x seems to be quite unusual in practice). That means for a short time while reallocating, it needs memory equal to roughly 2.5x as much as the data you're actually storing. The rest of the time, the "chunk" that's in use is a minimum of 2/3rds full, and a maximum of completely full. If all sizes are equally likely, we can expect it to average about 5/6ths full. Looking at it from the other direction, we can expect about 1/6th, or about 17% of the space to be "wasted" at any given time.
When we do resize by a constant factor like that (rather than, for example, always adding a specific size of chunk, such as growing in 4Kb increments) we get what's called amortized constant time addition. In other words, as the array grows, resizing happens exponentially less often. The average number of times items in the array have been copied tends to a constant (usually around 3, but depends on the growth factor you use).
linked list allocations
Using a linked list, the situation is rather different. We never see resizing, so we don't see extra time or memory usage for some insertions. At the same time, we do see extra time and memory used essentially all the time. In particular, each node in the linked list needs to contain a pointer to the next node. Depending on the size of the data in the node compared to the size of a pointer, this can lead to significant overhead. For example, let's assume you need a stack of ints. In a typical case where an int is the same size as a pointer, that's going to mean 50% overhead -- all the time. It's increasingly common for a pointer to be larger than an int; twice the size is fairly common (64-bit pointer, 32-bit int). In such a case, you have ~67% overhead -- i.e., obviously enough, each node devoting twice as much space to the pointer as the data being stored.
Unfortunately, that's often just the tip of the iceberg. In a typical linked list, each node is dynamically allocated individually. At least if you're storing small data items (such as int) the memory allocated for a node may be (usually will be) even larger than the amount you actually request. So -- you ask for 12 bytes of memory to hold an int and a pointer -- but the chunk of memory you get is likely to be rounded up to 16 or 32 bytes instead. Now you're looking at overhead of at least 75% and quite possibly ~88%.
As far as speed goes, the situation is rather similar: allocating and freeing memory dynamically is often quite slow. The heap manager typically has blocks of free memory, and has to spend time searching through them to find the block that's most suited to the size you're asking for. Then it (typically) has to split that block into two pieces, one to satisfy your allocation, and another of the remaining memory it can use to satisfy other allocations. Likewise, when you free memory, it typically goes back to that same list of free blocks and checks whether there's an adjoining block of memory already free, so it can join the two back together.
Allocating and managing lots of blocks of memory is expensive.
cache usage
Finally, with recent processors we run into another important factor: cache usage. In the case of a vector, we have all the data right next to each other. Then, after the end of the part of the vector that's in use, we have some empty memory. This leads to excellent cache usage -- the data we're using gets cached; the data we're not using has little or no effect on the cache at all.
With a linked list, the pointers (and probable overhead in each node) are distributed throughout our list. I.e., each piece of data we care about has, right next to it, the overhead of the pointer, and the empty space allocated to the node that we're not using. In short, the effective size of the cache is reduced by about the same factor as the overall overhead of each node in the list -- i.e., we might easily see only 1/8th of the cache storing the date we care about, and 7/8ths devoted to storing pointers and/or pure garbage.
Summary
A linked list can work well when you have a relatively small number of nodes, each of which is individually quite large. If (as is more typical for a stack) you're dealing with a relatively large number of items, each of which is individually quite small, you're much less likely to see a savings in time or memory usage. Quite the contrary, for such cases, a linked list is much more likely to basically waste a great deal of both time and memory.
Yes, what you say is true for C++. For this reason, the default container inside std::stack, which is the standard stack class in C++, is neither a vector nor a linked list, but a double ended queue (a deque). This has nearly all the advantages of a vector, but it resizes much better.
Basically, an std::deque is a linked list of arrays of sorts internally. This way, when it needs to resize, it just adds another array.
First, the performance trade-offs between linked-lists and dynamic arrays are a lot more subtle than that.
The vector class in C++ is, by requirement, implemented as a "dynamic array", meaning that it must have an amortized-constant cost for inserting elements into it. How this is done is usually by increasing the "capacity" of the array in a geometric manner, that is, you double the capacity whenever you run out (or come close to running out). In the end, this means that a reallocation operation (allocating a new chunk of memory and copying the current content into it) is only going to happen on a few occasions. In practice, this means that the overhead for the reallocations only shows up on performance graphs as little spikes at logarithmic intervals. This is what it means to have "amortized-constant" cost, because once you neglect those little spikes, the cost of the insert operations is essentially constant (and trivial, in this case).
In a linked-list implementation, you don't have the overhead of reallocations, however, you do have the overhead of allocating each new element on freestore (dynamic memory). So, the overhead is a bit more regular (not spiked, which can be needed sometimes), but could be more significant than using a dynamic array, especially if the elements are rather inexpensive to copy (small in size, and simple object). In my opinion, linked-lists are only recommended for objects that are really expensive to copy (or move). But at the end of the day, this is something you need to test in any given situation.
Finally, it is important to point out that locality of reference is often the determining factor for any application that makes extensive use and traversal of the elements. When using a dynamic array, the elements are packed together in memory one after the other and doing an in-order traversal is very efficient as the CPU can preemptively cache the memory ahead of the reading / writing operations. In a vanilla linked-list implementation, the jumps from one element to the next generally involves a rather erratic jumps between wildly different memory locations, which effectively disables this "pre-fetching" behavior. So, unless the individual elements of the list are very big and operations on them are typically very long to execute, this lack of pre-fetching when using a linked-list will be the dominant performance problem.
As you can guess, I rarely use a linked-list (std::list), as the number of advantageous applications are few and far between. Very often, for large and expensive-to-copy objects, it is often preferable to simply use a vector of pointers (you get basically the same performance advantages (and disadvantages) as a linked list, but with less memory usage (for linking pointers) and you get random-access capabilities if you need it).
The main case that I can think of, where a linked-list wins over a dynamic array (or a segmented dynamic array like std::deque) is when you need to frequently insert elements in the middle (not at either ends). However, such situations usually arise when you are keeping a sorted (or ordered, in some way) set of elements, in which case, you would use a tree structure to store the elements (e.g., a binary search tree (BST)), not a linked-list. And often, such trees store their nodes (elements) using a semi-contiguous memory layout (e.g., a breadth-first layout) within a dynamic array or segmented dynamic array (e.g., a cache-oblivious dynamic array).
Yes, it's true for C++ or any other language. Dynamic array is a concept. The fact that C++ has vector doesn't change the theory. The vector in C++ actually does the resizing internally so this task isn't the developers' responsibility. The actual cost doesn't magically disappear when using vector, it's simply offloaded to the standard library implementation.
std::vector is implemented using a dynamic array, whereas std::list is implemented as a linked list. There are trade-offs for using both data structures. Pick the one that best suits your needs.
As you indicated, a dynamic array can take a larger amount of time adding an item if it gets full, as it has to expand itself. However, it is faster to access since all of its members are grouped together in memory. This tight grouping also usually makes it more cache-friendly.
Linked lists don't need to resize ever, but traversing them takes longer as the CPU must jump around in memory.
That got me wondering if this is true for c++ as there is a vector class which automatically resizes whenever new elements are inserted.
Yes, it still holds, because a vector resize is a potentially expensive operation. Internally, if the pre-allocated size for the vector is reached and you attempt to add new elements, a new allocation takes place and the old data is moved to the new memory location.
From the C++ documentation:
vector::push_back - Add element at the end
Adds a new element at the end of the vector, after its current last element. The content of val is copied (or moved) to the new element.
This effectively increases the container size by one, which causes an automatic reallocation of the allocated storage space if -and only if- the new vector size surpasses the current vector capacity.
http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style
Skip to 44:40. You should prefer std::vector whenever possible over a std::list, as explained in the video, by Bjarne himself. Since std::vector stores all of it's elements next to each other, in memory, and because of that it will have the advantage of being cached in memory. And this is true for adding and removing elements from std::vector and also searching. He states that std::list is 50-100x slower than a std::vector.
If you really want a stack, you should really use std::stack instead of making your own.

Vector of 20,000 small objects vs vector of 20,000 object pointers to 20,000 heap objects

Developing a 32-bit C++/carbon app under OS X Snow Leopard, ran into a problem where an stl vector of approximately 20,000 small objects (72 bytes each) was failing during a reallocation. Seems the vector, which was several megabytes in size, couldn't expand to a contiguous piece of memory, which at the point of failure was only 1.2 MB in size.
GuardMalloc[Appname-33692]: *** mmap(size=2097152) failed (error code=12)
*** error: can't allocate region
GuardMalloc[Appname-35026]: Failed to VM allocate 894752 bytes
GuardMalloc[ Appname-35026]: Explicitly trapping into debugger!!!
#0 0x00a30da8 in GMmalloc_zone_malloc_internal
#1 0x00a31710 in GMmalloc
#2 0x94a54617 in operator new
#3 0x0026f1d3 in __gnu_cxx::new_allocator<DataRecord>::allocate at new_allocator.h:88
#4 0x0026f1f8 in std::_Vector_base<DataRecord, std::allocator<DataRecord> >::_M_allocate at stl_vector.h:117
#5 0x0026f373 in std::vector<DataRecord, std::allocator<DataRecord> >::_M_insert_aux at vector.tcc:275
#6 0x0026f5a6 in std::vector<DataRecord, std::allocator<DataRecord> >::push_back at stl_vector.h:610
I can think of several strategies:
1) Reserve() a really, really big vector as soon as the app launches. However, this assumes the user might not load additional files that contribute to this vector, pushing it beyond the pre-allocated limit and possibly getting back into the same situation.
2) Change the vector of objects/memory allocations into a vector of pointers to objects/memory allocations. Clearly makes the vector itself a more manageable size, but then creates 20,000 small objects (which could eventually become like 50,000 objects, depending on what additional files the user loads). Does this create a gigantic overhead problem?
3) Change from a vector to a list, which may have its own overhead issues.
The vector is being constantly iterated through, and generally only appended to.
Any sage thoughts on these issues?
===============
ADDITIONAL NOTE: this particular vector just holds all imported record, so they can be indexed and sorted by ANOTHER vector that contains a sort order. Once an item is put into this vector, it stays there for the lifetime of the app (also helps support undo operations by making sure the index into the vector always remains the same for that particular object).
I think a std::deque would be more suitable than a std::list or a std::vector in your case. std::list is not efficient in iteration or random indexing, while std::vector is slow to resize (as you have observed). A std::deque does not need large amounts of memory when resizing, at the cost of slightly slower random indexing than a vector.
Don't use continuous storage like a vector. Go for a deque or list and the reallocations will not fail anymore.
If you really need high performance, consider writing your own container (ie ArrayList).
Out of your three options, 1 doesn't seem like a guaranteed solution, while 2 adds complexity and the vector still has to grow.
Option 3 seems somewhat reasonable (or possibly use a deque as mentioned in another answer) because while it's semantically similar to option 2, it provides a more normalized method of allocating a new data object. Of course this assumes that you only append data and don't need random access.
However all that said I find it incredible that your program has fragmented memory so badly that on reasonably modern hardware it can't allocate 1.2MB of memory. Far more likely is that there's some undefined behavior lurking (or possibly a memory leak) in your program causing it to behave in this way, failing to allocate the memory. You could use valgrind to help hunt down what may be going on. Does the same problem happen when you use the builtin new and delete rather than GMalloc?
Is your program being ulimited to only have access to a small amount of memory?
Finally, if valgrind finds nothing and your program really is fragmenting memory horribly, I would consider stepping back and reconsidering your approach. You may want to evaluate an alternate approach that doesn't rely on millions(?) of allocations (I just can't see a small number of allocations fragmenting the heap that much).
if even in the heap, there is not enough contigous space, use deque;
deque allocate not contigous space when it is needed. so it could handle the limit of 1.2 MB
deque is made of some number of blocks of memory not only one contigous space. that s why it could work. but it is not sure(/totaly safe) because you dont control the behaviour of the deque.
see thisabout memory fragmentation (following is copy/paste from the web page):
http://www.design-reuse.com/articles/25090/dynamic-memory-allocation-fragmentation-c.html :
Memory Fragmentation
The best way to understand memory fragmentation is to look at an example. For this example, it is assumed hat there is a 10K heap. First, an area of 3K is requested, thus:
#define K (1024)
char *p1;
p1 = malloc(3*K);
Then, a further 4K is requested:
p2 = malloc(4*K);
3K of memory is now free.
Some time later, the first memory allocation, pointed to by p1, is de-allocated:
free(p1);
This leaves 6K of memory free in two 3K chunks. A further request for a 4K allocation is issued:
p1 = malloc(4*K);
This results in a failure – NULL is returned into p1 – because, even though 6K of memory is available, there is not a 4K contiguous block available. This is memory fragmentation.
this is an issue even for kernels using virtual memory such as osx.

Given an Array, is there an algorithm that can allocate memory out of it?

I'm doing some graphics programming and I'm using Vertex pools. I'd like to be able to allocate a range out of the pool and use this for drawing.
Whats different from the solution I need than from a C allocator is that I never call malloc. Instead I preallocate the array and then need an object that wraps that up and keeps track of the free space and allocates a range (a pair of begin/end pointers) from the allocation I pass in.
Much thanks.
in general: you're looking for a memory mangager, which uses a (see wikipedia) memory pool (like the boost::pool as answered by TokenMacGuy). They come in many flavours. Important considerations:
block size (fixed or variable; number of different block sizes; can the block size usage be predicted (statistically)?
efficiency (some managers have 2^n block sizes, i.e. for use in network stacks where they search for best fit block; very good performance and no fragementation at the cost of wasting memory)
administration overhead (I presume that you'll have many, very small blocks; so the number of ints and pointers maintainted by the memory manager is significant for efficiency)
In case of boost::pool, I think the simple segragated storage is worth a look.
It will allow you to configure a memory pool with many different block sizes for which a best-match is searched for.
boost::pool does this for you very nicely!
Instead I preallocate the array and then need an object that wraps that up and keeps track of the free space and allocates a range (a pair of begin/end pointers) from the allocation I pass in.
That's basically what malloc() does internally (malloc() can increase the size of this "preallocated array" if it gets full, though). So yes, there is an algorithm for it. There are many, in fact, and Wikipedia gives a basic overview. Different strategies can work better in different situations. (E.g. if all the blocks are a similar size, or if there's some pattern to allocation and freeing)
If you have many objects of the same size, look into obstacks.
You probably don't want to write the code yourself, it's not an easy task and bugs can be painful.