Preventing memory freeing in STL container - C++

I have an STL container (std::list) that I am constantly reusing. By this I mean I:
- push a number of elements into the container
- remove the elements during processing
- clear the container
- rinse and repeat a large number of times
When profiling with callgrind I see a large number of calls to new (malloc) and delete (free), which can be quite expensive. I am therefore looking for a way to preallocate a reasonably large number of elements. I would also like the allocation pool to keep growing until a high-water mark is reached, and then to hang onto that memory until the container itself is deleted.
Unfortunately the standard allocator continually resizes the memory pool, so I am looking for an allocator that does the above without my having to write my own.
Does such an allocator exist, and where can I find one?
I am working on both Linux using GCC and Android using STLport.
Edit: Placement new is fine; what I want to minimize is heap walking, which is expensive. I would also like all my objects to be as close to each other as possible to minimize cache misses.

It sounds like you may just be using the wrong kind of container: with a list, each element occupies a separate chunk of memory to allow individual inserts/deletes, so every addition to or deletion from the list requires a separate new/delete.
If you can use a std::vector instead, then you can reserve the required size before adding the items.
Also, for deletion it's usually best not to remove the items individually: just call clear() on the container to empty it.
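A minimal sketch of that pattern (sizes are arbitrary): reserve once up front, and note that clear() resets the size but keeps the capacity, so later passes reuse the same heap block.

#include <vector>

std::vector<int> work;

void one_pass(int n) {
    for (int i = 0; i < n; ++i)
        work.push_back(i);
    // ... process the elements ...
    work.clear();  // size -> 0, capacity unchanged: no free, no later malloc
}

int main() {
    work.reserve(1024);  // one up-front allocation
    for (int pass = 0; pass < 1000; ++pass)
        one_pass(1000);
}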
Edit: You've now made it clear in the comments that your 'remove the elements during processing' step is removing elements from the middle of the list and must not invalidate iterators, so switching to a vector is not suitable. I'll leave this answer for now (for the sake of the comment thread!)

The allocator boost::fast_pool_allocator is designed for use with std::list.
The documentation claims that "If you are seriously concerned about performance, use boost::fast_pool_allocator when dealing with containers such as std::list, and use boost::pool_allocator when dealing with containers such as std::vector."
Note that boost::fast_pool_allocator is a singleton and by default it never frees allocated memory. However, it is implemented using boost::singleton_pool and you can make it free memory by calling the static functions boost::singleton_pool::release_memory() and boost::singleton_pool::purge_memory().
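A minimal sketch of the std::list case (pass counts are arbitrary; the release_memory() call is left as a comment because its size argument must match the list's node size, which is implementation-specific):

#include <list>
#include <boost/pool/pool_alloc.hpp>

int main() {
    // std::list rebinds the allocator to its internal node type, so the
    // pool's chunk size matches the node size automatically.
    std::list<int, boost::fast_pool_allocator<int> > l;

    for (int pass = 0; pass < 1000; ++pass) {
        for (int i = 0; i < 100; ++i)
            l.push_back(i);
        l.clear();  // nodes go back to the pool, not to the heap
    }
    // To actually free the pool's memory you would call something like:
    // boost::singleton_pool<boost::fast_pool_allocator_tag,
    //                       NodeSize>::release_memory();
}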

You can try benchmarking your app with http://goog-perftools.sourceforge.net/doc/tcmalloc.html; I've seen good improvements in some of my projects (no numbers at hand though, sorry).
EDIT: It seems the code/download has moved here: http://code.google.com/p/gperftools/?redir=1

Comment was too short so I will post my thoughts as an answer.
IMO, new/delete can come only from two places in this context.
I believe std::list<T> is implemented with nodes, as lists normally are, for various reasons. Therefore each insertion and removal of an element results in the new/delete of a node. Moreover, if objects of type T perform any allocations or deallocations in their constructor/destructor, those will be called as well.
You can avoid recreating the standard machinery by iterating over existing nodes again instead of deleting them. You can use std::vector with std::vector::reserve, or std::array if you want to squeeze it down to C level.
Nonetheless, every object created must eventually have its destructor called. The only way I see to avoid repeated construction and destruction is to use T::operator= when iterating over the reused container, or perhaps C++11 move semantics if they suit your case.
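A hedged sketch of that reuse idea (the Item type and all names are invented): elements stay constructed across passes and are overwritten with operator= instead of being destroyed and recreated.

#include <cstddef>
#include <vector>

struct Item { int payload = 0; };

std::vector<Item> slots(1000);  // constructed once, reused every pass

void process_pass(const std::vector<int>& inputs) {
    std::size_t used = 0;
    for (int v : inputs) {
        if (used == slots.size())
            slots.emplace_back();   // grows only past the high-water mark
        slots[used++].payload = v;  // plain assignment: no new/delete
    }
    // only slots[0..used) are meaningful this pass; nothing is destroyed
}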

Related

Does std::unordered_map::erase actually perform dynamic deallocation?

It isn't difficult to find information on the big-O time behavior of STL container operations. However, we operate in a hard real-time environment, and I'm having a lot more trouble finding information on their heap memory usage behavior.
In particular I had a developer come to me asking about std::unordered_map. We're allowed to be non-realtime at startup, so he was hoping to perform a .reserve() at startup time. However, he's finding he gets overruns at runtime. The operations he uses are lookups, insertions, and deletions with .erase().
I'm a little worried about whether that .reserve() actually prevents later runtime memory allocations (I don't really understand the explanation of what it does with respect to heap usage), but for .erase() in particular I see no guarantee whatsoever that it won't be asking the heap for a dynamic deallocation when called.
So the question is what's the specified heap interactions (if any) for std::unordered_map::erase, and if it actually does deallocations, if there's some kind of trick that can be used to avoid them?
The standard doesn't specify container allocation patterns per se. These are effectively derived from the iterator/reference invalidation rules. For example, vector::insert only invalidates all references when the number of elements inserted causes the size of the container to exceed its capacity, which means a reallocation happened.
By contrast, the only operations on unordered_map which invalidates references are those which actually remove that particular element. Even a rehash (which likely allocates memory) does not invalidate references (this is why reserve changes nothing).
This means that each element must be stored separately from the hash table itself. They are individual nodes (which is why it has a node_type extraction interface), and must be able to be allocated and deallocated individually.
So it is reasonable to assume that each insertion or erasure represents at least one allocation/deallocation.
If you're all right with nodes continuing to consume memory, even after they've been removed from the container, you could pretty easily write an Allocator class that basically made deallocation a NOP.
Quite a few real-time systems basically allocate all the memory they're going to use up-front, then once they've finished initialization they neither allocate nor release memory. This would allow you to do pretty much the same thing with an unordered_map.
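As a rough, hedged illustration of both points: below is a minimal bump allocator over a preallocated arena whose deallocate() is a no-op, so erase() never returns memory to the heap; everything is reclaimed only when the arena itself goes away. The ArenaAlloc name, arena size, and layout are invented for the example; this is a sketch, not production code.

#include <cstddef>
#include <functional>
#include <new>
#include <unordered_map>

template <class T>
struct ArenaAlloc {
    using value_type = T;

    char*        base;   // start of the preallocated arena
    std::size_t  size;   // arena capacity in bytes
    std::size_t* used;   // shared cursor, so rebound copies bump the same arena

    ArenaAlloc(char* b, std::size_t s, std::size_t* u) : base(b), size(s), used(u) {}
    template <class U>
    ArenaAlloc(const ArenaAlloc<U>& o) : base(o.base), size(o.size), used(o.used) {}

    T* allocate(std::size_t n) {
        // round the cursor up to T's alignment, then bump it
        std::size_t off = (*used + alignof(T) - 1) / alignof(T) * alignof(T);
        if (off + n * sizeof(T) > size) throw std::bad_alloc();
        *used = off + n * sizeof(T);
        return reinterpret_cast<T*>(base + off);
    }
    void deallocate(T*, std::size_t) noexcept {}  // intentionally a no-op
};

template <class T, class U>
bool operator==(const ArenaAlloc<T>& a, const ArenaAlloc<U>& b) { return a.base == b.base; }
template <class T, class U>
bool operator!=(const ArenaAlloc<T>& a, const ArenaAlloc<U>& b) { return a.base != b.base; }

int main() {
    static char arena[1 << 20];  // all memory acquired up-front
    std::size_t used = 0;
    using Alloc = ArenaAlloc<std::pair<const int, int>>;
    std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, Alloc>
        m(Alloc(arena, sizeof(arena), &used));

    m.reserve(1000);  // the bucket array comes out of the arena too
    m[1] = 10;
    m.erase(1);       // no heap interaction at all
}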
That said, I'm somewhat skeptical about the benefit in this case. The main strength of unordered_map is supporting insertion and deletion that are usually fast. If you're not going to be doing insertion at runtime, chances are pretty good that it's not a particularly great choice.
If it's a collection that's mostly filled during initialization, then used mostly as-is, with a few items being "removed", but no more being inserted after you finish initialization, you're likely to be better off with a simple sorted array and an interpolating search (or, if the data is distributed extremely unpredictably, maybe a binary search--but an interpolating search is usually better). In this case, I'd handle removal by simply adding a boolean to each item saying whether that item is valid or not. Erase by setting that value to false. If you find such a value during a search, you basically just ignore it.
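A small sketch of that tombstone scheme (the Entry layout is illustrative, and a plain binary search stands in for the interpolating search):

#include <algorithm>
#include <vector>

struct Entry {
    int  key;
    int  value;
    bool live;   // erase = set to false; searches skip dead entries
};

const Entry* find(const std::vector<Entry>& table, int key) {
    auto it = std::lower_bound(table.begin(), table.end(), key,
        [](const Entry& e, int k) { return e.key < k; });
    if (it != table.end() && it->key == key && it->live)
        return &*it;
    return nullptr;  // absent, or erased via its tombstone
}

void erase(std::vector<Entry>& table, int key) {
    auto it = std::lower_bound(table.begin(), table.end(), key,
        [](const Entry& e, int k) { return e.key < k; });
    if (it != table.end() && it->key == key)
        it->live = false;  // no heap traffic, no element moves
}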

Preallocate memory for dynamic data structure

I have a question/curiosity.
Let's say I want to implement a list; for example, I could basically use the Cormen book approach, where it is explained how to implement insert, delete, key search, etc.
However, nothing is said about memory use. For example, if I wanted to insert an integer into a list of integers, I could first create a node (allocating memory there), put the integer in it, and then insert the node into the list. If I wanted to delete an integer, once I know in which node it is stored, I would have to free that memory.
I was now wondering whether it would instead be more convenient to preallocate memory to store, say, 10 nodes, keeping a pointer to a free node to be used. If the memory pool is full, I reallocate memory for 20 nodes; if the pool becomes too large, I halve its size (and so on and so forth). The pool is of course more complicated to manage, since I'd need, for example, to handle possible memory fragmentation.
Does what I'm saying make any sense, or is it nonsense? I've read in a book on game programming that memory preallocation can improve performance, but I was wondering how.
This is both a simple and a complex question. If you operate within standard problems, you don't really need to worry about memory allocation; for example, preallocating memory for 10 nodes won't be noticeable at any scale, and your performance problems are likely to be elsewhere. However, if your program constantly allocates and deallocates hundreds or thousands of small objects per second, it can lead to memory fragmentation, and you might need to write a custom allocator.
Almost no standard container has a method to preallocate element storage; the exception is std::vector::reserve. All of them, however, let you pass a custom allocator to the constructor. There is also the placement new operator.
You could try to experiment with such things, they're fun to write, just don't use them in production if you absolutely don't have to.
I was now wondering if instead it would be more convenient to preallocate memory to store, say, 10 nodes, keeping a pointer to a free node to be used.
You basically are describing what a pool allocator usually does (I assume you are talking about nodes of constant size). So, the short answer to your question is: yes you would improve performance by using a pool allocator with a list container.
Memory allocators shipped with common compilers are quite good for general-purpose allocation (i.e. for allocating objects of random sizes). However, when you need to allocate objects of a constant size, you should consider using a custom pool allocator; it is easy to see why a constant-size allocator can perform faster than the general-purpose one.
You could write your own pool allocator; however, it's not an easy task, and you'd be better off using an existing one, such as Boost's pool_allocator or fast_pool_allocator.
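For illustration only, here is roughly what such a fixed-size free-list pool looks like (slot and block sizes are arbitrary); in practice, prefer the Boost allocators above:

#include <cstddef>
#include <vector>

class NodePool {
    union Slot {
        Slot* next;  // valid while the slot is on the free list
        alignas(std::max_align_t) char storage[64];  // room for one node
    };
    std::vector<Slot*> blocks_;  // owned chunks, released in the destructor
    Slot* free_ = nullptr;
    static const std::size_t kBlockSlots = 256;

    void grow() {  // growth policy could double, as the question suggests
        Slot* block = new Slot[kBlockSlots];
        blocks_.push_back(block);
        for (std::size_t i = 0; i < kBlockSlots; ++i) {  // thread the free list
            block[i].next = free_;
            free_ = &block[i];
        }
    }

public:
    void* allocate() {
        if (!free_) grow();
        Slot* s = free_;
        free_ = s->next;
        return s;
    }
    void deallocate(void* p) {  // O(1): push the slot back onto the free list
        Slot* s = static_cast<Slot*>(p);
        s->next = free_;
        free_ = s;
    }
    ~NodePool() { for (Slot* b : blocks_) delete[] b; }
};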

Why would I prefer using vector to deque

Since:
- they are both contiguous memory containers;
- feature-wise, deque has almost everything vector has, but more, since it is more efficient to insert at the front;
why would anyone prefer std::vector to std::deque?
Elements in a deque are not contiguous in memory; vector elements are guaranteed to be. So if you need to interact with a plain C library that needs contiguous arrays, or if you care (a lot) about spatial locality, then you might prefer vector. In addition, since there is some extra bookkeeping, other ops are probably (slightly) more expensive than their equivalent vector operations. On the other hand, using many/large instances of vector may lead to unnecessary heap fragmentation (slowing down calls to new).
Also, as pointed out elsewhere on StackOverflow, there is more good discussion here: http://www.gotw.ca/gotw/054.htm
To know the difference one should know how deque is generally implemented. Memory is allocated in blocks of equal sizes, and they are chained together (as an array or possibly a vector).
So to find the nth element, you find the appropriate block then access the element within it. This is constant time, because it is always exactly 2 lookups, but that is still more than the vector.
vector also works well with APIs that want a contiguous buffer because they are either C APIs or are more versatile in being able to take a pointer and a length. (Thus you can have a vector underneath or a regular array and call the API from your memory block).
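For example (fwrite standing in for any C API that takes a pointer and a length):

#include <cstdio>
#include <vector>

int main() {
    std::vector<double> samples(1024, 0.0);

    // Legal because vector storage is contiguous; a deque offers no
    // equivalent single (pointer, length) view of its elements.
    std::FILE* f = std::fopen("samples.bin", "wb");
    if (f) {
        std::fwrite(samples.data(), sizeof(double), samples.size(), f);
        std::fclose(f);
    }
}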
Where deque has its biggest advantages is:
- when growing or shrinking the collection from either end;
- when you are dealing with very large collection sizes;
- when dealing with bools and you really want bools rather than a bitset.
The second of these is lesser known, but for very large collection sizes:
- the cost of reallocation is large;
- the overhead of having to find a contiguous memory block is restrictive, so you can run out of memory faster.
When I was dealing with large collections in the past and moved from a contiguous model to a block model, we were able to store about 5 times as large a collection before we ran out of memory in a 32-bit system. This is partly because, when re-allocating, it actually needed to store the old block as well as the new one before it copied the elements over.
Having said all this, you can get into trouble with std::deque on systems that use "optimistic" memory allocation. While an attempt to request one large buffer for a vector reallocation will probably be rejected at some point with a bad_alloc, the optimistic nature of the allocator is likely to keep granting requests for the smaller buffers a deque asks for, and that is likely to cause the operating system to kill a process in order to acquire some memory. Whichever process it picks might not be too pleasant.
The workarounds in such a case are either setting system-level flags to override optimistic allocation (not always feasible) or managing the memory somewhat more manually, e.g. using your own allocator that checks for memory usage or similar. Obviously not ideal. (Which may answer your question as to prefer vector...)
I've implemented both vector and deque multiple times. deque is hugely more complicated from an implementation point of view. This complication translates to more code and more complex code. So you'll typically see a code size hit when you choose deque over vector. You may also experience a small speed hit if your code uses only the things the vector excels at (i.e. push_back).
If you need a double ended queue, deque is the clear winner. But if you're doing most of your inserts and erases at the back, vector is going to be the clear winner. When you're unsure, declare your container with a typedef (so it is easy to switch back and forth), and measure.
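That switch-and-measure setup can be as simple as this (illustrative):

#include <deque>
#include <vector>

// Flip the typedef and rebuild to benchmark one container against the
// other; no other code has to change.
typedef std::deque<int> Container;
// typedef std::vector<int> Container;

int main() {
    Container c;
    for (int i = 0; i < 1000000; ++i) c.push_back(i);
    while (!c.empty()) c.pop_back();
}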
std::deque doesn't have guaranteed continuous memory - and it's often somewhat slower for indexed access. A deque is typically implemented as a "list of vector".
According to http://www.cplusplus.com/reference/stl/deque/, "unlike vectors, deques are not guaranteed to have all its elements in contiguous storage locations, eliminating thus the possibility of safe access through pointer arithmetics."
Deques are a bit more complicated, in part because they don't necessarily have a contiguous memory layout. If you need that feature, you should not use a deque.
(Previously, my answer brought up a lack of standardization (from the same source as above, "deques may be implemented by specific libraries in different ways"), but that actually applies to just about any standard library data type.)
A deque is a sequence container that allows random access to its elements, but it is not guaranteed to have contiguous storage.
I think it is a good idea to run a performance test for each case and make the decision based on those tests.
I'd prefer std::deque over std::vector in most cases.
You wouldn't prefer vector to deque according to these test results (with source).
Of course, you should test in your app/environment, but in summary:
- push_back is basically the same for all three;
- insert and erase in a deque are much faster than in a list, and marginally faster than in a vector.
Some more musings, and a note to consider circular_buffer.
On the one hand, vector is quite frequently just plain faster than deque. If you don't actually need all of the features of deque, use a vector.
On the other hand, sometimes you do need features which vector does not give you, in which case you must use a deque. For example, I challenge anyone to attempt to rewrite this code, without using a deque, and without enormously altering the algorithm.
Note that vector memory is reallocated as the array grows; if you have pointers to vector elements, they will become invalid. Also, if you erase an element, iterators from the point of erasure onward become invalid (including the iterators a range-based for(auto ...) loop uses under the hood).
Edit: changed 'deque' to 'vector'
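A tiny demonstration of the invalidation hazard (sketch):

#include <vector>

int main() {
    std::vector<int> v;
    v.push_back(1);
    int* p = &v[0];  // pointer into the vector's current storage

    v.push_back(2);  // may reallocate and move every element
    // *p is now dangling if a reallocation occurred; re-fetch after growth:
    p = &v[0];
    return *p;
}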

std::map allocation node packing?

I have noticed that the std::map implementation of Visual Studio (2010) allocates a single new block of memory for each node in its red-black tree. That is, for each element in the map, a single new block of raw memory is allocated via operator new/malloc under the default allocation scheme of the Visual Studio STL's std::map.
This appears a bit wasteful to me: Wouldn't it make more sense to allocate the nodes in blocks of "(small) n", just as std::vector implementations over-allocate on growth?
So I'd like the following points clarified:
Is my assertion about the default allocation scheme actually correct?
Do "all" STL implementations of std::map work this way?
Is there anything in the std preventing a std::map implementation from putting its nodes into blocks of memory instead of allocating a new block of memory (via its allocator) for each node (complexity guarantees, etc.)?
Note: This is not about premature optimization. If it's about optimization at all, then it's about this: if an app has a problem with (std::)map memory fragmentation, are there alternatives to using a custom allocator backed by a memory pool? This question is not about custom allocators, but about how the map implementation uses its allocator. (Or so I hope it is.)
Your assertion is correct for most implementations of std::map.
To my knowledge, there is nothing in the standard preventing a map from using an allocation scheme such as you describe. However, you can get what you describe with a custom allocator — but forcing that scheme on all maps could be wasteful. Because map has no a priori knowledge of how it will be used, certain use patterns could prevent deallocations of mostly-unused blocks. For example, say blocks were allocated for 4 nodes at a time, but a particular map is filled with 40 nodes, then 30 nodes erased, leaving a worst case of one node left per block as map cannot invalidate pointers/references/iterators to that last node.
When you insert elements into a map, it's guaranteed that existing iterators won't be invalidated. Therefore, if you insert an element "B" between two nodes A and C that happen to be contiguous and inside the same heap allocated area, you can't shuffle them to make space, and B will have to be put elsewhere. I don't see any particular problem with that, except that managing such complexities will swell the implementation. If you erase elements then iterators can't be invalidated either, which implies any memory allocation has to hang around until all the nodes therein are erased. You'd probably need a freelist within each "swollen node"/vector/whatever-you-want-to-call-it - effectively duplicating at least some of the time-consuming operations that new/delete currently do for you.
I'm quite certain I've never seen an implementation of std::map that attempted to coalesce multiple nodes into a single allocation block. At least right offhand I can't think of a reason it couldn't work, but I think most implementors would see it as unnecessary, and leave optimization of memory allocation to the allocator instead of worrying about it much in the map itself.
Admittedly, most custom allocators are written to deal better with allocating large numbers of small blocks. You could probably render the vast majority of such optimization unnecessary by writing map (and, of course, set, multiset, and multimap) to just use larger allocations instead. OTOH, given that allocators optimized for small-block allocation are easily and widely available, there's probably not a lot of motivation to change the map implementation this way either.
I think the only thing you cannot do is to invalidate iterators, which you might have to do if you have to reallocate your storage. Having said that, I've seen implementations using single sorted array of objects wrapped in the std::map interface. That was done for a certain reason, of course.
Actually, what you can do is simply instantiate your std::map with your custom allocator, which will find memory for new nodes in a special, non-wasteful way.
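For example, plugging the Boost pool allocator mentioned in another answer into std::map's allocator parameter (one option among many, shown as a sketch):

#include <functional>
#include <map>
#include <boost/pool/pool_alloc.hpp>

typedef std::map<int, int, std::less<int>,
                 boost::fast_pool_allocator<std::pair<const int, int> > >
        PooledMap;

int main() {
    PooledMap m;  // nodes now come from a size-segregated pool
    m[1] = 2;     // no per-node malloc once the pool has warmed up
}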
This appears a bit wasteful to me. Wouldn't it make more sense to allocate the nodes in blocks of "(small) n", just as std::vector implementations over-allocate on growth
Interestingly, I see it in a completely different way. I find it appropriate, and it doesn't waste any memory; at least with the default STL allocators on Windows (MS VS 2008), HP-UX (gcc with STLport) and Linux (gcc without STLport). What is important is that these allocators do care about memory fragmentation, and it seems they handle it pretty well. For example, look at the Low-fragmentation Heap on Windows or the SBA (small block allocator) on HP-UX. I mean that frequently allocating and deallocating memory for only one node at a time doesn't have to result in memory fragmentation. I tested std::map myself in one of my programs, and it indeed didn't cause any memory fragmentation with these allocators.
Is my assertion about the default allocation scheme actually correct?
I have MS Visual Studio 2008 and its std::map behaves the same way. On HP-UX I use gcc both with and without STLport, and it seems their STL maps take the same approach to allocating memory for nodes.
Is there anything in the std preventing a std::map implementation from putting its nodes into blocks of memory instead of allocating a new block of memory (via its allocator) for each node?
Start with tuning the default allocator on your platform, if that is possible. It is useful here to quote Douglas Lea, the author of dlmalloc:
... first I wrote a number of special-purpose allocators in C++, normally by overloading operator new for various classes. ...
However, I soon realized that building a special allocator for each new class that tended to be dynamically allocated and heavily used was not a good strategy when building kinds of general-purpose programming support classes I was writing at the time. (From 1986 to 1991, I was the primary author of libg++, the GNU C++ library.) A broader solution was needed -- to write an allocator that was good enough under normal C++ and C loads so that programmers would not be tempted to write special-purpose allocators except under very special conditions.
Or, as a slightly more involved idea, you can try testing your application with the Hoard allocator. I mean, just test your application and see whether there is any benefit in terms of performance or fragmentation.

Which STL container should I use for a FIFO?

Which STL container would fit my needs best? I basically have a container 10 elements wide, in which I continually push_back new elements while pop_front-ing the oldest element (about a million times).
I am currently using a std::deque for the task, but was wondering if a std::list would be more efficient, since it wouldn't need to reallocate itself (or maybe I'm mistaking a std::deque for a std::vector?). Or is there an even more efficient container for my need?
P.S. I don't need random access
Since there are a myriad of answers, you might be confused, but to summarize:
Use a std::queue. The reason for this is simple: it is a FIFO structure. You want FIFO, you use a std::queue.
It makes your intent clear to anybody else, and even to yourself. A std::list or std::deque does not. A list can insert and remove anywhere, which is not what a FIFO structure is supposed to do, and a deque can add and remove from either end, which is also something a FIFO structure should not do.
This is why you should use a queue.
Now, you asked about performance. Firstly, always remember this important rule of thumb: Good code first, performance last.
The reason for this is simple: people who strive for performance before cleanliness and elegance almost always finish last. Their code becomes a slop of mush, because they've abandoned all that is good in order to really get nothing out of it.
By writing good, readable code first, most of your performance problems will solve themselves. And if you later find that your performance is lacking, it's now easy to add a profiler to your nice, clean code and find out where the problem is.
That all said, std::queue is only an adapter. It provides the safe interface, but uses a different container on the inside. You can choose this underlying container, and this allows a good deal of flexibility.
So, which underlying container should you use? We know that std::list and std::deque both provide the necessary functions (push_back(), pop_front(), and front()), so how do we decide?
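Before weighing them, here is what that choice looks like in code (a minimal sketch):

#include <deque>
#include <list>
#include <queue>

int main() {
    std::queue<int> a;                   // defaults to std::queue<int, std::deque<int>>
    std::queue<int, std::list<int> > b;  // same interface, list underneath

    a.push(1);   // forwards to the underlying container's push_back
    a.front();   // underlying front
    a.pop();     // underlying pop_front
}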
First, understand that allocating (and deallocating) memory is not a quick thing to do, generally, because it involves going out to the OS and asking it to do something. A list has to allocate memory every single time something is added, and deallocate it when it goes away.
A deque, on the other hand, allocates in chunks. It will allocate less often than a list. Think of it as a list, but each memory chunk can hold multiple nodes. (Of course, I'd suggest that you really learn how it works.)
So, with that alone a deque should perform better, because it doesn't deal with memory as often. Mixed with the fact you're handling data of constant size, it probably won't have to allocate after the first pass through the data, whereas a list will be constantly allocating and deallocating.
A second thing to understand is cache performance. Going out to RAM is slow, so when the CPU really needs to, it makes the best out of this time by taking a chunk of memory back with it, into cache. Because a deque allocates in memory chunks, it's likely that accessing an element in this container will cause the CPU to bring back the rest of the container as well. Now any further accesses to the deque will be speedy, because the data is in cache.
This is unlike a list, where the data is allocated one at a time. This means that data could be spread out all over the place in memory, and cache performance will be bad.
So, considering that, a deque should be a better choice. This is why it is the default container when using a queue. That all said, this is still only a (very) educated guess: you'll have to profile this code, using a deque in one test and list in the other to really know for certain.
But remember: get the code working with a clean interface, then worry about performance.
John raises the concern that wrapping a list or deque will cause a performance decrease. Once again, neither he nor I can say for certain without profiling it ourselves, but chances are that the compiler will inline the calls that the queue makes. That is, when you say queue.push(), it will really just be queue.container.push_back(), skipping the function call completely.
Once again, this is only an educated guess, but using a queue will not degrade performance, when compared to using the underlying container raw. Like I've said before, use the queue, because it's clean, easy to use, and safe, and if it really becomes a problem profile and test.
Check out std::queue. It wraps an underlying container type, and the default container is std::deque.
Where performance really matters, check out the Boost circular buffer library.
I continually push_back new elements while pop_front-ing the oldest element (about a million times).
A million is really not a big number in computing. As others have suggested, use a std::queue as your first solution. In the unlikely event of it being too slow, identify the bottleneck using a profiler (do not guess!) and re-implement using a different container with the same interface.
Why not std::queue? All it gives you is push and pop, which simply forward to the underlying container's push_back and pop_front.
A queue is probably a simpler interface than a deque but for such a small list, the difference in performance is probably negligible.
Same goes for list. It's just down to a choice of what API you want.
Use a std::queue, but be cognizant of the performance tradeoffs of the two standard Container classes.
By default, std::queue is an adapter on top of std::deque. Typically, that'll give good performance where you have a small number of queues containing a large number of entries, which is arguably the common case.
However, don't be blind to the implementation of std::deque. Specifically:
"...deques typically have large minimal memory cost; a deque holding just one element has to allocate its full internal array (e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size or 4096 bytes, whichever is larger, on 64-bit libc++)."
To net that out, presuming that a queue entry is something that you'd want to queue, i.e., reasonably small in size, then if you have 4 queues, each containing 30,000 entries, the std::deque implementation will be the option of choice. Conversely, if you have 30,000 queues, each containing 4 entries, then more than likely the std::list implementation will be optimal, as you'll never amortize the std::deque overhead in that scenario.
You'll read a lot of opinions about how cache is king, how Stroustrup hates linked lists, etc., and all of that is true, under certain conditions. Just don't accept it on blind faith, because in our second scenario there, it's quite unlikely that the default std::deque implementation will perform well. Evaluate your usage and measure.
This case is simple enough that you can just write your own. Here is something that works well in microcontroller situations where STL use takes too much space. It is a nice way to pass data and signals from an interrupt handler to your main loop.
// FIFO with a circular buffer (single producer / single consumer)
#include <cstdint>

#define fifo_size 4

class Fifo {
    uint8_t buff[fifo_size];
    int writePtr = 0;
    int readPtr = 0;
public:
    // Returns false (instead of silently overwriting unread data) when full.
    bool put(uint8_t val) {
        if (count() >= fifo_size) return false;
        buff[writePtr % fifo_size] = val;
        writePtr++;
        return true;
    }
    // Returns false when empty; the value comes back through 'val', so 0 is
    // a valid payload (the original returned NULL, which is indistinguishable
    // from a stored 0).
    bool get(uint8_t& val) {
        if (readPtr >= writePtr) return false;
        val = buff[readPtr % fifo_size];
        readPtr++;
        // Rebase both pointers to avoid eventual integer overflow; subtracting
        // (rather than taking the modulo) preserves their difference exactly.
        if (readPtr >= fifo_size) {
            writePtr -= fifo_size;
            readPtr -= fifo_size;
        }
        return true;
    }
    int count() const { return writePtr - readPtr; }
};
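For completeness, a possible usage sketch for the class above; in a real interrupt context the read/write indices would also need to be volatile or atomic, which is omitted here:

Fifo f;

void on_receive_interrupt(uint8_t byte) {  // producer side
    f.put(byte);  // returns false (drops the byte) if the buffer is full
}

void main_loop() {                         // consumer side
    uint8_t v;
    while (f.get(v)) {
        // handle v
    }
}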