Cleaning object pool based on execution time - c++

Problem
My current project implements a pool of objects to avoid constant memory allocation and de-allocation for speed purposes. The object pool is a std::vector<object>, and I would like to implement a form of garbage collection to reduce memory usage and increase performance. Every loop, the program iterates over the entire vector and, if an object is active, executes its update function. This means that if my vector is full of inactive objects, I waste a lot of time iterating over them, as well as memory storing them. I cannot clean the vector every frame, as this would crush performance.
Current attempt
My current attempt has been to measure the update time and use a pre-defined function to determine whether we are spending too much time on the update for the number of objects currently active. If we are, I clean the vector once to allow the speed to return to normal.
#include <chrono>

void updateObjects()
{
    auto begin = std::chrono::high_resolution_clock::now();
    // update all objects (by reference, so we don't update throwaway copies)
    for (auto& o : objectVec)
    {
        // only update active objects
        if (o.m_alive)
        {
            o.update();
        }
    }
    // end time of update
    auto end = std::chrono::high_resolution_clock::now();
    // calculate time taken vs estimated time
    auto elapsed = (end - begin).count();
    // Estimate is based on performance testing
    long estimate = 25 * m_particleCount + 650000;
    // If we have no active objects but are wasting memory storing them,
    // or if the update takes longer than it should, we clean up
    if (((objectCount <= 0) && (objectVec.size() > 0)) || (elapsed > estimate))
    {
        cleanVec(); // remove inactive objects
    }
}
This solution works well on my PC; however, I am having issues with it on other computers, because the time taken for the update to complete varies with CPU speed, so my pre-defined function doesn't work as it is based on incorrect data. I am wondering if there is another measurement I can use for this calculation. Is there a way I can measure the pure number of instructions executed, as this would be the same across computers, with some simply executing them faster? Any other suggestions are welcome, thank you!
Edit
The object pool can be as large as 100,000 objects, typical usage will range from 3000 to the maximum.
My function for cleaning is:
objectVec.erase(std::remove_if(objectVec.begin(),
                               objectVec.end(),
                               [](const object& o) { return !o.m_alive; }),
                objectVec.end());

Typically, an object pool uses aging to determine when to expel an individual pool object.
One way to do that is to keep an iteration counter within your update loop. This counter would be incremented every loop. Each object could keep a "last active" time (the loop count the last time the object was active). Inactive objects would be moved to the end of the array, and when old enough would be expelled (destroyed). A separate index of the last active object would be kept, so looping for updates could stop when this was encountered.
If it isn't possible to store an active time in an object, you can still move the inactive objects to the end of the active list, and if there are too many inactive objects pare the list down some.
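The aging scheme above can be sketched roughly as follows. This is a hedged illustration, not the asker's code: Object, kMaxAge, and the accessors are assumed names; the live region sits at the front of the vector, inactive objects are swapped past m_liveEnd, and the erase pass only ever touches the dead region.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical object type; only the fields the aging scheme needs are shown.
struct Object {
    bool m_alive = false;
    std::uint64_t m_lastActive = 0; // loop count when last active
    void update() { /* per-frame logic would go here */ }
};

class AgingPool {
    std::vector<Object> m_objs;
    std::size_t m_liveEnd = 0;  // one past the last active object
    std::uint64_t m_frame = 0;  // iteration counter
    // Expel objects inactive for more than kMaxAge frames (tune to taste).
    static constexpr std::uint64_t kMaxAge = 5;

public:
    explicit AgingPool(std::size_t capacity) { m_objs.reserve(capacity); }

    Object* spawn() {
        if (m_liveEnd < m_objs.size()) {
            // Revive a slot from the dead region.
            Object& o = m_objs[m_liveEnd++];
            o.m_alive = true;
            o.m_lastActive = m_frame;
            return &o;
        }
        m_objs.emplace_back();
        m_objs.back().m_alive = true;
        m_objs.back().m_lastActive = m_frame;
        m_liveEnd = m_objs.size();
        return &m_objs.back();
    }

    void update() {
        ++m_frame;
        // Loop stops at the live region's end; dead objects are never visited.
        for (std::size_t i = 0; i < m_liveEnd; ) {
            Object& o = m_objs[i];
            if (o.m_alive) {
                o.update();
                o.m_lastActive = m_frame;
                ++i;
            } else {
                // Swap the inactive object past the live region.
                std::swap(m_objs[i], m_objs[--m_liveEnd]);
            }
        }
        // Expel dead objects that have aged out (touches only the dead region).
        m_objs.erase(std::remove_if(m_objs.begin() + m_liveEnd, m_objs.end(),
                                    [this](const Object& o) {
                                        return m_frame - o.m_lastActive > kMaxAge;
                                    }),
                     m_objs.end());
    }

    std::size_t liveCount() const { return m_liveEnd; }
    std::size_t totalCount() const { return m_objs.size(); }
    Object& operator[](std::size_t i) { return m_objs[i]; } // for illustration
};
```

Note that spawn() can invalidate earlier pointers when it grows the vector, which is why real pools usually pre-size the storage up front.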

You should specify how expensive your cleanup is and how large your pool is, since both affect how the cleanup should be implemented.
To make it clear, the pool is effectively an allocator of one type of object and performance gains are completely independent of how individual objects are cleaned up.
"I cannot clean the vector every frame as this would crush performance."
If the cleanup of inactive objects itself dominates performance, there is nothing you can do in your pool algorithm.
Therefore I assume the cost comes from std::vector semantics, where removing inactive objects relocates the remaining objects. It is then logical to ask whether you really need std::vector's properties:
Do you need your objects to be contiguous in memory?
Do you need O(1) random access?
Do you loop over the pool constantly?
Otherwise, is the pool small? As long as it is small, std::vector is fine.
If the answers are no, then use something like std::list, where erasing an element through its iterator is O(1) (std::deque only gives you O(1) removal at the ends), and you can clean up on every frame.
Otherwise, garbage collection of your pool can be done by:
keeping a counter of frames since last update for each object, and removing an object once its counter is over a threshold
keeping a counter of total inactive objects, and cleaning the pool once the percentage of inactive objects is over a threshold
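The second strategy can be sketched like this. It's a minimal illustration, not the asker's code: Object and the 50% threshold are assumed stand-ins, and the point is that the O(n) erase-remove pass runs only when inactive objects dominate the vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal stand-in for the pooled object type.
struct Object {
    bool m_alive = false;
    void update() { /* per-frame logic */ }
};

// Count inactive objects during the normal update pass and compact the
// vector only when they exceed a threshold (50% here, an arbitrary example).
void updateAndMaybeClean(std::vector<Object>& objectVec) {
    std::size_t inactive = 0;
    for (auto& o : objectVec) {
        if (o.m_alive) o.update();
        else ++inactive;
    }
    if (inactive * 2 > objectVec.size()) {
        objectVec.erase(std::remove_if(objectVec.begin(), objectVec.end(),
                                       [](const Object& o) { return !o.m_alive; }),
                        objectVec.end());
    }
}
```

The counting is free, since the update loop already visits every object; only the occasional compaction costs anything.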

Related

How to avoid heap allocation inserting Rendercommands to a RenderCommandBuffer?

I have a RenderQueue that sorts a list of elements to render.
Now that RenderQueue creates a RenderCommandBuffer with all the "low level" rendering operations, the problem is that performance drops from 1400 FPS to 40 FPS for 1000 elements.
I profiled the app and the problem lies here (those per frame allocations):
std::for_each(element.meshCommands.begin(), element.meshCommands.end(), [&](auto& command) {
    std::vector<std::pair<std::string, glm::mat4>> p{ { "MVP", VPmatrix * command.second } };
    m_commandBuffer.addCommand(std::make_shared<SetShaderValuesCommand>(element.material, p));
    m_commandBuffer.addCommand(std::make_shared<BindMaterialCommand>(element.material));
    m_commandBuffer.addCommand(std::make_shared<RenderMeshCommand>(meshProperty.mesh));
});
I know that I can group my meshes by material, but the problem stays more or less the same: allocation of many objects per frame. How would you avoid this situation? How do game engines deal with this problem? Memory pools?
Details are scant, but I see two opportunities for tuning.
m_commandBuffer is a polymorphic container of some kind. I completely understand why you would build it this way, but it presents a problem - each element must be separately allocated.
You may well get much better performance by amalgamating all render operations into a variant, and implement m_commandBuffer as a vector (or queue) of such variants. This allows you to reserve() space for the 1000 commands with 1 memory allocation, rather than the (at least) 1000 you currently require.
It also means that you only incur the cost of one memory fence during the allocation, again rather than the thousands you are suffering while incrementing and decrementing all the reference counts in all those shared_ptrs.
So:
using Command = boost::variant<SetShaderValuesCommand, BindMaterialCommand, RenderMeshCommand>;
using CommandQueue = std::deque<Command>;
Executing the commands then becomes:
for (auto& cmd : m_commandBuffer) {
    boost::apply_visitor([](auto& actualCmd) {
        actualCmd.run(); /* or whatever the interface is */
    }, cmd);
}

What can be done to optimize the amount of time it takes to leave a method and empty out the stack of local variables?

I have a method which is responsible for taking an OpenGL triangle mesh and converting it to a 3ds file. This method is called exportShape(). To perform this conversion, exportShape() creates a bunch of very large vectors and hash_maps. Currently, getting from the last line of exportShape() to the next line of code from where exportShape() was called can take up to 5 minutes. I'm sure that all this time is spent emptying out the very large stack of local variables, because if I move all the local vectors and hash_maps to global scope the method exits instantly, as I would expect.
Why am I able to populate all these local data structures in just a few seconds, whereas popping them off the stack takes minutes? How can I optimize the process of leaving exportShape() and clearing out the stack?
Edit:
The objects which are being deleted contain only strings, doubles and ints - nothing with a custom destructor.
I pretty much solved my own problem. Running in release mode is a huge performance increase (~20x). Nevertheless, the process still hangs for a few seconds. Is there anything else that can be done?
The problem in the first instance is that you're using a debug allocator, which marks freed memory with a bit pattern (e.g. 0xfdfdfdfd) to aid in detecting accesses to freed memory. This obviously takes time as it must iterate over all the freed memory.
To speed things up further, you could use a scoped allocator e.g. the Boost Pool Library; see also Creating a scoped custom memory pool/allocator?
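Along the same lines, C++17's <memory_resource> offers a standard alternative to a Boost pool for this pattern. A sketch, with sumLengths as a toy stand-in for a function like exportShape() that builds big temporary containers: a monotonic_buffer_resource makes deallocate a no-op, so tearing down large local containers costs one bulk release at scope exit instead of millions of individual frees.

```cpp
#include <memory_resource>
#include <string>
#include <vector>

// All pmr containers below draw from `arena`; nothing is freed piecemeal.
int sumLengths(const std::vector<std::string>& input) {
    std::pmr::monotonic_buffer_resource arena;
    std::pmr::vector<std::pmr::string> local(&arena);
    for (const auto& s : input)
        local.emplace_back(s.begin(), s.end()); // strings allocate from the arena
    int total = 0;
    for (const auto& s : local)
        total += static_cast<int>(s.size());
    return total;
    // `local`'s memory is reclaimed here in one shot when `arena` is
    // destroyed; there is no per-element free work.
}
```

The trade-off is that a monotonic resource never reuses memory within its lifetime, which is exactly right for a build-then-discard function like this one.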

profiling and performance issues

After observing some performance issues in my program, I decided to run a profiling session. The results seem to indicate that something like 87% of samples taken were somehow related to my Update() function.
In this function, I am going through a list of A*, where sizeof(A) equals 72, and deleting them after processing.
void Update()
{
    //...
    for (auto i = myList.begin(); i != myList.end(); i++)
    {
        A* pA = *i;
        // Process item before deleting it.
        delete pA;
    }
    myList.clear();
    //...
}
where myList is a std::list<A*>. On average, I am calling this function anywhere from 30 to 60 times per second while the list contains an average of 5 items. That means I'm deleting anywhere from 150 to 300 A objects per second.
Would calling delete this many times be enough to cause a performance issue in most cases? Is there any way to track down exactly where in the function the problem is occurring? Is delete generally considered an expensive operation?
Very difficult to tell, since you brush over what is probably the bulk of the work done in the loop and give no hint as to what A is...
If A is a simple collection of data, particularly primitives then the deletion is almost certainly not the culprit. You can test the theory by splitting your update function in two - update and uninit. Update does all the processing, uninit deletes the object and clears the list.
If only update is slow, then it's the processing. If only uninit is slow, then it's the deletion. If both are slow then memory fragmentation is probably the culprit.
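The suggested split can be sketched as below. This is an illustration, not the asker's code: A's contents are assumed, the processing body is a placeholder, and the point is simply to time the two passes separately.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <list>

// Hypothetical A, standing in for the 72-byte type in the question.
struct A { int data[18]; };

// Time the processing pass ("update") and the deletion pass ("uninit")
// separately to see which one dominates.
void updateSplit(std::list<A*>& myList) {
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    for (A* pA : myList) {
        (void)pA; // process the item here
    }
    auto t1 = clock::now();

    for (A* pA : myList) delete pA;
    myList.clear();
    auto t2 = clock::now();

    auto us = [](clock::duration d) {
        return static_cast<long long>(
            std::chrono::duration_cast<std::chrono::microseconds>(d).count());
    };
    std::printf("process: %lld us, delete: %lld us\n", us(t1 - t0), us(t2 - t1));
}
```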
As others have pointed out in the comments, std::vector may give you a performance increase. But be careful since it may also cause performance problems elsewhere depending on how you build the data structure.
You could have a look at tcmalloc from gperftools (Google Performance Tools). gperftools also contains a profiler (both libraries only need to be linked in, very easy). tcmalloc keeps a memory pool for small objects and re-uses this memory when possible.
The profiler can be used for cpu and heap profiling.
Totally easy to tell what's going on.
Do yourself a favor and use this method.
It's been analyzed to the nth degree, and is very effective.
In a nutshell, if 87% of time is in Update, then if you just stop it a few times with Ctrl-C or whatever, the probability is 87% each time that you will catch it in the act.
You will not only see that it's in Update. You will see where in Update, and what it's doing. If it is in the process of delete, or accessing the data structure, you will see that. You will also see, further down the stack, the reason why that operation takes time.

Thread safe memory pool

My application is currently highly performance-critical and requests 3-5 million objects per frame. Initially, to get the ball rolling, I new'd everything and got the application to work and test my algorithms. The application is multi-threaded.
Once I was happy with the performance, I started to create a memory manager for my objects. The obvious reason is memory fragmentation and wastage. The application could not continue for more than a few frames before crashing due to memory fragmentation. I have checked for memory leaks and know the application is leak free.
So I started creating a simple memory manager using TBB's concurrent_queue. The queue stores a maximum set of elements the application is allowed to use. The class requiring new elements pops elements from the queue. The try_pop method is, according to Intel's documentation, lock-free. This worked quite well as far as memory consumption goes (although there is still memory fragmentation, but not nearly as much as before). The problem I am facing now is that the application's performance has slowed down approximately 4 times according to my own simple profiler (I do not have access to commercial profilers or know of any that will work on a real-time application... any recommendation would be appreciated).
My question is, is there a thread-safe memory pool that is scalable. A must-have feature of the pool is fast recycling of elements and making them available. If there is none, any tips/tricks performance wise?
EDIT: I thought I would explain the problem a bit more. I could easily initialize n arrays, where n is the number of threads, and start using the objects from one array per thread. This will work perfectly for some cases. In my case, I am recycling the elements as well (potentially every frame), and they could be recycled at any point in the array; i.e. it may be elementArray[0] or elementArray[10] or elementArray[1000]. Now I will have a fragmented array consisting of elements that are ready to be used and elements that are in use :(
As said in comments, don't get a thread-safe memory allocator, allocate memory per-thread.
As you implied in your update, you need to manage free/in-use effectively. That is a pretty straightforward problem, given a constant type and no concurrency.
For example (off the top of my head, untested):
template<typename T>
class ThreadStorage
{
    std::vector<T> m_objs;
    std::vector<size_t> m_avail;

public:
    explicit ThreadStorage(size_t count) : m_objs(count, T()) {
        m_avail.reserve(count);
        for (size_t i = 0; i < count; ++i) m_avail.push_back(i);
    }

    T* alloc() {
        T* retval = &m_objs[0] + m_avail.back();
        m_avail.pop_back();
        return retval;
    }

    void free(T* p) {
        *p = T(); // Assuming this is enough destruction.
        m_avail.push_back(p - &m_objs[0]);
    }
};
Then, for each thread, have a ThreadStorage instance, and call alloc() and free() as required.
You can add smart pointers to manage calling free() for you, and you can optimise constructor/destructor calling if that's expensive.
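The smart-pointer idea can be sketched with a std::unique_ptr whose deleter returns the object to the pool. This repeats a minimal version of ThreadStorage so the example is self-contained; PoolDeleter, PoolPtr, makePooled, and available() are assumed names, not part of the answer's original code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

template <typename T>
class ThreadStorage {
    std::vector<T> m_objs;
    std::vector<std::size_t> m_avail;
public:
    explicit ThreadStorage(std::size_t count) : m_objs(count, T()) {
        m_avail.reserve(count);
        for (std::size_t i = 0; i < count; ++i) m_avail.push_back(i);
    }
    T* alloc() {
        assert(!m_avail.empty() && "pool exhausted");
        T* p = m_objs.data() + m_avail.back();
        m_avail.pop_back();
        return p;
    }
    void free(T* p) {
        *p = T(); // assuming this is enough destruction
        m_avail.push_back(static_cast<std::size_t>(p - m_objs.data()));
    }
    std::size_t available() const { return m_avail.size(); }
};

// Deleter that returns the object to its pool instead of calling delete.
template <typename T>
struct PoolDeleter {
    ThreadStorage<T>* pool;
    void operator()(T* p) const { pool->free(p); }
};

template <typename T>
using PoolPtr = std::unique_ptr<T, PoolDeleter<T>>;

template <typename T>
PoolPtr<T> makePooled(ThreadStorage<T>& pool) {
    return PoolPtr<T>(pool.alloc(), PoolDeleter<T>{&pool});
}
```

With this, `{ auto p = makePooled(pool); *p = 42; }` returns the object to the pool automatically when p goes out of scope.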
You can also look at boost::pool.
Update:
The new requirement for keeping track of things that have been used, so that they can be processed in a second pass, seems a bit unclear to me. I think you mean that when the primary processing is finished on an object, you need to not release it, but keep a reference to it for second-stage processing. Other objects will just be released back to the pool and not used for second-stage processing.
I assume you want to do this in the same thread.
As a first pass, you could add a method like this to ThreadStorage, and call it when you want to do processing on all unreleased instances of T. No extra bookkeeping required.
void do_processing(boost::function<void (T* p)> const& f) {
    std::sort(m_avail.begin(), m_avail.end());
    size_t o = 0;
    for (size_t i = 0; i != m_avail.size(); ++i) {
        if (o < m_avail[i]) {
            do {
                f(&m_objs[o]);
            } while (++o < m_avail[i]);
            ++o;
        } else if (o == m_avail[i])
            ++o;
    }
    for (; o < m_objs.size(); ++o) f(&m_objs[o]);
}
Assumes no other thread is using the ThreadStorage instance, which is reasonable because it is thread-local by design. Again, off the top of my head, untested.
Google's TCMalloc:
TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.
Performance:
TCMalloc is faster than the glibc 2.3 malloc... ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair...
You may want to have a look at jemalloc.

How to store and push simulation state while minimally affecting updates per second?

My app is comprised of two threads:
GUI Thread (using Qt)
Simulation Thread
My reason for using two threads is to keep the GUI responsive, while letting the Sim thread spin as fast as possible.
In my GUI thread I'm rendering the entities in the sim at an FPS of 30-60; however, I want my sim to "crunch ahead" - so to speak - and queue up game state to be drawn eventually (think streaming video, you've got a buffer).
Now for each frame of the sim I render I need the corresponding simulation "State". So my sim thread looks something like:
while (1) {
    simulation.update();
    SimState* s = new SimState;
    simulation.getAgents(s->agents); // store agents
    // store other things to SimState here...
    stateStore.enqueue(s); // stateStore is a QQueue<SimState*>
    if (/* some threshold reached */) {
        // push stateStore
    }
}
SimState looks like:
struct SimState {
    std::vector<Agent> agents;
    // other stuff here
};
And Simulation::getAgents looks like:
void Simulation::getAgents(std::vector<Agent>& a) const
{
    // mAgents is a std::vector<Agent>
    std::vector<Agent> a_tmp(mAgents);
    a.swap(a_tmp);
}
The Agents themselves are somewhat complex classes. The members are a bunch of ints and floats and two std::vector<float>s.
With this current setup the sim can't crunch much faster than the GUI thread is drawing. I've verified that the current bottleneck is simulation.getAgents( s->agents ), because even if I leave out the push, the updates per second are slow. If I comment out that line I see several orders of magnitude improvement in updates per second.
So, what sorts of containers should I be using to store the simulation's state? I know there is a bunch of copying going on at the moment, but some of it is unavoidable. Should I store Agent* in the vector instead of Agent?
Note: In reality the simulation isn't in a loop, but uses Qt's QMetaObject::invokeMethod(this, "doSimUpdate", Qt::QueuedConnection); so I can use signals/slots to communicate between the threads; however, I've verified a simpler version using while(1){} and the issue persists.
Try re-using your SimState objects (using some kind of pool mechanism) instead of allocating them every time. After a few simulation loops, the re-used SimState objects will have vectors that have grown to the needed size, thus avoiding reallocation and saving time.
An easy way to implement a pool is to initially push a bunch of pre-allocated SimState objects onto a std::stack<SimState*>. Note that a stack is preferable to a queue, because you want to take the SimState object that is more likely to be "hot" in the cache memory (the most recently used SimState object will be at the top of the stack). Your simulation queue pops SimState objects off the stack and populates them with the computed SimState. These computed SimState objects are then pushed into a producer/consumer queue to feed the GUI thread. After being rendered by the GUI thread, they are pushed back onto the SimState stack (i.e. the "pool"). Try to avoid needless copying of SimState objects while doing all this. Work directly with the SimState object in each stage of your "pipeline".
Of course, you'll have to use the proper synchronization mechanisms in your SimState stack and queue to avoid race conditions. Qt might already have thread-safe stacks/queues. A lock-free stack/queue might speed things up if there is a lot of contention (Intel Thread Building Blocks provides such lock-free queues). Seeing that it takes on the order of 1/50 seconds to compute a SimState, I doubt that contention will be a problem.
If your SimState pool becomes depleted, then it means that your simulation thread is too "far ahead" and can afford to wait for some SimState objects to be returned to the pool. The simulation thread should block (using a condition variable) until a SimState object becomes available again in the pool. The size of your SimState pool corresponds to how much SimState can be buffered (e.g. a pool of ~50 objects gives you a crunch-ahead time of up to ~1 seconds).
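The stack-based pool with a blocking acquire can be sketched as follows. This is an illustration under assumptions, not the asker's code: SimState is simplified, SimStatePool/acquire/release are assumed names, and standard mutex/condition_variable primitives stand in for whatever Qt or TBB equivalents you prefer.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <stack>
#include <vector>

struct SimState {
    std::vector<int> agents; // simplified stand-in for the real fields
};

// acquire() waits on a condition variable while the pool is depleted;
// release() returns a state and wakes one waiter. The LIFO stack keeps
// the most recently used (cache-hot) SimState on top.
class SimStatePool {
    std::stack<SimState*> m_pool;
    std::mutex m_mtx;
    std::condition_variable m_cv;
public:
    explicit SimStatePool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) m_pool.push(new SimState);
    }
    ~SimStatePool() {
        while (!m_pool.empty()) { delete m_pool.top(); m_pool.pop(); }
    }
    SimState* acquire() {
        std::unique_lock<std::mutex> lk(m_mtx);
        m_cv.wait(lk, [this] { return !m_pool.empty(); }); // block while depleted
        SimState* s = m_pool.top();
        m_pool.pop();
        return s;
    }
    void release(SimState* s) {
        {
            std::lock_guard<std::mutex> lk(m_mtx);
            m_pool.push(s);
        }
        m_cv.notify_one();
    }
};
```

The simulation thread would acquire(), fill the state, and hand it to the GUI queue; the GUI thread calls release() after rendering.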
You can also try running parallel simulation threads to take advantage of multi-core processors. The Thread Pool pattern can be useful here. However, care must be taken that the computed SimStates are enqueued in the proper order. A thread-safe priority queue ordered by time-stamp might work here.
The pipeline I'm suggesting looks like this: SimState pool (stack) -> simulation thread -> producer/consumer queue -> GUI thread -> back to the pool. (NOTE: The pool and queue hold SimState by pointer, not by value!)
Hope this helps.
If you plan to re-use your SimState objects, then your Simulation::getAgents method will be inefficient. This is because the vector<Agent>& a parameter is likely to already have enough capacity to hold the agent list.
The way you're doing it now would throw away this already allocated vector and create a new one from scratch.
IMO, your getAgents should be:
void Simulation::getAgents(std::vector<Agent>& a) const
{
    a = mAgents;
}
Yes, you lose exception safety, but you might gain performance (especially with the reusable SimState approach).
Another idea: you could try making your Agent objects fixed-size by using a C-style array (or boost::array) and a "count" variable instead of std::vector for Agent's float-list members. Simply make the fixed-size array big enough for any situation in your simulation. Yes, you'll waste space, but you might gain a lot of speed.
You can then pool your Agents using a fixed-size object allocator (such as boost::pool) and pass them around by pointer (or shared_ptr). That'll eliminate a lot of heap allocation and copying.
You can use this idea alone or in combination with the above ideas. This idea seems easier to implement than the pipeline thing above, so you might want to try it first.
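The fixed-size layout could look roughly like this (std::array standing in for the C-style array / boost::array suggestion). The field names and kMaxFloats bound are assumptions for illustration, not from the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumed worst-case bound on the float lists; size it for your simulation.
constexpr std::size_t kMaxFloats = 32;

// Fixed-capacity replacement for std::vector<float>: an array plus a count,
// so an Agent never touches the heap and copies are plain memberwise copies.
struct FixedFloatList {
    std::array<float, kMaxFloats> data{};
    std::size_t count = 0;
    void push(float v) { data[count++] = v; } // caller keeps count < kMaxFloats
};

struct Agent {
    int id = 0;
    float x = 0.f, y = 0.f;
    FixedFloatList samplesA; // stand-ins for the question's two float lists
    FixedFloatList samplesB;
};
```

Copying such an Agent (e.g. in getAgents) is now a flat memory copy with no allocator traffic, which is exactly the speed/space trade the answer describes.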
Yet another idea: instead of a thread pool for running simulation loops, you can break the simulation down into several stages and execute each stage in its own thread. Producer/consumer queues are used to exchange SimState objects between stages. For this to be effective, the different stages need to have roughly similar CPU workloads (otherwise, one stage becomes the bottleneck). This is a different way to exploit parallelism.