Minimal-latency object pooling technique in a multithreaded application - C++

In our application we have about 30 types of objects that are created repeatedly.
Some of them are long-lived (hours), some are short-lived (milliseconds).
Objects could be created in one thread and destroyed in another.
Does anybody have an idea of what a good pooling technique would be, in the sense of minimal creation/destruction latency, low lock contention, and reasonable memory utilization?
Append 1.
1.1. Object pooling/memory allocation for one type is usually unrelated to any other type (see 1.3 for an exception).
1.2. Memory allocation is performed for only one type (class) at a time, usually for several objects at a time.
1.3. If a type aggregates another type via a pointer (for some reason), these types are allocated together in one contiguous piece of memory.
Append 2.
2.1. Using a collection with serialized access per type is known to perform worse than plain new/delete.
2.2. The application runs on different platforms/compilers and cannot use compiler/platform-specific tricks.
Append 3.
It has become apparent that the fastest (lowest-latency) implementation should organize object pooling as a star-like network of factories, where a central factory is global to the thread-specific factories. Regular object provision/recycling is more efficient in a thread-specific factory, while the central factory can be used to balance objects between threads.
3.1. What is the most effective way to organize communication between the central factory and the thread-specific factories?

I assume you have profiled and measured your code and verified that create/destroy is actually causing an issue; if not, that is what you should do first.
If you still want to do object pooling, as a first step you should ensure your objects are stateless, since that is a prerequisite for reusing an object. Similarly, you should ensure that the object and its members have no issue with being used from a thread other than the one that created them (COM STA objects, window handles, etc.).
If you use Windows and COM, one way to get system-provided pooling is to write free-threaded objects and enable object pooling, which lets the COM+ runtime (formerly known as MTS) do this for you. On another platform, such as Java, you could perhaps use application servers that define interfaces your objects should implement, and the application server could do the pooling for you.
Or you could roll your own code. But first you should check whether an established pattern exists for this, and if so, use it instead of what follows below.
If you need to roll your own code, create a dynamically growable collection that tracks the objects already created. Preferably use a vector for the collection, since you will only be adding to it, and it is easy to traverse when searching for a free object (assuming you do not delete objects in the pool). Change the collection type according to your deletion policy (in C++, a vector of pointers/references to objects, so that you can delete and recreate an object at the same location).
Each object should be tracked by a flag that can be read atomically and changed using an interlocked operation to mark it as in use/not in use.
If all objects are in use, you need to create a new object and add it to the collection. Before adding it, you acquire a lock (critical section), mark the new object as in use, and exit the lock.
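A minimal sketch of that collection, assuming C++11 atomics for the per-object flag, a mutex only around pool growth, a default-constructible object type, and a fixed upper bound on pool size; all names here are illustrative, not part of the original answer:

#include <array>
#include <atomic>
#include <cstddef>
#include <memory>
#include <mutex>

template <typename T, std::size_t MaxSize = 1024>
class GrowablePool
{
public:
    // Scan for a free object; claim it with a single atomic operation.
    T* Acquire()
    {
        std::size_t count = _count.load(std::memory_order_acquire);
        for (std::size_t i = 0; i < count; ++i)
        {
            bool expected = false;
            if (_slots[i]->inUse.compare_exchange_strong(expected, true))
                return &_slots[i]->object;
        }
        // All objects are in use: grow the pool under a lock.
        std::lock_guard<std::mutex> guard(_growLock);
        std::size_t i = _count.load(std::memory_order_relaxed);
        if (i == MaxSize)
            return nullptr; // pool exhausted
        _slots[i].reset(new Slot());
        _slots[i]->inUse.store(true, std::memory_order_relaxed);
        _count.store(i + 1, std::memory_order_release); // publish the new slot
        return &_slots[i]->object;
    }

    void Release(T* object)
    {
        std::size_t count = _count.load(std::memory_order_acquire);
        for (std::size_t i = 0; i < count; ++i)
        {
            if (&_slots[i]->object == object)
            {
                _slots[i]->inUse.store(false, std::memory_order_release);
                return;
            }
        }
    }

private:
    struct Slot
    {
        std::atomic<bool> inUse{ false };
        T object;
    };
    // unique_ptr keeps object addresses stable as the pool grows.
    std::array<std::unique_ptr<Slot>, MaxSize> _slots;
    std::atomic<std::size_t> _count{ 0 };
    std::mutex _growLock;
};

Note that a thread may grow the pool even though another thread has just released an object; for a sketch like this, the occasional extra object is an acceptable trade for keeping the fast path lock-free.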
Measure and proceed: if you implement the above collection as a class, you can easily create different collections for different object types, reducing lock contention between threads that do different work.
Finally, you could implement an overloaded class-factory interface that can create all kinds of pooled objects and knows which collection holds which class.
You could then optimize on this design from there.
Hope that helps.

To minimize construct/destruct latency, you need fully constructed objects at hand, so that you eliminate the new/ctor/dtor/delete time. These "free" objects can be kept in a list, so you just pop/push an element at the end.
You can lock the object pools (one for each type) individually. That is a bit more efficient than a system-wide lock, but avoids the overhead of per-object locking.
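A minimal sketch of such a per-type pool, assuming one mutex per pool and a fallback to plain new when the pool is empty (names are illustrative):

#include <mutex>
#include <vector>

// One FreeList instance per object type, so threads working with
// different types never contend on the same lock.
template <typename T>
class FreeList
{
public:
    T* Pop()
    {
        {
            std::lock_guard<std::mutex> guard(_lock);
            if (!_free.empty())
            {
                T* object = _free.back(); // reuse a fully constructed object
                _free.pop_back();
                return object;
            }
        }
        return new T(); // pool empty: construct outside the lock
    }

    void Push(T* object)
    {
        std::lock_guard<std::mutex> guard(_lock);
        _free.push_back(object); // keep it constructed for later reuse
    }

private:
    std::vector<T*> _free;
    std::mutex _lock;
};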

If you haven't looked at tcmalloc, you might want to take a look. Basing your implementation off of its concepts might be a good start. Key points:
Determine a set of size classes. (Each allocation will be fulfilled by using an entry from an equal or greater sized allocation.)
Use one size-class per page. (All instances in a page are the same size.)
Use per-thread freelists to avoid atomic operations on every alloc/dealloc.
When a per-thread freelist is too large, move some of the instances back to the central freelist. Try to move back allocations from the same page.
When a per-thread freelist is empty, take some from the central freelist. Try to take contiguous entries.
Important: You probably know this, but make sure your design will minimize false sharing.
Additional things you can do that tcmalloc can't:
Try to enable locality of reference by using finer-grained allocation pools. For example, if a few thousand objects will be accessed together, then it is best if they are close together in memory (to minimize cache misses and TLB faults). If you allocate these instances from their own threadcache, they should have fairly good locality.
If you know in advance which instances will be long-lived and which will not, then allocate them from separate thread caches. If you do not know, then periodically copy the old instances using a threadcache for allocation and update old references to the new instances.
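A minimal sketch of the per-thread/central freelist split described above, assuming C++11 thread_local storage, a single mutex around the central list, and illustrative batch sizes (none of these names come from tcmalloc itself):

#include <cstddef>
#include <mutex>
#include <vector>

template <typename T>
class ThreadCachedPool
{
public:
    // Fast path: no locks or atomics, only this thread's cache.
    static T* Allocate()
    {
        std::vector<T*>& cache = LocalCache();
        if (cache.empty())
            Refill(cache); // take a batch from the central freelist
        T* object = cache.back();
        cache.pop_back();
        return object;
    }

    static void Deallocate(T* object)
    {
        std::vector<T*>& cache = LocalCache();
        cache.push_back(object);
        if (cache.size() > MaxLocal)
            Flush(cache); // cache too large: give some back
    }

private:
    static const std::size_t Batch = 32;
    static const std::size_t MaxLocal = 256;

    static std::vector<T*>& LocalCache()
    {
        thread_local std::vector<T*> cache;
        return cache;
    }

    static void Refill(std::vector<T*>& cache)
    {
        {
            std::lock_guard<std::mutex> guard(CentralLock());
            std::vector<T*>& central = CentralList();
            while (!central.empty() && cache.size() < Batch)
            {
                cache.push_back(central.back());
                central.pop_back();
            }
        }
        while (cache.size() < Batch) // central list exhausted: allocate fresh
            cache.push_back(new T());
    }

    static void Flush(std::vector<T*>& cache)
    {
        std::lock_guard<std::mutex> guard(CentralLock());
        std::vector<T*>& central = CentralList();
        while (cache.size() > MaxLocal / 2)
        {
            central.push_back(cache.back());
            cache.pop_back();
        }
    }

    static std::vector<T*>& CentralList()
    {
        static std::vector<T*> list;
        return list;
    }
    static std::mutex& CentralLock()
    {
        static std::mutex lock;
        return lock;
    }
};

Unlike tcmalloc, this sketch makes no attempt to keep same-page instances together when moving batches; doing that would require tracking which page each entry belongs to.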

If you have some estimate of the preferred pool size, you can create a fixed-size pool using a stack structure backed by an array (the fastest possible solution). You then need to implement the four phases of object lifetime: hard initialization (with memory allocation), soft initialization, soft cleanup, and hard cleanup (with memory release). In pseudocode:
Object* ObjectPool::AcquireObject()
{
    Object* object = 0;

    lock( _stackLock );
    if( _stackIndex )
        object = _stack[ --_stackIndex ];
    unlock( _stackLock );

    if( !object )
        object = HardInit();
    SoftInit( object );
    return object;
}

void ObjectPool::ReleaseObject( Object* object )
{
    SoftCleanup( object );

    lock( _stackLock );
    if( _stackIndex < _maxSize )
    {
        _stack[ _stackIndex++ ] = object;   // push back into the pool
        unlock( _stackLock );
    }
    else
    {
        unlock( _stackLock );
        HardCleanup( object );              // pool is full, destroy for real
    }
}
The HardInit/HardCleanup methods perform full object initialization and destruction; they are executed only if the pool is empty, or if a freed object cannot fit into the pool because it is full. SoftInit performs soft initialization: it initializes only those aspects of the object that may have changed since it was released. SoftCleanup frees resources used by the object that should be released as fast as possible, or that could become invalid while their owner resides in the pool. As you can see, the locking is minimal: only a couple of lines of code (a few instructions) are executed under the lock.
These four methods can be implemented in separate (template) classes, so you can implement fine-tuned operations per object type or usage. You might also consider using smart pointers to automatically return an object to its pool when it is no longer needed.
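For the smart-pointer idea, a minimal sketch using std::shared_ptr with a custom deleter on top of the ObjectPool above (the helper name is illustrative; the pool must outlive every pointer it hands out):

#include <memory>

// The returned shared_ptr releases the object back to its pool
// instead of destroying it when the last copy goes away.
std::shared_ptr<Object> AcquireShared( ObjectPool& pool )
{
    return std::shared_ptr<Object>(
        pool.AcquireObject(),
        [&pool]( Object* object ) { pool.ReleaseObject( object ); } );
}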

Have you tried the Hoard allocator? It provides better performance than the default allocator on many systems.

Why do you have multiple threads destroying objects they did not create? It's a simple way to handle object lifetime, but the costs can vary widely depending on use.
Anyway, if you haven't started implementing this yet, at the very least you can put the create/destroy functionality behind an interface, so that you can test/change/optimize it at a later date when you have more information about what your system actually does.

Related

Can serialization be replaced by a simple static_cast to retrieve data shared between two processes in C++?

Let's say I have a process written in C++ that has many instances of class A.
class A {
    std::vector<std::vector<short>> tab;
    /* other random data */
};
Once created, these objects should be accessed (read-only) by other processes of a different application, also in C++.
I would like to avoid at all costs making copies or moving the objects, as that would significantly increase memory consumption and probably take more time.
A 'simple' portable solution would be to serialize the objects into shared memory; when asked for the data, the process would just give the location of the various instances of class A in memory, and the second process would deserialize them before being able to read the data.
That means we would create a copy each time a process wants to read the data. This is what I would like to avoid.
Given that both processes are written in C++ and both know the definition of class A, is it possible to avoid the serialization, and therefore the copying/moving of the data? Of course it would not be portable anymore, but it does not need to be.
Can a simple static_cast into the shared memory allow the second process to read the data as its own, without any processing of any kind, therefore costing no time and no additional memory at all?
If not, is there a simpler form of serialization, adding just a small overhead, that would allow the second process to understand and read the data without having to make a copy?

How to manage millions of game objects with a Slot Map / Object Pool pattern in C++?

I'm developing a game server for a video game called Tibia.
Basically, there can be up to millions of objects, with up to thousands of deletions and re-creations as players interact with the game world.
The thing is, the original creators used a Slot Map / Object Pool in which pointers are reused when an object is removed. This is a huge performance boost, since there is little need for memory reallocation.
And of course, I'm trying to accomplish that myself, but I've run into one huge problem with my Slot Map:
Here's a short explanation of how the Slot Map works, according to a source I found online:
The Object class is the base class for every game object; my Slot Map / Object Pool uses this Object class to store every allocated object.
Example:
struct TObjectBlock
{
    Object Object[36768];
};
The way the slot map works is that the server first allocates, say, 36768 objects in a list of TObjectBlocks and gives each Object a unique ObjectID, which can be reused from a free-object list when the server needs to create a new object.
Example:
Object 1 (ID: 555) is deleted; its ID 555 is put in a free-object-ID list. An item creation is then requested; ID 555 is reused, since it is in the free-object list, and there is no need to allocate another TObjectBlock in the array for further objects.
My problem: how can I use "Player", "Creature", "Item", and "Tile" with this Slot Map? I can't seem to come up with a solution to this logic problem.
I am using a base class to manage all objects:
struct Object
{
    uint32_t ObjectID;
    int32_t posx;
    int32_t posy;
    int32_t posz;
};
Then, I'd create the objects themselves:
struct Creature : Object
{
    char Name[31];
};
struct Player : Creature
{
};
struct Item : Object
{
    uint16_t Attack;
};
struct Tile : Object
{
};
But now if I was to make use of the slot map, I'd have to do something like this:
Object allocatedObject;
allocatedObject.ObjectID = CreateObject(); // Get a free object ID to use
if (allocatedObject.ObjectID != INVALIDOBJECT.ObjectID)
{
    Creature* monster = new Creature();
    // This doesn't make much sense, since I'd have this creature pointer floating around!
    monster->ObjectID = allocatedObject.ObjectID;
}
It doesn't make much sense to assign the already-allocated unique object ID to a whole new object pointer.
What are my options with this logic?
I believe you have a lot of tangled concepts here, and you need to detangle them to make this work.
First, you are actually defeating the primary purpose of this model. What you showed smells badly of cargo-cult programming. You should not be newing objects, at least not without overloading operator new, if you are serious about this. You should allocate a single large block of memory for a given object type and draw from that on "allocation", be it from an overloaded new or creation via a memory-manager class. That means you need separate blocks of memory for each object type, not a single "objects" block.
The whole idea is that if you want to avoid allocation-deallocation of actual memory, you need to reuse the memory. To construct an object, you need enough memory to fit it, and your types are not the same length. Only Tile in your example is the same size as Object, so only that could share the same memory (but it shouldn't). None of the other types can be placed in the objects memory because they are longer. You need separate pools for each type.
Second, there should be no bearing of the object ID on how things are stored. There cannot be, once you take the first point into consideration, if the IDs are shared and the memory is not. But it must be pointed out explicitly - the position in a memory block is largely arbitrary and the IDs are not.
Why? Let's say you take object 40, "delete" it, then create a new object 40. Now let's say some buggy part of the program referenced the original ID 40. It goes looking for the original 40, which should error, but instead finds the new 40. You just created an entirely untrackable error. While this can happen with pointers, it is far more likely to happen with IDs, because few systems impose checks on ID usage. A main reason for indirecting access with IDs is to make access safer by making it easy to catch bad usage, so by making IDs reusable, you make them just as unsafe as storing pointers.
The actual model for handling this should look like how the operating system does similar operations (see below the divide for more on that...). That is to say, follow a model like this:
Create some sort of array (like a vector) of the type you want to store - the actual type, not pointers to it. Not Object, which is a generic base, but something like Player.
Size that to the size you expect to need.
Create a stack of size_t (for indexes) and push into it every index in the array. If you created 10 objects, you push 0 1 2 3 4 5 6 7 8 9.
Every time you need an object, pop an index from the stack and use the memory in that cell of the array.
If you run out of indexes, increase the size of the vector and push the newly created indexes.
When you use objects, indirect via the index that was popped.
Essentially, you need a class to manage the memory, as in the sketch below.
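A minimal sketch of such a manager, following the steps above and fixed to one concrete type per pool (all names are illustrative; the type must be default-constructible):

#include <cstddef>
#include <stack>
#include <vector>

// One pool per concrete type (Player, Item, ...), never the generic base.
template <typename T>
class SlotPool
{
public:
    explicit SlotPool(std::size_t initialSize) : objects(initialSize)
    {
        for (std::size_t i = 0; i < initialSize; ++i)
            freeIndexes.push(i);
    }

    // Pop a free index; grow the vector if none are left.
    std::size_t Acquire()
    {
        if (freeIndexes.empty())
        {
            std::size_t oldSize = objects.size();
            objects.resize(oldSize ? oldSize * 2 : 1);
            for (std::size_t i = oldSize; i < objects.size(); ++i)
                freeIndexes.push(i);
        }
        std::size_t index = freeIndexes.top();
        freeIndexes.pop();
        return index;
    }

    void Release(std::size_t index)
    {
        freeIndexes.push(index);
    }

    // All access is indirected through the index, never a raw pointer.
    T& operator[](std::size_t index) { return objects[index]; }

private:
    std::vector<T> objects;              // the actual type, not pointers to it
    std::stack<std::size_t> freeIndexes;
};

Usage would be along the lines of SlotPool<Player> players(1024); std::size_t id = players.Acquire(); players[id].posx = 0;. Note that growing the vector invalidates raw pointers into it, which is one more reason access stays index-based.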
An alternative model would be to push pointers of the matching type directly onto a stack. There are benefits to that, but it is also harder to debug. The primary benefit of that system is that it can easily be integrated into existing systems; however, most compilers' allocators already do something similar...
That said, I suggest against this. It seems like a good idea on paper, and on very limited systems it is, but modern operating systems are not "limited systems" by that definition. Virtual memory already resolves the biggest reason to do this, memory fragmentation (which you did not mention). Many compiler allocators will attempt to more or less do what you are trying to do here in the standard library containers by drawing from memory pools, and those are far more manageable to use.
I once implemented a system just like this, but for many good reasons have ditched it in favor of a collection of unordered maps of pointers. I have plans to replace allocators if I discover performance or memory problems associated with this model. This lets me offset the concern of managing memory until testing/optimization, and doesn't require quirky system design at every level to handle abstraction.
When I say "quirky", believe me when I say that there are many more annoyances with the indirection-pool-stack design than I have listed.

Prototype Pattern

According to Wikipedia, the prototype pattern is:
The prototype pattern is a creational design pattern used in software development when the type of objects to create is determined by a prototypical instance, which is cloned to produce new objects. This pattern is used to:
Avoid subclasses of an object creator in the client application, like the abstract factory pattern does.
Avoid the inherent cost of creating a new object in the standard way (e.g., using the new keyword) when it is prohibitively expensive for a given application.
I have seen several demo codes of this pattern in C++; all of them use the copy constructor.
Can anyone explain how point number two applies (in general, as well as in the context of C++), given that we use the copy constructor anyway in the clone function? If it can be done without the copy constructor, an example code snippet would be great.
You can copy without dynamic allocation. For example, here's a cloning that only happens in a local scope:
Foo prototype;

void local()
{
    Foo x = prototype; // first copy
    x.mutate();
    Foo y = x; // another copy
}
No dynamic allocation is used, ever.
It is true that return new Foo(*this); also makes a copy, but what's more important is that that object is allocated dynamically. That's the cost to which your article is alluding.
In a game I've been making in Java, I ran into an interesting situation that fit the bill of the prototype pattern quite well. You see, I had this Animation object that stored a container of images to flip through, as well as some other data that tracked how long it had been since the last frame was rendered, which frame it was on, whether the animation was running, etc.
I found that having multiple characters use the same Animation object was causing problems. If two characters shared an animation, they would turn the animation on and off at conflicting times for each other. I would have guys standing still with walking animations, or moving with standing animations. Creating the animation objects was costly and time-consuming, what with creating the sprites, setting the amount of time they would display for, creating an interval queue of images, etc.
Instead, I made the Animation object a prototype object. If an Animation clones itself, it shares the original collection of frames with all other animations, since those are immutable but also expensive to construct. The new objects share this immutable base, but have their own information about which frame to draw and when.
Think of it like a projector. When it gets cloned, the new projector might have its own information on whether it's running, which frame it's on, etc., but it may be using the same piece of film as the original projector. The reason they don't trip each other up is that the film is immutable (and expensive to create).
In all honesty, using the prototype in this manner is a great way to implement a flyweight pattern: objects that share objects that are expensive to create. If you "clone" them, they are instantiated with their new transient state, but still share those expensive base objects with their creator.
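Here is a minimal C++ rendering of that idea (the answer's example was Java); clones share the immutable, expensive frame collection but carry their own transient playback state. All names are illustrative:

#include <cstddef>
#include <memory>
#include <vector>

struct Frame { /* image data, display duration, ... */ };

class Animation
{
public:
    // Expensive: builds the full frame set once, for the prototype.
    explicit Animation(std::vector<Frame> frames)
        : frames(std::make_shared<const std::vector<Frame>>(std::move(frames))) {}

    // Cheap: the clone shares the immutable "film", but gets fresh state.
    Animation Clone() const
    {
        Animation copy(*this); // copies the shared_ptr, not the frames
        copy.currentFrame = 0;
        copy.running = false;
        return copy;
    }

    void Start() { running = true; }

private:
    std::shared_ptr<const std::vector<Frame>> frames; // shared, immutable
    std::size_t currentFrame = 0;                     // per-clone state
    bool running = false;                             // per-clone state
};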
Calling the copy constructor for an object that doesn't use dynamic memory internally is much faster than performing any allocation in dynamic memory via new, because dynamic-memory allocation can involve a system call.

Moving an STL object between processes

I know this is strange but I'm just having fun.
I am attempting to transmit a std::map (instantiated using placement new in a fixed region of memory) between two processes via a socket between two machines: Master and Slave. The map I'm using has this typedef:
// A vector of Page objects
typedef
    std::vector<Page*,
                PageTableAllocator<Page*> >
    PageVectorType;

// A mapping of binary 'ip address' to a PageVector
typedef
    std::map<uint32_t,
             PageVectorType*,
             std::less<uint32_t>,
             PageTableAllocator<std::pair<uint32_t, PageVectorType*> > >
    PageTableType;
The PageTableAllocator<T> class is responsible for allocating whatever memory the STL containers may want/need into a fixed location in memory. E.g., all Page objects and STL internal structures are being instantiated in this fixed memory region. This ensures that both the std::map object and the allocator are both placed in a fixed region of memory. I've used GDB to make sure the map and allocator behave correctly (all memory used is in the fixed region, nothing ever goes on the application's normal heap).
Assuming Master starts up and initializes all of its STL structures and the special memory region, the following happens. Slave starts, prints out its version of the page table, then looks for a Master. Slave finds a Master, deletes its own version of the page table, copies Master's version of the page table (and the special memory region), and successfully prints out Master's version of the page table. From what prodding I've done in GDB, I can perform many read-only operations.
When trying to add to the newly copied PageTableType object, Slave faults in the allocator's void construct (pointer p, const T& value) method. The value passed in as p points to an already allocated area of memory (as per Master's version of the std::map).
I don't know anything about C++ object structure, but I'm guessing that object state from Slave's version of the PageTableType must be hanging around even after I replace all of the memory that the PageTableType and its allocator used. My question is whether this is a valid concern. Does C++ maintain some sort of object state outside of the area of memory that the object was instantiated in?
All of the objects used in the map are non-POD. Same is true for the allocator.
To answer your specific question:
Does C++ maintain some sort of object state outside of the area of memory that object was instantiated in?
The answer is no. There are no other data structures set up to "track" objects or anything of the sort. C++ uses an explicit memory allocation model, so if you choose to be responsible for allocation and deallocation, then you have complete control.
I suspect there's something wrong in your code somewhere, but since you believe the code is correct you're inventing some other reason why your code might be failing, and following that path instead. I would pull back, and carefully examine everything about the way your code is working right now, and see if you can determine the problem. Although the STL classes are complex (especially std::map), they're ultimately just code and there is no hidden magic in there.

Boost shared_ptr use_count function

My application problem is the following -
I have a large structure, foo. Because these structures are large, and for memory-management reasons, we do not wish to delete them when processing of the data is complete.
We are storing them in std::vector<boost::shared_ptr<foo>>.
My question is related to knowing when all processing is complete. Our first decision was that we do not want any of the other application code to set a 'complete' flag in the structure, because there are multiple execution paths in the program and we cannot predict which one finishes last.
So in our implementation, once processing is complete, we delete all copies of boost::shared_ptr<foo> except for the one in the vector. This will drop the reference count in the shared_ptr to 1. Is it practical to use shared_ptr.use_count() to check whether it is equal to 1, to know when all other parts of my app are done with the data?
One additional reason I'm asking is that the Boost documentation for shared_ptr recommends against using use_count() in production code.
Edit -
What I did not say is that when we need a new foo, we scan the vector of foo pointers looking for a foo that is not currently in use, and use that foo for the next round of processing. This is why I was thinking that a reference count of 1 would be a safe way to ensure that a particular foo object is no longer in use.
My immediate reaction (and I'll admit, it's no more than that) is that it sounds like you're trying to get the effect of a pool allocator of some sort. You might be better off overloading operator new and operator delete to get the effect you want a bit more directly. With something like that, you can probably just use a shared_ptr like normal, and the other work you want delayed will be handled in operator delete for that class.
That leaves a more basic question: what are you really trying to accomplish with this? From a memory management viewpoint, one common wish is to allocate memory for a large number of objects at once, and after the entire block is empty, release the whole block at once. If you're trying to do something on that order, it's almost certainly easier to accomplish by overloading new and delete than by playing games with shared_ptr's use_count.
Edit: based on your comment, overloading new and delete for class sounds like the right thing to do. If anything, integration into your existing code will probably be easier; in fact, you can often do it completely transparently.
The general idea for the allocator is pretty much the same as you've outlined in your edited question: have a structure (bitmaps and linked lists are both common) to keep track of your free objects. When new needs to allocate an object, it can scan the bit vector or look at the head of the linked list of free objects, and return its address.
This is one case where linked lists can work out quite well: you (usually) don't have to worry about memory usage, because you store the links right in the free objects, and you (virtually) never have to walk the list, because when you need to allocate an object you just grab the first item on the list.
This sort of thing is particularly common with small objects, so you might want to look at the Modern C++ Design chapter on its small object allocator (and an article or two since then by Andrei Alexandrescu about his newer ideas of how to do that sort of thing). There's also the Boost::pool allocator, which is generally at least somewhat similar.
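A minimal sketch of that approach: a class-specific operator new/operator delete backed by an intrusive free list, with the links stored in the freed objects themselves. All names are illustrative, it assumes no derived classes allocate through it, and it is not thread-safe as written (add a mutex or a thread-local list for that):

#include <cstddef>
#include <new>

class foo
{
public:
    // Reuse a node from the free list if available, otherwise fall
    // back to the global heap.
    static void* operator new(std::size_t size)
    {
        if (freeList != nullptr)
        {
            FreeNode* node = freeList;
            freeList = node->next;
            return node;
        }
        return ::operator new(size);
    }

    // Push the memory onto the free list instead of returning it to
    // the heap; any "delayed work" can also be triggered here.
    static void operator delete(void* memory) noexcept
    {
        FreeNode* node = static_cast<FreeNode*>(memory);
        node->next = freeList;
        freeList = node;
    }

private:
    // The link is stored inside the freed object's own memory.
    struct FreeNode { FreeNode* next; };
    static FreeNode* freeList;

    char payload[256]; // stand-in for the large structure's data
};

foo::FreeNode* foo::freeList = nullptr;

With this in place, the rest of the application can keep using shared_ptr<foo> like normal; releasing the last reference simply feeds the memory back into the pool.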
If you want to know whether or not the use count is 1, use the unique() member function.
I would say your application should have some method that eliminates all references to the foo from other parts of the app, and that method should be used instead of checking use_count(). Besides, if use_count() is greater than 1, what would your program do? You shouldn't be relying on shared_ptr's features to eliminate all references; your application architecture should be able to eliminate them. As a final check before removing it from the vector, you could assert(unique()) to verify it really is being released.
I think you can use shared_ptr's custom deleter functionality to call a particular function when the last copy has been released. That way, you're not using use_count at all.
You would need to hold something other than a copy of the shared_ptr in your vector so that the shared_ptr is only tracking the outstanding processing.
Boost has several examples of custom deleters in the shared_ptr docs.
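A minimal sketch of that arrangement, assuming the vector owns the foos outright while consumers receive shared_ptrs whose deleter merely marks the foo as free again (names are illustrative; std::shared_ptr is used here, but the same works with boost::shared_ptr):

#include <atomic>
#include <memory>
#include <vector>

struct foo
{
    std::atomic<bool> inUse{ false };
    // ... the large data ...
};

// The vector owns the foos; the shared_ptrs only track outstanding work.
std::vector<std::unique_ptr<foo>> allFoos;

// Hand out a shared_ptr whose deleter does not destroy the foo, but
// simply marks it as available for the next round of processing.
std::shared_ptr<foo> Checkout(foo& f)
{
    f.inUse.store(true);
    return std::shared_ptr<foo>(&f, [](foo* p) { p->inUse.store(false); });
}

When the last copy of that shared_ptr is released anywhere in the application, the deleter runs, and the scan for a reusable foo simply looks for inUse == false, with no reliance on use_count().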
I would suggest that, instead of trying to use the shared_ptr's use_count to keep track, it might be better to implement your own usage counter. This way you will have full control over it, rather than relying on the shared_ptr's count which, as you rightly note, is not recommended. You can also pre-set your own counter to allow for the number of threads you know will need to act on the data, rather than relying on them all being initialized at the beginning to get their copies of the structure.