I have a small class Entity with some int fields and a field that is a two-dimensional array of 50 ints. Nothing special.
I generate a lot of such entities (millions), and each entity is different: the array differs and the fields differ.
To my surprise, I found that it is > 2x faster not to create a new entity each time, but to reuse an existing one and just set its
fields and array to 0. Is memory initialization/deletion really that time-consuming?
There is overhead associated with the memory management of objects. This can result in slowdowns.
The best way to know is to time it, as you have done.
Sometimes it won't bother you, other times, you will be very sensitive to it.
Think about which loop would be faster:
while (/* not done */) {
Ask system for memory
Create object
Write into object
Destroy object
}
or
while (/* not done */) {
Write into object
}
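To make the comparison concrete, here is a rough micro-benchmark sketch; the Entity layout is only an assumption based on the question's description, and real numbers should come from timing your actual class:
#include <chrono>
#include <cstring>
#include <iostream>

// Hypothetical layout, roughly as described in the question.
struct Entity {
    int id;
    int score;
    int grid[5][10];   // the "two dimensional array of 50 ints"
};

int main() {
    const int N = 1000000;
    using clock = std::chrono::steady_clock;
    long long checksum = 0;

    // Variant 1: allocate and destroy an Entity every iteration.
    clock::time_point t0 = clock::now();
    for (int i = 0; i < N; ++i) {
        Entity* e = new Entity();   // value-initialized: fields and array start at 0
        e->id = i;
        e->grid[i % 5][i % 10] = i;
        checksum += e->grid[0][0];
        delete e;
    }
    clock::time_point t1 = clock::now();

    // Variant 2: reuse one Entity and just reset its fields and array to 0.
    Entity e;
    for (int i = 0; i < N; ++i) {
        std::memset(&e, 0, sizeof e);
        e.id = i;
        e.grid[i % 5][i % 10] = i;
        checksum += e.grid[0][0];
    }
    clock::time_point t2 = clock::now();

    std::cout << "new/delete: " << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms\n";
    std::cout << "reuse:      " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
    std::cout << "checksum:   " << checksum << "\n";   // keep the work from being optimized away
}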
I have a class that implements two simple, pre-sized stacks; those are stored as members of the class, as vectors pre-sized by the constructor. They are small, cache-line-size-friendly objects.
Those two stacks are constant in size, persisted and updated lazily, and are often accessed together by some computationally cheap methods that, however, can be called a large number of times (tens to hundreds of thousands of times per second).
All objects are already in good state (the code is clean and does what it's supposed to do), and all sizes are kept under control (64k to 128k in most cases for the whole chain of ops including results; rarely do they get close to 256k, so at worst an L2 look-up and often L1).
Some auto-vectorization comes into play, but other than that it's single-threaded code throughout.
The class, minus some minor things and padding, looks like this:
class Curve{
private:
std::vector<ControlPoint> m_controls;
std::vector<Segment> m_segments;
unsigned int m_cvCount;
unsigned int m_sgCount;
std::vector<unsigned int> m_sgSampleCount;
unsigned int m_maxIter;
unsigned int m_iterSamples;
float m_lengthTolerance;
float m_length;
};
Curve::Curve(){
m_controls = std::vector<ControlPoint>(CONTROL_CAP);
m_segments = std::vector<Segment>( (CONTROL_CAP-3) );
m_cvCount = 0;
m_sgCount = 0;
m_sgSampleCount = std::vector<unsigned int>(CONTROL_CAP-3);
m_maxIter = 3;
m_iterSamples = 20;
m_lengthTolerance = 0.001;
m_length = 0.0;
}
Curve::~Curve(){}
Bear with the verbosity, please, I'm trying to educate myself and make sure I'm not operating by some half-arsed knowledge:
Given the operations that are run on those and their actual use, performance is largely memory I/O bound.
I have a few questions related to optimal positioning of the data; keep in mind this is on Intel CPUs (Ivy Bridge and a few Haswell) and with GCC 4.4, and I have no other use cases for this:
I'm assuming that if the actual storage of controls and segments is contiguous with the instance of Curve, that's an ideal scenario for the cache (size-wise the lot can easily fit on my target CPUs).
A related assumption is that if the vectors are distant from the Curve instance, and from each other, then as methods alternately access the contents of those two members there will be more frequent eviction and re-population of the L1 cache.
1) Is that correct (is data pulled for the entire stretch of cache size from the address first looked up on a new operation, and not in multiple conveniently sized segments), or am I misunderstanding the caching mechanism, and can the cache pull and preserve multiple smaller stretches of RAM?
2) Following the above: so far, by pure circumstance, all my tests have ended up with the class instance and the vectors contiguous, but I assume that's just dumb luck, however statistically probable. Normally, instancing the class reserves only the space for that object, and the vectors are then allocated in the next free contiguous chunk available, which is not guaranteed to be anywhere near my Curve instance if that previously found a small empty niche in memory.
Is this correct?
3) Assuming 1 and 2 are correct, or close enough functionally speaking, I understand that to guarantee performance I'd have to write an allocator of sorts to make sure the class object itself is large enough at instancing, then copy the vectors' contents in there myself and from then on refer to those.
I can probably hack my way to something like that if it's the only way to work through the problem, but I'd rather not hack it horribly if there are nice/smart ways to go about something like that. Any pointers on best practices and suggested methods would be hugely helpful (beyond "don't use malloc it's not guaranteed contiguous", that one I already have down :) ).
If the Curve instance fits into a cache line and the data of each of the two vectors also fits into a cache line, the situation is not that bad, because you then have four constant cache lines. If every element were accessed indirectly and randomly positioned in memory, every access to an element might cost you a fetch operation, which is avoided in that case. In the case that both Curve and its elements fit into fewer than four cache lines, you would reap benefits from putting them into contiguous storage.
True.
If you used std::array, you would have the guarantee that the elements are embedded in the owning class, and you would not have the dynamic allocation (which in and of itself costs you memory space and bandwidth). You would then even avoid the indirect access that you would still have if you used a special allocator that puts the vector contents in contiguous storage with the Curve instance.
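A minimal sketch of that layout, reusing CONTROL_CAP and the ControlPoint/Segment types from the question (std::array needs C++11/TR1; with an older toolchain, std::tr1::array or a plain C array gives the same embedded layout):
#include <array>

// Sketch only: fixed-capacity storage embedded directly in the object, so the
// control points and segments live contiguously with the Curve instance and
// no separate heap allocation is needed.
class Curve {
private:
    std::array<ControlPoint, CONTROL_CAP>     m_controls;
    std::array<Segment, CONTROL_CAP - 3>      m_segments;
    std::array<unsigned int, CONTROL_CAP - 3> m_sgSampleCount;
    unsigned int m_cvCount;
    unsigned int m_sgCount;
    unsigned int m_maxIter;
    unsigned int m_iterSamples;
    float m_lengthTolerance;
    float m_length;

public:
    Curve()
        // value-initialize the arrays so their elements start zeroed, like the vectors did
        : m_controls(), m_segments(), m_sgSampleCount(),
          m_cvCount(0), m_sgCount(0), m_maxIter(3), m_iterSamples(20),
          m_lengthTolerance(0.001f), m_length(0.0f)
    {
    }
};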
BTW: Short style remark:
Curve::Curve()
{
m_controls = std::vector<ControlPoint>(CONTROL_CAP, ControlPoint());
m_segments = std::vector<Segment>(CONTROL_CAP - 3, Segment());
...
}
...should be written like this:
Curve::Curve():
m_controls(CONTROL_CAP),
m_segments(CONTROL_CAP - 3)
{
...
}
This is called a "member initializer list"; search for that term for further explanations. Also, a default-initialized element, which you provide as the second parameter, is already the default, so there is no need to specify it explicitly.
Note: Although this question doesn't directly correlate to games, I've molded the context around game development in order to better visualize a scenario where this question is relevant.
tl;dr: Is rapidly creating objects of the same class memory intensive, inefficient, or is it common practice?
Say we have a "bullet" class - and an instance of said class is created every time the player 'shoots' - anywhere between 1 and 10 times every second. These instances may be destroyed (obviously) upon collision.
Would this be a terrible idea? Is general OOP okay here (i.e.: class Bullet { short x; short y; }, etc.) or is there a better way to do this? Is new and delete preferred?
Any input much appreciated. Thank you!
This sounds like a good use case for techniques like memory pools or free lists. The idea in both is that you have memory for a certain number of elements pre-allocated. You can override the new operator of your class to use the pool/list, or use placement new to instantiate your class at a retrieved address (a small sketch follows the list below).
The advantages:
no memory fragmentation
pretty quick
The disadvantages:
you must know the maximum number of elements beforehand
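A minimal sketch of such a fixed-capacity pool (the Bullet type and the capacity are placeholders); it hands out slots from a pre-allocated buffer and recycles them through an intrusive free list, so firing a shot never touches the heap:
#include <cstddef>
#include <new>      // placement new

struct Bullet {
    float x = 0, y = 0;
    float vx = 0, vy = 0;
};

// Fixed-capacity pool: all Bullet storage is pre-allocated up front.
class BulletPool {
public:
    BulletPool() {
        // Thread all slots onto the free list.
        for (std::size_t i = 0; i < Capacity; ++i)
            slots_[i].next = (i + 1 < Capacity) ? &slots_[i + 1] : nullptr;
        freeList_ = &slots_[0];
    }

    Bullet* create() {
        if (!freeList_) return nullptr;           // pool exhausted
        Slot* s = freeList_;
        freeList_ = s->next;
        return new (s->storage) Bullet();         // construct in place
    }

    void destroy(Bullet* b) {
        b->~Bullet();                             // run the destructor manually
        Slot* s = reinterpret_cast<Slot*>(b);     // slot and bullet share an address
        s->next = freeList_;                      // push the slot back on the free list
        freeList_ = s;
    }

private:
    static const std::size_t Capacity = 256;
    union Slot {
        Slot* next;                               // valid while the slot is free
        alignas(Bullet) unsigned char storage[sizeof(Bullet)];
    };
    Slot  slots_[Capacity];
    Slot* freeList_ = nullptr;
};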
Don't just constantly create and delete objects. Instead, keep a persistent, resizable array or list of object instances that you can reuse. For example, create an array of 100 bullets; they don't all have to be drawn, and each has a boolean that states whether it is "active" or not.
Then whenever you need a new bullet, "activate" an inactive bullet and set its position where you need it. Then whenever it is off screen, you can mark it inactive again and not have to delete it.
If you ever need more than 100 bullets, just expand the array.
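A rough sketch of that idea (the names and the fixed count of 100 are just illustrative):
#include <vector>

struct Bullet {
    float x = 0, y = 0;
    bool active = false;   // inactive bullets are skipped when updating/drawing
};

class BulletManager {
public:
    BulletManager() : bullets_(100) {}   // pre-create 100 inactive bullets

    // "Activate" an inactive bullet instead of allocating a new one.
    Bullet& spawn(float x, float y) {
        for (Bullet& b : bullets_) {
            if (!b.active) {
                activate(b, x, y);
                return b;
            }
        }
        // All bullets are in use: expand the array.
        // (Note: growing the vector invalidates references obtained earlier.)
        bullets_.emplace_back();
        activate(bullets_.back(), x, y);
        return bullets_.back();
    }

    // Mark a bullet inactive (e.g. once it goes off screen); nothing is deleted.
    void despawn(Bullet& b) { b.active = false; }

private:
    static void activate(Bullet& b, float x, float y) {
        b.active = true;
        b.x = x;
        b.y = y;
    }
    std::vector<Bullet> bullets_;
};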
Consider reading this article to learn more: Object Pool. It also has several other game pattern related topics.
The very least that happens when you allocate an object is a function call (its constructor). If that allocation is dynamic, there is also the cost of memory management, which at some point can become drastic due to fragmentation.
Would calling some function 10 times a second be really bad? No. Would creating and destroying many small objects dynamically 10 times a second be bad? Possibly. Should you be doing this? Absolutely not.
Even if the performance penalty is not "felt", it's not ok to have a suboptimal solution while an optimal one is immediately available.
So, instead of, for example, a std::list of objects that are dynamically added and removed, you can simply have a std::vector of bullets, where adding a bullet means appending to the vector (which, after it has reached a large enough size, shouldn't require any further memory allocation), and deleting means swapping the element being deleted with the last element and popping it from the vector (effectively just reducing the vector's size variable).
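A minimal sketch of that swap-and-pop removal (Bullet is a placeholder type):
#include <cstddef>
#include <vector>

struct Bullet { float x, y; };

std::vector<Bullet> bullets;

// Append: amortized O(1), and no allocation at all once capacity is large enough.
void addBullet(const Bullet& b) {
    bullets.push_back(b);
}

// Remove by index: overwrite the dead bullet with the last one, then shrink by one.
// O(1), but note that it does not preserve the order of the remaining bullets.
void removeBullet(std::size_t i) {
    bullets[i] = bullets.back();
    bullets.pop_back();
}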
Bear in mind that with every instantiation there is a place on the heap where memory needs to be allocated and a new instance created. This may affect performance. Try using a collection, and create an instance of that collection with however many bullets you want.
I have a class A whose objects are created dynamically:
A *object;
object = new A;
There will be many objects of A during a run of the program.
I have created a method in A to return the address of a particular object depending on the passed id:
A *A::get_obj(int id);
The implementation of get_obj requires iteration, so I chose a vector to store the addresses of the objects:
std::vector<A *> myvector;
I think another way to do this is to create a file and store each address as text on a particular line (the line number would be the id).
This will help me reduce memory usage, as I will not need to create the vector then.
What I don't know is: will this method consume more time than the vector method?
Any other option of doing the same is welcome.
Don't store pointers in files. Ever. The objects A are taking up more space than the pointers to them anyway. If you need more A's than you can hold onto at one time, then you need to create them as needed and serialize the instance data to disk if you need to get them back later before deleting, but not the address.
will this method consume more time than the vector method?
Yes, it will consume a lot more time - every lookup will take several thousand times longer. This does not hurt if lookups are rare, but if they are frequent, it could be bad.
this will help me reduce memory usage
How many object are you expecting to manage this way? Are you certain that memory-usage will be a problem?
any other option of doing the same is welcome
These are your two options, really. You can either manage the list in memory, or on disk. Depending on your usage scenario, you can combine both methods. You could, for instance, keep frequently used objects in memory, and write infrequently used ones out to disk (this is basically caching).
Storing your data in a file will be considerably slower than in RAM.
Regarding the data structure itself: if you usually use all the IDs, that is, if your vector would have few empty cells, then std::vector is probably the most suitable approach. But if your vector would have many empty cells, std::map may give you a better solution. It will consume less memory and give O(log N) access complexity.
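A small sketch of the two approaches (class A as in the question; the container choice is the only difference):
#include <map>
#include <vector>

class A { /* ... */ };

// Dense ids: index directly into a vector of pointers. O(1) lookup.
std::vector<A*> byIndex;   // byIndex[id] points to the object with that id

A* get_obj_dense(int id) {
    return (id >= 0 && id < static_cast<int>(byIndex.size())) ? byIndex[id] : nullptr;
}

// Sparse ids: a map only stores the ids that actually exist. O(log N) lookup.
std::map<int, A*> byId;

A* get_obj_sparse(int id) {
    std::map<int, A*>::const_iterator it = byId.find(id);
    return (it != byId.end()) ? it->second : nullptr;
}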
The most important thing here, imho, is the size of your data set and your platform. For a modern PC, handling an in-memory map of thousands of entries is very fast, but if you handle gigabytes of data, you'd better store it in a real on-disk database (e.g. MySQL).
Profiling my code, I see a lot of cache misses and would like to know whether there is a way to improve the situation. Optimization is not really needed; I'm more curious about whether there exist general approaches to this problem (this is a follow-up question).
// class to compute stuff
class A {
double compute();
...
// depends on other objects
std::vector<A*> dependencies;
};
I have a container class that stores pointers to all created objects of class A. I do not store copies as I want to have shared access. Before I was using shared_ptr, but as single As are meaningless without the container, raw pointers are fine.
class Container {
...
void compute_all();
std::vector<A*> objects;
...
};
The vector objects is insertion-sorted in such a way that the full evaluation can be done by simply iterating and calling compute() on each element; all dependencies in A are resolved along the way.
With a_i objects of class A, the evaluation might look like this:
a_1 => a_2 => a_3 --> a_2 --> a_1 => a_4 => ....
where => denotes iteration in Container and --> iteration over A::dependencies
Moreover, the Container class is created only once and compute_all() is called many times, so rearranging the whole structure after creation is an option and wouldn't harm efficiency much.
Now to the observations/questions:
Obviously, iterating over Container::objects is cache efficient, but accessing the pointees is definitely not.
Moreover, each object of type A has to iterate over A::dependencies, which again can produce cache misses.
Would it help to create a separate vector<A*> of all needed objects in evaluation order, such that dependencies in A are inserted as copies?
Something like this:
a_1 => a_2 => a_3 => a_2_c => a_1_c => a_4 -> ....
where a_i_c are copies from a_i.
Thanks for your help and sorry if this question is confusing, but I find it rather difficult to extrapolate from simple examples to large applications.
Unfortunately, I'm not sure if I'm understanding your question correctly, but I'll try to answer.
Cache misses are caused by the processor requiring data that is scattered all over memory.
One very common way of increasing cache hits is just organizing your data so that everything that is accessed sequentially is in the same region of memory. Judging by your explanation, I think this is most likely your problem; your A objects are scattered all over the place.
If you're just calling regular new every single time you need to allocate an A, you'll probably end up with all of your A objects being scattered.
You can create a custom allocator for objects that will be created many times and accessed sequentially. This custom allocator could allocate a large number of objects and hand them out as requested. This may be similar to what you meant by reordering your data.
It can take a bit of work to implement this, however, because you have to consider cases such as what happens when it runs out of objects, how it knows which objects have been handed out, and so on.
// This example is very simple. Instead of using new to create an Object,
// the code can just call Allocate() and use the pointer returned.
// This ensures that all Object instances reside in the same region of memory.
struct CustomAllocator {
CustomAllocator() : nextObject(cache) { }
Object* Allocate() {
return nextObject++;
}
Object* nextObject;
Object cache[1024];
};
Another method involves caching operations that work on sequential data, but aren't performed sequentially. I think this is what you meant by having a separate vector.
However, it's important to understand that your CPU doesn't just keep one section of memory in cache at a time. It keeps multiple sections of memory cached.
If you're jumping back and forth between operations on data in one section and operations on data in another section, this most likely will not cause many cache misses; your CPU can and should keep both sections cached at the same time.
If you're jumping between operations on 50 different sets of data, you'll probably encounter many cache misses. In this scenario, caching operations would be beneficial.
In your case, I don't think caching operations will give you much benefit. Ensuring that all of your A objects reside in the same section of memory, however, probably will.
Another thing to consider is threading, but this can get pretty complicated. If your thread is doing a lot of context switches, you may encounter a lot of cache misses.
+1 for profiling first :)
While using a custom allocator can be the correct solution, I'd certainly recommend two things first:
keep a reference/pointer to the entire vector of A instead of a vector of A*:
class Container {
...
void compute_all();
std::vector<A>* objects;
...
};
Use a standard library with custom allocators (I think Boost has some good ones; EASTL is centered around the very concept).
$0.02
Is there any pattern for dealing with a lot of object instantiations (40k per second) on a mobile device? I need these objects separately and they cannot be combined. Reusing objects would probably be a solution. Any hints?
Yes. Keep old objects in a pool and re-use them, if you can.
You will save massive amounts of time due to the cost of memory allocation and deletion.
I think you could consider these design patterns:
Object Pool
Factory
Further info
I hope this helps you too: Object Pooling for Generic C++ classes
If the objects are all the same size, try a simple cell allocator with an intrusive linked list of free nodes:
free:
add node to head of list
allocate:
if list is non-empty:
remove the head of the list and return it
else:
allocate a large block of memory
split it into cells of the required size
add all but one of them to the free list
return the other one
If allocation and freeing are all done in a single thread, then you don't need any synchronisation. If they're done in different threads, then possibly 40k context switches per second is a bigger worry than 40k allocations per second ;-)
You can make the cells be just "raw memory" (and either use placement new or overload operator new for your class), or else keep the objects initialized at all times, even when they're on the "free list", and assign whatever values you need to the members of "new" ones. Which you do depends how expensive initialization is, and probably is the technical difference between a cell allocator and an object pool.
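A minimal single-threaded sketch of that cell allocator (the block size and types are placeholders; real code would also need to release the blocks and handle alignment for arbitrary cell types):
#include <cstddef>
#include <cstdlib>

// Cell allocator for fixed-size cells, with an intrusive free list.
class CellAllocator {
public:
    explicit CellAllocator(std::size_t cellSize)
        : cellSize_(cellSize < sizeof(Node) ? sizeof(Node) : cellSize) {}

    void* allocate() {
        if (!freeList_) refill();            // list empty: carve up a new block
        Node* n = freeList_;                 // remove the head of the list and return it
        freeList_ = n->next;
        return n;
    }

    void free(void* p) {
        Node* n = static_cast<Node*>(p);     // add node to head of list
        n->next = freeList_;
        freeList_ = n;
    }

private:
    struct Node { Node* next; };

    void refill() {
        const std::size_t cellsPerBlock = 1024;
        // Allocate a large block of memory and split it into cells on the free list.
        char* block = static_cast<char*>(std::malloc(cellSize_ * cellsPerBlock));
        for (std::size_t i = 0; i < cellsPerBlock; ++i)
            free(block + i * cellSize_);
    }

    std::size_t cellSize_;
    Node* freeList_ = nullptr;
};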
You might be able to use the flyweight pattern if your objects are redundant. This pattern shares memory amongst similar objects. The classical example is the data structure used for graphical representation of characters in a word processing program.
Wikipedia has a summary.
There is an implementation in boost.
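As a rough illustration of the idea, here is a sketch using the classic glyph example (all names here are made up):
#include <map>
#include <memory>
#include <string>

// Intrinsic state shared by every occurrence of a character: the heavy part.
struct GlyphData {
    char symbol;
    std::string outline;   // imagine expensive font outline data here
};

// Flyweight factory: identical glyph data is created once and shared.
class GlyphFactory {
public:
    std::shared_ptr<const GlyphData> get(char c) {
        std::shared_ptr<const GlyphData>& slot = cache_[c];
        if (!slot)
            slot = std::make_shared<GlyphData>(GlyphData{c, buildOutline(c)});
        return slot;
    }
private:
    static std::string buildOutline(char c) { return std::string("outline-of-") + c; }
    std::map<char, std::shared_ptr<const GlyphData>> cache_;
};

// Each character in the document stores only its position (extrinsic state)
// plus a pointer to the shared, heavy glyph data.
struct Character {
    int position;
    std::shared_ptr<const GlyphData> glyph;
};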
Hard to say exactly how to improve your code without more information, but you probably want to check out the Boost Pool libraries. They all provide different ways of quickly allocating memory for different, specific use cases. Choose the one that fits your use case best.
If the objects are the same size, you can allocate a large chunk of memory and use placement new; that will help with the allocation cost, as the objects will all be in contiguous memory:
#include <cstdlib>   // std::malloc
#include <new>       // placement new

Object *pool = static_cast<Object *>( std::malloc( sizeof(Object) * numberOfObjects ) );
for(int i=0; i<numberOfObjects; i++)
    new (&pool[i]) Object();
// When finished, call pool[i].~Object() on each element and then std::free(pool).
I've used similar patterns for programming stochastic reaction-diffusion systems (millions of object creations per second on a desktop computer) and for real-time image processing (again, hundreds of thousands or millions per second).
The basic idea is as follows:
Create an allocator that allocates large arrays of your desired object; require that this object have a "next" pointer (I usually create a template that wraps the object with a next pointer).
Every time you need an object, get one from this allocator (using placement-new syntax to construct it in the block of memory it hands back).
Every time you're done with one, give it back to the allocator, which places it on a stack.
The allocator gives you something off the stack if the stack is nonempty, or something from its array buffer otherwise. If you run out of buffer, you can either allocate another larger buffer and copy the existing used nodes, or have the allocator maintain a stack of fully-used allocation blocks.
When you are done with all the objects, delete the allocator. Side benefit: you don't need to be sure to free each individual object; they'll all go away. Side cost: you'd better be sure to allocate anything you want to preserve forever on the heap instead of in this temporary buffer (or have a permanent buffer you use).
I generally get performance about 10x better than raw malloc/new when using this approach.
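A rough sketch of that scheme (a wrapper template adds the "next" pointer; the block size is a placeholder, and the grow case is simplified to keeping a list of full blocks that are all released when the allocator is destroyed):
#include <cstddef>
#include <vector>

// Wraps the payload object with an intrusive "next" pointer for the free stack.
template <typename T>
struct PoolNode {
    T value;           // assumed to be the first member, so value and node share an address
    PoolNode* next;
};

template <typename T>
class PoolAllocator {
public:
    ~PoolAllocator() {
        // Deleting the allocator releases every object at once.
        for (std::size_t i = 0; i < blocks_.size(); ++i) delete[] blocks_[i];
    }

    T* acquire() {
        if (freeStack_) {                                    // reuse a returned node if possible
            PoolNode<T>* n = freeStack_;
            freeStack_ = n->next;
            return &n->value;
        }
        if (used_ == BlockSize) {                            // buffer exhausted: grow by one block
            blocks_.push_back(new PoolNode<T>[BlockSize]);   // objects stay initialized at all times
            used_ = 0;
        }
        return &blocks_.back()[used_++].value;
    }

    void release(T* p) {
        PoolNode<T>* n = reinterpret_cast<PoolNode<T>*>(p);  // value is the node's first member
        n->next = freeStack_;                                // push it onto the free stack
        freeStack_ = n;
    }

private:
    static const std::size_t BlockSize = 4096;
    std::vector<PoolNode<T>*> blocks_;
    std::size_t used_ = BlockSize;                           // forces a block allocation on first use
    PoolNode<T>* freeStack_ = nullptr;
};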