I am building modelling software and have a few questions about how to get the best performance.
1) Should I use std::vector<class> or std::vector<class*>?
My class is quite complicated/big, and I think the second option is better: since std::vector tries to allocate memory contiguously, there might not be a contiguous block of memory large enough to store a million objects, but if I store only pointers, the objects themselves do not have to be contiguous, only the pointers do, and the computer is more likely to have room for that. Is this reasoning correct?
2) As I said, I will have millions of objects (for a proper simulation I will need more than a billion of them). Is inheritance a smart thing to use here?
For my simulation, there are multiple types that all inherit from the same base class:
class B : public A
class C : public A
class D : public A
Should I avoid inheritance, as I keep hearing that there is a performance penalty for using it?
3) Also, how do I store all these different classes in a std::vector?
Can a std::vector<base_class*> or std::vector<base_class> store objects of class B, class C, and class D, which all inherit from the base class?
4) In the previous version of the program, I used multithreading by making different threads handle different sections of the std::vector. Is there a better way to do the threading?
5) Should I use smart pointers? Since I have so many objects, will they degrade performance?
I am in the planning stage and any help is greatly appreciated.
I deal with problems like this every day in a professional setting (I'm a C++ programmer by trade, dealing with big data sets). As such, what I'm about to say here is as much personal advice as it is an answer. I won't go all out on the simple parts:
1 - Yes, store pointers; reallocating and moving pointers will be much faster than reallocating and moving the full class objects.
2 - Yes, use inheritance if the objects have related information; in this case they most likely do, since you're considering it. If they don't, why would you store them together?
3 - Store them all using smart pointers to the base class (the parent object). You can then add a single virtual "get_type" function that returns an enumeration, and convert to a child type when you need to. This saves the overhead of providing multiple virtual methods if you don't need the child data often.
4 - Arguable, but threading separate parts of a larger array is the simpler approach (and when you're dealing with hugely complex data, simpler is better).
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it? ~ Brian Kernighan
5 - There will be some small penalty for using smart pointers (as explained in this question). However, in my opinion that penalty (especially with unique_ptr) is so small compared to the ease of use and reduction in complexity that it's definitely worth it.
And putting it all together:
#include <memory>
#include <utility>
#include <vector>

enum ChildType {Child_1 = 0, Child_2 = 1};

class Abstract_Parent
{
public:
    virtual ~Abstract_Parent() = default;
    virtual ChildType GetType() = 0;
};

class Child_One : public Abstract_Parent
{
public:
    ChildType GetType() override { return Child_1; }
    void Do_Something_Specific() { /* child-only behaviour */ }
};

class Child_Two : public Abstract_Parent
{
public:
    ChildType GetType() override { return Child_2; }
};

std::vector<std::unique_ptr<Abstract_Parent>> Data;

void Some_Function()
{
    //this is how to insert a child-object
    std::unique_ptr<Abstract_Parent> Push_me_Back(new Child_One());
    Data.push_back(std::move(Push_me_Back));

    if(Data[0]->GetType() == Child_1)
    {
        Child_One *Temp_Ptr = dynamic_cast<Child_One*>(Data[0].get());
        Temp_Ptr->Do_Something_Specific();
    }
}
1.) That depends on your use case. You will need a pointer if you want to access objects through a base class pointer. On the other hand, you lose the advantage of contiguous memory and cache locality for code and data.
2.) If you need 1 billion instances, then every additional byte per object will increase your memory footprint. For example, an additional 8-byte pointer to the virtual function table (vptr) will increase your memory requirements by 8 GB. Storing every type in a different vector, without a virtual base class, does not have this overhead.
2b) Yes, you should avoid inheritance with virtual functions if you aim for performance. The instruction cache will be thrashed if virtual functions with different implementations are called in turn. At the very least you can sort your big vector by type to minimize this problem (see the sketch after this list).
3.) You must use the pointer option to prevent slicing if you go for a base class with virtual functions.
4.) More information is needed and should be answered in separate question.
5.) Every indirection will degrade performance.
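To make 2b) concrete, here is a minimal sketch of sorting a vector of base-class pointers by a type key, so that objects with identical implementations are processed back-to-back and the same vtable/code stays hot (the type() member is an assumption for illustration, not something from the original post):

#include <algorithm>
#include <memory>
#include <vector>

struct Base {
    virtual ~Base() = default;
    virtual int type() const = 0;   // assumed cheap type key, one value per concrete class
    virtual void step() = 0;
};

void sort_by_type(std::vector<std::unique_ptr<Base>>& objects) {
    // group equal types together; iteration afterwards hits the same
    // implementation of step() many times in a row
    std::sort(objects.begin(), objects.end(),
              [](const std::unique_ptr<Base>& a, const std::unique_ptr<Base>& b) {
                  return a->type() < b->type();
              });
}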
1) Should I use std::vector<class> or std::vector<class*>?
False dichotomy. There are a couple of other options:
boost::ptr_vector<class>
std::vector<std::unique_ptr<class>>
Probably even more.
Personally I like boost::ptr_vector<class>, as it stores an owned pointer (so memory allocation is handled automatically), but when accessing elements they are returned as references to the object (not pointers). Using them with standard algorithms is thus vastly simplified compared to the other techniques.
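As a minimal sketch of how that reads in practice (assuming Boost is available; the Particle type is illustrative):

#include <boost/ptr_container/ptr_vector.hpp>
#include <algorithm>

struct Particle {
    double mass = 1.0;
    void step() { /* ... */ }
};

int main() {
    boost::ptr_vector<Particle> particles;
    particles.push_back(new Particle());   // the container takes ownership
    particles.push_back(new Particle());

    particles[0].mass = 2.0;               // element access yields a reference, not a pointer

    // iteration also yields references, so standard algorithms read naturally
    std::for_each(particles.begin(), particles.end(),
                  [](Particle& p) { p.step(); });
}   // all Particles are deleted automatically here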
My class is quite complicated/big, and I think the second option is better: since std::vector tries to allocate memory contiguously, there might not be a contiguous block of memory large enough to store a million objects,
The real question here is whether you can pre-calculate the maximum size of your vector and reserve() the required amount of space. If you can do this (and thus avoid any cost of copying), std::vector<class> would be the best solution.
This is because having the objects in contiguous storage is usually a significant advantage in terms of speed (especially when scanning a vector). The ability to do this should not be underestimated when you have huge datasets (especially in the billion range).
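A minimal sketch of that approach (the Particle type and the element count are hypothetical):

#include <cstddef>
#include <vector>

struct Particle { double state[8]; };    // stand-in for the "big" value type

std::vector<Particle> make_particles(std::size_t n) {
    std::vector<Particle> v;
    v.reserve(n);                         // one allocation, contiguous storage, no copies later
    for (std::size_t i = 0; i < n; ++i)
        v.emplace_back();
    return v;
}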
but when I just store pointers, the objects do not have to be stored contiguously, only the pointers do, and the computer might have space to do this. Is this reasoning correct?
By using pointers, you are also significantly increasing the amount of memory required by the application as you need to store the object and the pointer to the object. Over billions of objects this can be a significant cost.
2) As I said, I will have millions of objects (for a proper simulation I will need more than a billion of them). Is inheritance a smart thing to use here?
Impossible to say without much more information.
3) Also, how do I store all these different classes in a std::vector? Can a std::vector<base_class*> or std::vector<base_class> store class B, class C, class D, which all inherit from the base class?
If you do use inheritance, you will not be able to use std::vector<class> directly; you will need to store a pointer to the base class. But that does not preclude the other three techniques.
4) In the previous version of the program, I used multithreading by making different threads handle different sections of the std::vector. Is there a better way to do the threading?
This seems a reasonable approach (assuming that the ranges don't overlap and are contiguous). Don't create more threads than you have available cores.
5) Should I use smart pointers? Since I have so many objects, will they degrade performance?
Use of unique_ptr over a normal pointer has zero overhead (assuming you don't use a custom deleter). The actual generated code will be basically equivalent.
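A quick way to convince yourself of this (the assertion holds on mainstream implementations for the default deleter, although the standard does not strictly guarantee it):

#include <memory>

struct Node { int value; };

// a unique_ptr with the default deleter is just a raw pointer in disguise
static_assert(sizeof(std::unique_ptr<Node>) == sizeof(Node*),
              "unique_ptr adds no per-object storage here");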
The problem:
I have a family of objects with a common base, and I need to be able to identify the specific concrete type via an integer value.
There are two obvious approaches to do that, however both come with unacceptable overheads in terms of memory or CPU time. Since the project deals with billions of objects, the tiniest overhead ends up being heavily pronounced; I have tested this, and it is not a case of premature optimization. The operations involved in processing the objects are all trivial, and the overhead of virtual calls diminishes performance tremendously.
a pure virtual int type() function implemented for every type; unfortunately that comes with the overhead of a virtual call for something as trivial as returning a static integer value
an int type member for every instance, set in the constructor, which introduces a 4-byte overhead for each of those billions of objects, wasting memory, polluting the cache and whatnot
I remember some time ago someone asking about "static virtual member variables", and naturally the answers boiled down to "no, that makes no sense", however being able to put a user variable in the vtable and having the ability to set its value for each specific type seems to be a very efficient solution to my problem.
This way both of the above-mentioned overheads are avoided: no virtual calls are necessary and there is no per-instance memory overhead either. The only overhead is the indirection to get the vtable, but considering the frequency of access of that data, it will most likely be kept in the CPU cache most of the time.
My current obvious option is to do "manual OOP" - do vtables manually in order to incorporate the necessary "meta" data into them as well, init the vtable pointer for every type and use awkward syntax to invoke pseudo "member" functions. Or even omit the use of a vtable pointer altogether, and store the id instead, and use that as an index for a table of vtables, which will be even more efficient, as it will avoid the indirection, and will shrink the size down, as I only need 2^14 distinct types.
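A minimal sketch of that "store a small id and index a table of function pointers" idea; all names and operations here are illustrative, not from the original code:

#include <cstdint>

struct Object {
    std::uint16_t type_id;             // 2 bytes instead of an 8-byte vptr
    double payload = 0.0;
};

struct TypeOps {                        // one manually-built "vtable" row per concrete type
    void (*process)(Object&);
};

void process_a(Object& o) { o.payload += 1.0; }
void process_b(Object& o) { o.payload *= 2.0; }

const TypeOps type_table[] = { {process_a}, {process_b} };

inline void process(Object& o) {
    type_table[o.type_id].process(o);   // dispatch via the stored id, no per-object vptr
}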
It would be nice if I can avoid reinventing the wheel. I am not picky about the solution as long as it can give me the efficiency guarantees.
Maybe there is a way to have my type id integer in the vtable, or maybe there is another way altogether, which is highly possible since I don't keep up with the trends, and C++ got a lot of new features in the last few years.
Naturally, those ids would need to be uniform and consistent, rather than some arbitrary values of whatever the compiler cooks up internally. If that wasn't a requirement, I'd just use the vtable pointer values for an even more efficient solution that avoids indirection.
Any ideas?
If you have way more instances than you have types, then the most straightforward solution is to abstract at the level of a homogeneous container rather than a single instance.
Instead of:
{PolymorphicContainer}: Foo*, Bar*, Baz*, Foo*, Bar*, Bar*, Baz*, ...
... and having to store some type information (vtable, type field, etc) to distinguish each element while accessing memory in the most sporadic ways, you can have:
{FooContainer}: Foo, Foo, Foo, Foo, Foo, ...
{BarContainer}: Bar, Bar, Bar, Bar, Bar, ...
{BazContainer}: Baz, Baz, Baz, Baz, Baz, ...
{PolymorphicContainer}: FooContainer*, BarContainer*, BazContainer*
And you store the type information (vtable or what not) inside the containers. That does mean you need access patterns of a kind that tend to be more homogeneous, but often such an arrangement can be made in most problems I've encountered.
Gamedevs used to do things like sort polymorphic base pointers by subtype while using a custom allocator for each to store them contiguously. That combination of sorting by base pointer address and allocating each type from separate pools makes it so you then get the analogical equivalent of:
Foo*, Foo*, Foo*, Foo*, ..., Bar*, Bar*, Bar*, Bar*, ..., Baz*, Baz*, Baz*, ...
With most of them stored contiguously because they each use a custom allocator which puts all the Foos into contiguous blocks separate from all the Bars, e.g. Then on top of spatial locality you also get temporal locality on the vtables if you access things in a sequential pattern.
But that's more painful to me than abstracting at the level of the container, and doing it that way still requires the overhead of two pointers (128-bits on 64-bit machines) per object (a vptr and a base pointer to the object itself). Instead of processing orcs, goblins, humans, etc, individually through a Creature* base pointer, it makes sense to me to store them in homogeneous containers, abstract that, and process Creatures* pointers which point to entire homogeneous collections. Instead of:
class Orc: public Creature {...};
... we do:
// vptr only stored once for all orcs in the entire game.
class Orcs: public Creatures
{
public:
// public interface consists predominantly of functions
// which process entire ranges of orcs at once (virtual
// dispatch only paid once possibly for a million orcs
// rather than a million times over per orc).
...
private:
struct OrcData {...};
std::vector<OrcData> orcs;
};
Instead of:
for each creature:
creature.do_something();
We do:
for each creatures:
creatures.do_something();
Using this strategy, if we need a million orcs in our video game, we'd cut the costs associated with virtual dispatch, vptrs, and base pointers to 1/1,000,000th of the original cost, not to mention you get very optimal locality of reference as well free of charge.
If in some cases we need to do something to a specific creature, you might be able to store a two-part index (might be able to fit it in 32-bits or maybe 48) storing creature type index and then relative creature index in that container, though this strategy is most beneficial when you don't have to call functions just to process one creature in your critical paths. Generally you can fit this into 32-bit indices or possibly 48-bits if you then set a limit for each homogeneous container of 2^16 before it is considered "full" and you create another one for the same type, e.g. We don't have to store all the creatures of one type in one container if we want to cram our indices.
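For illustration, a minimal sketch of such a packed two-part handle, assuming the 16-bit type index / 16-bit element index split mentioned above:

#include <cstdint>

struct CreatureHandle {
    std::uint32_t bits;                 // high 16 bits: type, low 16 bits: index in that container

    static CreatureHandle make(std::uint16_t type, std::uint16_t index) {
        return { static_cast<std::uint32_t>(type) << 16 | index };
    }
    std::uint16_t type()  const { return static_cast<std::uint16_t>(bits >> 16); }
    std::uint16_t index() const { return static_cast<std::uint16_t>(bits & 0xFFFF); }
};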
I can't say for sure if this is applicable to your case because it depends on access patterns, but it is generally the first solution I consider when you have performance issues associated with polymorphism. The first way I look at it is that you're paying the costs like virtual dispatch, loss of contiguous access patterns, loss of temporal locality on vtables, memory overhead of vptr, etc. at too granular of a level. Make the design coarser (bigger objects, like objects representing a whole collection of things, not an individual object per thing) and the costs become negligible again.
Whatever the case may be, instead of thinking about this in terms of vtables and what not, think of it in terms of how you arrange data, just bits and bytes, so that you don't have to store a pointer or integer with every single little object. Draw things out thinking just about bits and bytes, not classes and vtables and virtual functions and nice public interfaces and so forth; think about that later, after you settle on a memory representation/layout.
I find this so much easier to think about for data-oriented designs with performance-critical needs well-anticipated upfront than trying to think about language mechanisms and nice interface designs and all that. Instead I think in a C-like way first of just bits and bytes and communicate and sketch my ideas as structs and figure out where the bits and bytes should go. Then once you figure that out, you can figure out how to put a nice interface on top.
Anyway, to avoid the overhead of type information per tiny object, you need to group them together somehow in memory and store that analogical type field once per group instead of once per element in the group. Allocating elements of a particular type in a uniform way might also give you that information based on their pointer address or index, e.g. There are many ways to go about this, but just think about it in terms of data stored in memory as a general strategy.
The answer is somewhat embedded in your question topic:
Most efficient way to get an integer type id in a family of common
base types [...]
You store the integer ID once per family or at least once per multiple objects in that family instead of once per object. That's the only way, however you approach it, to avoid storing it once per object unless the info is already available. The alternative is to deduce it from some other information available, like you might be able to deduce it from the object's index or pointer address, at which point storing the ID would just be redundant information.
I had my doubts since I first saw where it leads, but now that I look at some code I have (medium-ish beginner), it strikes me as not only ugly, but potentially slow?
If I have a struct S inside a class A, called with class B (composition), and I need to do something like this:
struct S { int x[3] {1, 2, 3}; };

class A { public: S *s;  A() : s {new S} {} };

class B { public: B(A *a) { a->s->x[1] = 4; } };
How efficient is this chain: a->s->x[1]? Is this ugly and unnecessary? A potential drag? If there are even more levels in the chain, is it that much uglier? Should this be avoided? Or, if by any chance none of the previous, is it a better approach than:
class A { public: S s; };

class B { public: B(A *a) { a->s.x[1] = 4; } };
It seems slower like this, since (if I got it right) I have to make a copy of the struct, rather than working with a pointer to it. I have no idea what to think about this.
is it a better approach
In the case you just showed no, not at all.
First of all, in modern C++ you should avoid owning raw pointers, which means you should (almost) never use new directly. Use one of the smart pointers that fits your needs:
std::unique_ptr for sole ownership.
std::shared_ptr for multiple objects -> same resource.
I can't give you exact performance numbers, but direct access to a member s held by value will never be slower than access through a member s that has to be dereferenced first. You should always go for the non-pointer way here.
But take another step back. You don't even need pointers here in the first place: s should just be an object, like in your 2nd example, and the pointer in B's constructor should be replaced with a reference.
I have to make a copy of the struct, rather than working with a
pointer to it.
No, no copy will be made.
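Putting that advice together, a minimal sketch with S held by value inside A, and B taking A by reference (names reused from the question):

struct S { int x[3] {1, 2, 3}; };

class A {
public:
    S s;                                 // plain member: no new, no delete, no indirection
};

class B {
public:
    explicit B(A& a) { a.s.x[1] = 4; }   // no pointer involved, and no copy of S is made
};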
The real cost of using pointers to objects in many iterations is not necessarily the dereferencing of the pointer itself, but the potential cost of loading another cache line into the CPU cache. As long as the pointer points to something within the currently loaded cache line, the cost is minimal.
Always avoid dynamic allocation with new wherever possible, as it is potentially a very expensive operation, and requires an indirection operation to access the thing you allocated. If you do use it, you should also be using smart pointers, but in your case there is absolutely no reason to do so - just have an instance of S (a value, not a pointer) inside your class.
If you consider a->s->x[1] = 4 as ugly, then it is rather because of the chain than because of the arrows, and a->s.x[1] = 4 is ugly to the same extent. In my opinion, the code exposes S more than necessary, though there may sometimes exist good reasons for doing so.
Performance is one thing that matters; others are maintainability and adaptability. A chain of member accesses usually supports the principle of information hiding to a lesser extent than designs where such chains are avoided; the involved objects (and therefore the involved code) are more tightly coupled than otherwise, and this usually comes at the cost of maintainability (see, for example, the Law of Demeter as a design principle towards better information hiding):
In particular, an object should avoid invoking methods of a member
object returned by another method. For many modern object oriented
languages that use a dot as field identifier, the law can be stated
simply as "use only one dot". That is, the code a.b.Method() breaks
the law where a.Method() does not. As an analogy, when one wants a dog
to walk, one does not command the dog's legs to walk directly; instead
one commands the dog which then commands its own legs.
Suppose, for example, that you change the size of array x from 3 to 2, then you have to review not only the code of class A, but potentially that of any other class in your program.
However, if we avoid exposing too much of component S, class A could be extended by a member function int setSAt(int i, int value), which can then also check, for example, array boundaries; changing S then influences only those classes that have S as a component:
B(A *a) { a->setSAt(1,4); }
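For completeness, a fuller sketch of that idea with a hypothetical bounds check added (names are illustrative):

#include <stdexcept>

struct S { int x[3] {1, 2, 3}; };

class A {
public:
    int setSAt(int i, int value) {
        if (i < 0 || i >= 3) throw std::out_of_range("index into S::x");
        s.x[i] = value;                  // S stays a hidden implementation detail of A
        return value;
    }
private:
    S s;
};

class B {
public:
    explicit B(A *a) { a->setSAt(1, 4); }
};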
I'm making a little game in C++. I found answers on StackExchange sites about cache coherency, and I would like to use it in my game, but I'm using child classes of an abstract class, Entity.
I'm storing all entities in a std::vector so that I can access virtual functions in loops. Entity::update() is a virtual function of Entity overridden by subclasses like PlayerEntity.
In Game.hpp - Private Member Variables:
std::vector<Entity*> mEntities;
PlayerEntity* mPlayer;
In Game.cpp - Constructor:
mPlayer = new PlayerEntity();
mEntities.push_back(mPlayer);
Here's what my update function (in the main loop) looks like:
void Game::update() {
for (Entity* entity : mEntities) {
entity->update(mTimeStep, mGeneralClock.getElapsedTime().asMilliseconds());
}
}
My question is:
How do I make my entity objects sit next to each other in memory, and thus achieve cache coherency?
I tried to simply make the vector of pointers a vector of objects and make the appropriate changes, but then I couldn't use polymorphism for obvious reasons.
Side question: what determines where an object is allocated in memory?
Am I doing the whole thing wrong? If so, how should I store my entities?
Note: I'm sorry if my english is bad, I'm not a native speaker.
Obviously, first measure which parts are even worth optimizing. Not all games are created equal, and not all code within a game is created equal. There is no use in completely restructuring the script that triggers the end boss's death animation to make it use 1 cache line instead of 2. That said...
If you are aiming for optimizing for cache, forget about inheritance and virtual functions. Or at least be critical of them. As you note, creating a contiguous array of polymorphic objects is somewhere between hard & error-prone and completely infeasible (depending on whether subclasses have different sizes).
You can attempt to create a pool, to have nearby entities (in the entities vector) more likely to be close to each other (in memory), but frankly I doubt you'll do much better than a state of the art general-purpose allocator, especially when the entities' size and lifetime varies significantly. A pool would only help if entities adjacent in the vector are allocated back-to-back. But in that case, any standard allocator gives the same locality advantages. It's not like tcmalloc and friends select a random cache line to allocate from just to annoy you.
You might be able to squeeze a bit of memory out of knowing your object types, but this is purely hypothetical and would have to be proven first to justify the effort of implementing it. Also note that a run-of-the-mill pool either assumes that all objects are the same size, or that you never deallocate individual objects. Allowing both puts you halfway towards a general-purpose allocator, which you're bound to do worse than.
You can segregate objects based on their types. That is, instead of a single vector of polymorphic Entitys with virtual functions, have N vectors: vector<Bullet>, vector<Monster>, vector<Loot>, and so on. This is less insane than it sounds, for three reasons:
Often, you can pull out the entire business of managing one such vector into a dedicated system (see the sketch after this list). So in the end you might even have a vector<System *> where each System has a vector for one kind of thing, and updates all those things in a single virtual call (delegating to many statically-dispatched calls).
You don't need to represent everything ever in this abstraction. Not every little integer needs to be wrapped in its own type of entity.
If you go further down this route and take hints from entity component systems, you also gain an alternative to inheritance for code reuse (class Monster : Entity {}; class Skeleton : Monster {};) that plays nicer with the hard-earned cache friendliness.
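A minimal sketch of the per-type System idea from the first point in the list above (Bullet/BulletSystem are illustrative names):

#include <memory>
#include <vector>

struct System {
    virtual ~System() = default;
    virtual void update(float dt) = 0;          // one virtual call per *type*, not per object
};

struct Bullet { float x = 0, vx = 1; };

struct BulletSystem : System {
    std::vector<Bullet> bullets;                // contiguous, cache-friendly storage
    void update(float dt) override {
        for (Bullet& b : bullets)               // statically dispatched, easily inlined inner loop
            b.x += b.vx * dt;
    }
};

struct Game {
    std::vector<std::unique_ptr<System>> systems;
    void update(float dt) {
        for (auto& s : systems) s->update(dt);  // virtual dispatch paid once per system
    }
};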
It is not easy because polymorphism doesn't work well with cache coherency.
I think the best you can do is overload the base class new operator to allocate memory from a pool. But to do this, you need to know the size of all derived classes, and after some allocating/deallocating you can end up with memory fragmentation, which will lower the gain.
Have a look at Cachegrind, it's a tool that simulates how your program interacts with a machine's cache hierarchy.
I'm storing a large amount of computed data and I'm currently using a polymorphic type to reduce the amount of storage required. Everything is extremely fast except for deleting the objects when I'm finished and I think there must be a better alternative. The code computes the state at each step and depending on the conditions present it needs to store certain values. The worst case is storing the full object state and the best state is storing almost nothing. The (very simplified) setup is as follows:
class BaseClass
{
public:
virtual ~BaseClass() { }
double time;
unsigned int section;
};
class VirtualSmall : public BaseClass
{
public:
double values[2];
int othervalue;
};
class VirtualBig : public BaseClass
{
public:
double values[16];
int othervalues[5];
};
...
std::vector<BaseClass*> results(10000);
The appropriate object type is generated during computation and a pointer to it is stored in the vector. The overhead from vtable + pointer is overall much smaller than the size difference between the largest and smallest object (which is at least 200 bytes according to sizeof). Since the smallest object can often be used instead of the largest, and there are potentially many tens of millions of them stored, it can save a few gigabytes of memory usage. The results can then be searched extremely fast, as the base class contains the information necessary to find the correct item, which can then be dynamic_cast back to its real type. It works very well for the most part.
The only issue is with delete. It takes a few seconds to free all of the memory when there are many tens of millions of objects. The delete code iterates through each object and calls delete results[i], which invokes the virtual destructor. While it's not impossible to work around, I think there must be a more elegant solution.
It could definitely be done by allocating largish contiguous blocks of memory (with malloc or similar), which are kept track of, and then something hands out correct pointers to the next chunk of free memory inside a block. That pointer is then stored in the vector. To free the memory, the much smaller number of large blocks need to have free() called on them. There is no longer a vtable (it can be replaced by a smaller type field to ensure the correct cast), which saves space as well. It is very much a C-style solution though, and not particularly pretty.
Is there a C++ style solution to this type of problem I'm overlooking?
You can overload the new operator (i.e. void* VirtualSmall::operator new(size_t)) for your classes, and implement it to obtain memory from custom allocators. I would use one block allocator for each derived class, so that each block size is a multiple of the size of the class it's supposed to store.
When it's time to clean up, tell each allocator to release all of its blocks. No destructors will be called, so make sure you don't need them.
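A minimal sketch of that approach, assuming a simple fixed-size "bump" arena per class (all names are illustrative, and it relies on malloc's alignment and the slot size being suitable for the stored type):

#include <cstddef>
#include <cstdlib>
#include <vector>

class Arena {
public:
    Arena(std::size_t slot_size, std::size_t slots_per_block)
        : slot_(slot_size), per_block_(slots_per_block), used_(slots_per_block) {}
    ~Arena() { release_all(); }

    void* allocate() {
        if (used_ == per_block_) {                       // current block is full
            blocks_.push_back(std::malloc(slot_ * per_block_));
            used_ = 0;
        }
        return static_cast<char*>(blocks_.back()) + slot_ * used_++;
    }
    void release_all() {                                 // frees every block in one shot;
        for (void* b : blocks_) std::free(b);            // no destructors are run
        blocks_.clear();
        used_ = per_block_;
    }
private:
    std::size_t slot_, per_block_, used_;
    std::vector<void*> blocks_;
};

struct PooledSmall {                                     // stand-in for a derived result type
    double values[2];
    static Arena& arena() {
        static Arena a(sizeof(PooledSmall), 1 << 16);
        return a;
    }
    static void* operator new(std::size_t) { return arena().allocate(); }
    static void operator delete(void*) noexcept {}       // individual delete is a no-op here
};

Freeing tens of millions of objects then collapses into one release_all() call per class, at the price of never running destructors, exactly as described above.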
This is a pretty basic question but I'm still unsure:
If I have a class that will be instantiated millions of times -- is it advisable not to derive it from some other class? In other words, does inheritance carry some cost (in terms of memory or runtime to construct or destroy an object) that I should be worrying about in practice?
Example:
class Foo : public FooBase { // should I avoid deriving from FooBase?
// ...
};
int main() {
// constructs millions of Foo objects...
}
Inheriting from a class costs nothing at runtime.
The class instances will of course take up more memory if you have variables in the base class, but no more than if they were in the derived class directly and you didn't inherit from anything.
This does not take into account virtual methods, which do incur a small runtime cost.
tl;dr: You shouldn't be worrying about it.
I'm a bit surprised by some of the responses/comments so far...
does inheritance carry some cost (in terms of memory)
Yes. Given:
namespace MON {
class FooBase {
public:
FooBase();
virtual ~FooBase();
virtual void f();
private:
uint8_t a;
};
class Foo : public FooBase {
public:
Foo();
virtual ~Foo();
virtual void f();
private:
uint8_t b;
};
class MiniFoo {
public:
MiniFoo();
~MiniFoo();
void f();
private:
uint8_t a;
uint8_t b;
};
class MiniVFoo {
public:
MiniVFoo();
virtual ~MiniVFoo();
void f();
private:
uint8_t a;
uint8_t b;
};
} // << MON
extern "C" {
struct CFoo {
uint8_t a;
uint8_t b;
};
}
on my system, the sizes are as follows:
32 bit:
FooBase: 8
Foo: 8
MiniFoo: 2
MiniVFoo: 8
CFoo: 2
64 bit:
FooBase: 16
Foo: 16
MiniFoo: 2
MiniVFoo: 16
CFoo: 2
runtime to construct or destroy an object
Additional function call overhead and virtual dispatch where needed (including destructors where appropriate). This can cost a lot, and some really obvious optimizations such as inlining may not be performed.
The entire subject is much more complex, but that should give you an idea of the costs.
If speed or size is truly critical, then you can often use static polymorphism (e.g. templates) to achieve an excellent balance between performance and ease of programming.
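For example, a minimal CRTP sketch (illustrative, not part of the measured test below) where the call resolves at compile time, can be inlined, and no vptr is added to the object:

#include <cstdint>

template <class Derived>
class FooCRTP {
public:
    void f() { static_cast<Derived*>(this)->f_impl(); }  // static dispatch, trivially inlinable
};

class FastFoo : public FooCRTP<FastFoo> {
public:
    void f_impl() { ++a; }
private:
    std::uint8_t a = 0;
};
// sizeof(FastFoo) stays tiny: no vptr is added.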
Regarding CPU performance, I created a simple test which created millions of these types on the stack and on the heap and called f. The results are:
FooBase 16.9%
Foo 16.8%
Foo2 16.6%
MiniVFoo 16.6%
MiniFoo 16.2%
CFoo 15.9%
Note: Foo2 derives from Foo.
In the test, the allocations are added to a vector and then deleted. Without this stage, the CFoo was entirely optimized away. As Jeff Dege posted in his answer, allocation time will be a huge part of this test.
Pruning the allocation functions and vector create/destroy from the sample produces these numbers:
Foo 19.7%
FooBase 18.7%
Foo2 19.4%
MiniVFoo 19.3%
MiniFoo 13.4%
CFoo 8.5%
which means the virtual variants take over twice as long as the CFoo to execute their constructors, destructors and calls, and MiniFoo is about 1.5 times faster.
While we're on allocation: if you can use a single type for your implementation, you also reduce the number of allocations you must make in this scenario, because you can allocate an array of 1M objects rather than creating a list of 1M addresses and then filling it with uniquely new'd objects. Of course, there are special-purpose allocators which can reduce this weight. Since allocation/free times are the bulk of this test, doing so would significantly reduce the time you spend allocating and freeing objects.
Create many MiniFoos as array 0.2%
Create many CFoos as array 0.1%
Also keep in mind that the sizes of MiniFoo and CFoo consume 1/4 to 1/8 the memory per element, and a contiguous allocation removes the need to store pointers to dynamic objects. You could then keep track of an object in an implementation in other ways (a pointer or an index), and the array can also significantly reduce allocation demands on clients (uint32_t vs. pointer on a 64-bit arch) -- plus all the bookkeeping required by the system for the allocations (which is significant when dealing with so many small allocations).
Specifically, the sizes in this test consumed:
32 bit
267MB for dynamic allocations (worst)
19MB for the contiguous allocations
64 bit
381MB for dynamic allocations (worst)
19MB for the contiguous allocations
This means that the required memory was reduced by a factor of more than ten, and the time spent allocating/freeing is significantly better than that!
Static-dispatch implementations can be several times faster than mixed or dynamic dispatch. This typically gives the optimizers more opportunities to see more of the program and optimize it accordingly.
In practice, dynamic types tend to export more symbols (methods, dtors, vtables), which can noticeably increase the binary size.
Assuming this is your actual use case, then you can improve the performance and resource usage significantly. I've presented a number of major optimizations... just in case somebody believes changing the design in such a way would qualify as 'micro'-optimizations.
Largely, this depends upon the implementation. But there are some commonalities.
If your inheritance tree includes any virtual functions, the compiler will need to create a vtable for each class - a jump table with pointers to the various virtual functions. Every instance of those classes will carry along a hidden pointer to its class's vtable.
And any call to a virtual function will involve a hidden level of indirection - rather than jumping to a function address that had been resolved at link time, a call will involve reading the address from the vtable and then jumping to that.
Generally speaking, this overhead isn't likely to be measurable on any but the most time-critical software.
OTOH, you said you'd be instantiating and destroying millions of these objects. In most cases, the largest cost isn't constructing the object, but allocating memory for it.
IOW, you might benefit from using your own custom memory allocators, for the class.
http://www.cprogramming.com/tutorial/operator_new.html
I think we have all been programming too much as lone wolves.
We forget to take into account the cost of maintenance + readability + extensions with regard to features.
Here is my take
Inheritance Cost++
On smaller projects: time to develop increases. It is easy to write it all as global sudoku code; it has always taken me more time to write a class hierarchy that does the right_thing.
On smaller projects: time to modify increases. It is not always easy to modify the existing code to conform to the existing interface.
Time to design increases.
The program is slightly inefficient due to message passing through multiple layers, rather than exposed guts (I mean data members. :))
For virtual function calls via a pointer to the base class, there is one single extra dereference.
There is a small space penalty in terms of RTTI
For sake of completeness I will add that, too many classes will add too many types and that is bound to increase your compilation time, no matter how small it might be.
There is also the cost, for the run-time system, of tracking objects in terms of their base class, which obviously means a slight increase in code size plus a slight runtime performance penalty due to the exception-delegation mechanism (whether you use it or not).
You don't have to twist your arm unnaturally in the manner of PIMPL if all you want is to insulate users of your interface functions from being recompiled. (That IS a HEAVY cost, trust me.)
Inheritance Cost--
As the program grows beyond one or two thousand lines, it becomes more maintainable with inheritance. If you are the only one programming, then you can easily push code without objects up to 4k/5k lines.
Cost of bug fixing reduces.
You can easily extend the existing framework for more challenging tasks.
I know I am being a bit of a devil's advocate, but I think we have to be fair.
If you need the functionality of FooBase in Foo, you can either derive or use composition. Deriving has the cost of the vtable pointer, while composition has the cost of a pointer to a FooBase, the FooBase itself, and the FooBase's vtable. So they are (roughly) similar, and you shouldn't have to worry about the cost of inheritance.
Creating a derived object involves calling constructors for all base classes, and destroying it invokes destructors for those classes. The cost then depends on what these constructors do, but if you don't derive and instead include the same functionality directly in the class, you pay the same cost. In terms of memory, every object of the derived class contains an object of its base class, but again, it's exactly the same memory usage as if you just included all of those fields in the class instead of deriving.
Be aware that in many cases it's a better idea to compose (have a data member of the 'base' class rather than deriving from it), in particular if you're not overriding virtual functions and your relationship between 'derived' and 'base' is not an "is a kind of" relationship. But in terms of CPU and memory usage, both techniques are equivalent.
The fact is that if you are in doubt of whether you should inherit or not, the answer is that you should not. Inheritance is the second most coupling relationship in the language.
As for the performance difference, there should be almost none in most cases, unless you start using multiple inheritance: if one of the bases has virtual functions and the base subobject is not aligned with the final overrider's object, dispatch carries an additional (minimal, negligible) cost because the compiler adds a thunk to adjust the this pointer.