Avoiding repeated C++ virtual table lookup

I have a C++ program that reads a config file when the binary is executed, creates a number of child class instances based on the config file, and then periodically iterates over these instances and calls their respective virtual functions.
Gprof is telling me that these function calls are taking up a lot of time (the aforementioned iteration happens very frequently), so I want to try to avoid the repeated virtual function calls somehow.
The code is similar to the following. Once the program populates vector v at the start of the program, this vector won't change anymore for the rest of the program, so it seems inefficient to repeatedly have to do a virtual table lookup every time I want to call f(). I would think there must be a way to cache or save the function pointers somehow, but I'm not sure how.
Would love any suggestions you have on speeding things up. Thank you!
Edit: Sorry, I forgot to mention that the function calls f() for the vector of Child instances has to be in order from 0 to v.size() - 1, so I can't group together the elements of v that have the same derived type.
Also, this was built with -O3 -std=c++14
#include <vector>

using std::vector;

class Parent {
public:
    virtual void f() { }
};

class Child1 : public Parent {
public:
    void f() override { /* do stuff for child1 */ }
};

//...

class Child9 : public Parent {
public:
    void f() override { /* do stuff for child9 */ }
};

int main() {
    vector<Parent*> v;
    // read config file and add Child instances to v based on the file contents
    while (true) {
        // do other stuff
        for (size_t i = 0; i != v.size(); ++i) {
            v[i]->f(); // expensive to do the same virtual table lookups every loop!
        }
    }
}

Based on some of the questions and your answers in the comments, here are a few considerations.
1) Your problem (if there is one, your solution might already be close to optimal, depending on details you have not mentioned) is most likely somewhere else, not in the overhead of a virtual function call.
If you really run this in a tight loop, and there's not much going on in the implementations of f() that touches a lot of memory, your vtables probably remain in the L1 cache, and the virtual function call overhead will be absolutely minimal, if any, on modern hardware.
2) You say "the functions f() themselves are very simple, for example one of them just multiplies the values at two memory addresses and stores the product in a third address" - this might not be as innocent as you expect. For reference, going to L1 cache will cost you about 3 cycles, while going to RAM may cost as much as 60-200 cycles, depending on your hardware.
If you have enough of these objects (so that keeping all of the memory they reference in L1 cache is not possible), and the memory locations they reference are basically random (so that prefetching is ineffective), and/or you touch enough things in the rest of your program (so that all the relevant data gets vacated from cache between the loops over your vector), the cost of fetching and storing the values from and to memory/lower levels of cache will outweigh the cost of the virtual function calls by orders of magnitude in the worst case.
3) You iterate over a vector of pointers to objects - not the objects themselves.
Depending on how you allocate the objects and how big they are, this might not be an issue - prefetching will do wonders for you if you allocate them in a tight loop and your allocator packs them nicely. If, however, you allocate/free a lot of other things and mix in the allocations of these objects in between, they may end up located sparsely and in basically random locations in memory; then iterating over them in the order of creation will involve a lot random reads from memory, which will again be far slower than any virtual function overhead.
4) You say "calls to f() for the vector of children has to be in order" - do they?
If they do, then you are out of luck in some ways. If, however, you can re-architect your system so that they can be called ordered by type, then there is a lot of speed to be gained in various aspects - you could probably allocate an array of each type of object (nice, dense packing in memory), iterate over them in order (prefetcher friendly), and call your f()'s in groups for a single, well known type (inlining friendly, instruction cache friendly).
5) And finally - if none of the above applies and your problem is really in virtual function calls (unlikely), then, yes, you can try storing a pointer to the exact function you need to call for each object in some fashion - either manually or by using one of the type erasure / duck typing methods others have suggested.
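For example, here is a minimal sketch of that last idea: cache a plain function pointer per element when the config file is read. The call_f thunks, the calls vector and run_all are hypothetical names, not from the question, and note that the call is still indirect, so the branch predictor is still involved; this only removes the vtable load.
#include <utility>
#include <vector>

// One thunk per concrete type; the qualified call bypasses the vtable.
template <class C>
void call_f(Parent* p) {
    static_cast<C*>(p)->C::f();
}

// Built once, while reading the config file, e.g.:
//   calls.emplace_back(new Child1(), &call_f<Child1>);
std::vector<std::pair<Parent*, void (*)(Parent*)>> calls;

void run_all() {
    // Hot loop: one load of the cached pointer, one indirect call, no vtable read.
    for (auto& c : calls)
        c.second(c.first);
}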
My main point is this - there are a lot of performance benefits to be had from changing the architecture of your system in some ways.
Remember: accessing things that are already in L1/L2 cache is good, having to go to L3/RAM for data is worse; accessing memory in a sequential order is good, jumping all over memory is bad; calling the same method in a tight loop, potentially inlining it, is good, calling a lot of different methods in a tight loop is worse.
If this is a part of your program the performance of which really matters, you should consider changing the architecture of your system to allow for some of the previously mentioned optimizations. I know this may seem daunting, but that is the game we are playing. Sometimes you need to sacrifice "clean" OOP and abstractions for performance, if the problem you are solving allows for it.

Edit: For a vector of arbitrary child types mixed together, I recommend going with the virtual call.
If, depending on config, there were a vector of only one child type - or if you can separate the different types into separate containers - then this could be a case where compile time polymorphism might be an option instead of a runtime one. For example:
template<class Child, class Range>
void f_for(Range& r) {
    for (Parent* p : r) {
        Child* c = static_cast<Child*>(p);
        c->Child::f(); // use static dispatch to avoid virtual lookup
    }
}
...
if (config)
    f_for<Child1>(v);
else
    f_for<Child2>(v);
An alternative to explicit static dispatch would be to mark the child class or the member function final.
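For illustration, a small sketch of the final variant under the same single-type assumption (f_for_final is a made-up name); once the compiler knows the dynamic type cannot change any further, it is allowed to devirtualize and even inline the call:
class Child1 final : public Parent {   // alternatively, mark just the function: void f() final;
public:
    void f() override { /* do stuff for child1 */ }
};

void f_for_final(std::vector<Child1*>& v) {
    for (Child1* c : v)
        c->f();  // Child1 is final, so the compiler may devirtualize this call
}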
You might even expand the static portion of the program so that you get to use vector<Child1> or vector<Child2> directly, avoiding the extra indirection. At this point the inheritance is not even necessary.

Related

How efficient is accessing variables through a chain->of->pointers?

I've had my doubts since I first saw where this leads, but now that I look at some code I have (I'm a medium-ish beginner), it strikes me as not only ugly, but potentially slow.
If I have a struct S inside a class A, called with class B (composition), and I need to do something like this:
struct S { int x[3] {1, 2, 3}; };
// inside class A:
S *s;
A() : s {new S} {}
// inside class B:
B(A *a) { a->s->x[1] = 4; }
How efficient is this chain: a->s->x[1]? Is this ugly and unnecessary? A potential drag? If there are even more levels in the chain, is it that much uglier? Should this be avoided? Or, if by any chance none of the previous, is it a better approach than:
// inside class A:
S s;
// inside class B:
B(A *a) { a->s.x[1] = 4; }
It seems slower like this, since (if I got it right) I have to make a copy of the struct, rather than working with a pointer to it. I have no idea what to think about this.
is it a better approach
In the case you just showed no, not at all.
First of all, in modern C++ you should avoid raw pointers with ownership, which means you shouldn't use new at all. Use one of the smart pointers that fits your needs:
std::unique_ptr for sole ownership.
std::shared_ptr when multiple objects share ownership of the same resource.
I can't give you exact performance numbers, but direct access through a member s stored by value will never be slower than access through a member s that first has to be dereferenced. You should always go for the non-pointer way here.
But take another step back. You don't even need pointers here in the first place. s should just be an object like in your 2nd example, and the pointer in B's constructor should be replaced with a reference.
I have to make a copy of the struct, rather than working with a pointer to it.
No, no copy will be made.
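A minimal sketch of that suggestion, reusing the names from the question (everything else here is assumed): s is a plain value member and B takes A by reference, so nothing gets copied and there is no pointer to chase.
struct S { int x[3] {1, 2, 3}; };

class A {
public:
    S s;                       // value member: no new, no extra indirection
};

class B {
public:
    B(A& a) { a.s.x[1] = 4; }  // modifies a's S in place; no copy of S is made
};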
The real cost of using pointers to objects in many iterations is not necessarily the dereferencing of the pointer itself, but the potential cost of loading another cache line into the CPU cache. As long as the pointer points to something within a currently loaded cache line, the cost is minimal.
Always avoid dynamic allocation with new wherever possible, as it is potentially a very expensive operation, and requires an indirection operation to access the thing you allocated. If you do use it, you should also be using smart pointers, but in your case there is absolutely no reason to do so - just have an instance of S (a value, not a pointer) inside your class.
If you consider a->s->x[1] = 4 as ugly, then it is rather because of the chain than because of the arrows, and a->s.x[1] = 4 is ugly to the same extent. In my opinion, the code exposes S more than necessary, though there may sometimes exist good reasons for doing so.
Performance is one thing that matters; others are maintainability and adaptability. A chain of member accesses usually supports the principle of information hiding to a lesser extent than designs where such chains are avoided; the involved objects (and therefore the involved code) are more tightly coupled than otherwise, and this usually comes at the cost of maintainability (see, for example, the Law of Demeter as a design principle towards better information hiding):
In particular, an object should avoid invoking methods of a member object returned by another method. For many modern object oriented languages that use a dot as field identifier, the law can be stated simply as "use only one dot". That is, the code a.b.Method() breaks the law where a.Method() does not. As an analogy, when one wants a dog to walk, one does not command the dog's legs to walk directly; instead one commands the dog which then commands its own legs.
If, for example, you change the size of array x from 3 to 2, then you have to review not only the code of class A, but potentially that of any other class in your program.
However, if we avoid exposing too much of component S, class A could be extended by a member function int setSAt(int i, int value), which can then also check, for example, array boundaries; changing S then influences only those classes that have S as a component:
B(A *a) { a->setSAt(1,4); }
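One possible shape for that member function (the bounds check and the return value are assumptions; the answer does not spell them out):
class A {
    S s;   // S is now a private implementation detail of A
public:
    int setSAt(int i, int value) {
        if (i < 0 || i >= 3)   // boundary check lives in exactly one place
            return -1;
        s.x[i] = value;
        return value;
    }
};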

How to achieve cache coherency with an abstract class pointer vector in C++?

I'm making a little game in C++. I found answers on StackExchange sites about cache coherency, and I would like to use it in my game, but I'm using child classes of an abstract class, Entity.
I'm storing all entities in a std::vector so that I can access virtual functions in loops. Entity::update() is a virtual function of Entity overridden by subclasses like PlayerEntity.
In Game.hpp - Private Member Variables:
std::vector<Entity*> mEntities;
PlayerEntity* mPlayer;
In Game.cpp - Constructor:
mPlayer = new PlayerEntity();
mEntities.push_back(mPlayer);
Here's what my update function (in the main loop) looks like:
void Game::update() {
    for (Entity* entity : mEntities) {
        entity->update(mTimeStep, mGeneralClock.getElapsedTime().asMilliseconds());
    }
}
My question is:
How do I make my entity objects end up next to each other in memory, and thus achieve cache coherency?
I tried to simply make the vector of pointers a vector of objects and make the appropriate changes, but then I couldn't use polymorphism for obvious reasons.
Side question: what determines where an object is allocated in memory?
Am I doing the whole thing wrong? If so, how should I store my entities?
Note: I'm sorry if my English is bad, I'm not a native speaker.
Obviously, first measure which parts are even worth optimizing. Not all games are created equal, and not all code within a game is created equal. There is no use in completely restructuring the script that triggers the end boss's death animation to make it use 1 cache line instead of 2. That said...
If you are aiming for optimizing for cache, forget about inheritance and virtual functions. Or at least be critical of them. As you note, creating a contiguous array of polymorphic objects is somewhere between hard & error-prone and completely infeasible (depending on whether subclasses have different sizes).
You can attempt to create a pool, to have nearby entities (in the entities vector) more likely to be close to each other (in memory), but frankly I doubt you'll do much better than a state of the art general-purpose allocator, especially when the entities' size and lifetime varies significantly. A pool would only help if entities adjacent in the vector are allocated back-to-back. But in that case, any standard allocator gives the same locality advantages. It's not like tcmalloc and friends select a random cache line to allocate from just to annoy you.
You might be able to squeeze a bit of memory out of knowing your object types, but this is purely hypothetical and would have to be proven first to justify the effort of implementing it. Also note that a run-of-the-mill pool either assumes that all objects are the same size, or that you never deallocate individual objects. Allowing both puts you halfway towards writing a general-purpose allocator, which you're bound to do worse than.
You can segregate objects based on their types. That is, instead of a single vector of polymorphic Entity objects with virtual functions, have N vectors: vector<Bullet>, vector<Monster>, vector<Loot>, and so on. This is less insane than it sounds, for three reasons:
Often, you can pull the entire business of managing one such vector out into a dedicated system. So in the end you might even have a vector<System *> where each System has a vector for one kind of thing, and updates all those things in a single virtual call (delegating to many statically-dispatched calls); see the sketch after this list.
You don't need to represent everything ever in this abstraction. Not every little integer needs to be wrapped in its own type of entity.
If you go further down this route and take hints from entity component systems, you also gain an alternative to inheritance for code reuse (class Monster : Entity {}; class Skeleton : Monster {};) that plays nicer with the hard-earned cache friendliness.
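Here is a rough sketch of the "one contiguous vector per concrete type, one System per vector" idea; every name in it (Bullet, Monster, TypedSystem, and so on) is illustrative rather than taken from the question:
#include <memory>
#include <vector>

struct Bullet  { void update(float dt) { /* move the bullet */ } };
struct Monster { void update(float dt) { /* run the AI */ } };

struct System {
    virtual ~System() = default;
    virtual void updateAll(float dt) = 0;
};

template <class T>
struct TypedSystem : System {
    std::vector<T> items;                  // contiguous storage, cache friendly
    void updateAll(float dt) override {
        for (T& t : items) t.update(dt);   // static dispatch, inlinable
    }
};

// One virtual call per system per frame, instead of one per entity.
std::vector<std::unique_ptr<System>> systems;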
It is not easy because polymorphism doesn't work well with cache coherency.
I think the best you can do is overload the base class's operator new to allocate memory from a pool. But to do this, you need to know the sizes of all derived classes, and after some allocating/deallocating you can get memory fragmentation, which will lower the gain.
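To make that concrete, here is a deliberately simplified sketch of overloading operator new on the base class. The pool size and all names are made up, and freed slots are never reused, which is exactly where the fragmentation problem mentioned above would appear in a real pool:
#include <cstddef>
#include <new>

namespace pool {
    alignas(std::max_align_t) unsigned char buffer[64 * 1024];
    std::size_t offset = 0;

    // Bump allocator: hand out consecutive slots so objects created back to
    // back also end up back to back in memory.
    void* allocate(std::size_t n) {
        n = (n + alignof(std::max_align_t) - 1) & ~(alignof(std::max_align_t) - 1);
        if (offset + n > sizeof(buffer)) throw std::bad_alloc{};
        void* p = buffer + offset;
        offset += n;
        return p;
    }
}

class Entity {
public:
    virtual ~Entity() = default;
    void* operator new(std::size_t n) { return pool::allocate(n); }
    void operator delete(void*) noexcept { /* never reused in this sketch */ }
};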
Have a look at Cachegrind; it's a tool that simulates how your program interacts with a machine's cache hierarchy.

virtual member functions are good or bad for locality in modern CPUs?

Considering the new CPUs with new instructions for moving data and new memory controllers: if in C++ I have a vector of Derived objects, where Derived has virtual member functions, is this a good or a bad thing for locality?
And what if I have a vector of pointers to the base class, Base*, where I store pointers to derived objects that are 1-2-3 levels up from Base?
Basically dynamic dispatch applies to both cases, but which one is better for caching and memory access?
I have a preference between these two, but I would like to see a complete answer on the subject.
Is there something new to consider as ground-breaking from the hardware industry in the last 2-3 years?
Storing Derived rather than Base * in a vector is better because it eliminates one extra level of indirection and you have all objects laid out «together» in contiguous memory, which in turn makes life easier for the hardware prefetcher and helps with paging, TLB misses, etc. However, if you do this, make sure you don't introduce a slicing problem.
As for the virtual dispatch in this case, it almost does not matter, with the exception of the adjustment required for the «this» pointer. For example, if Derived overrides a virtual function that you are calling and you already have a Derived *, then no «this» adjustment is required; otherwise it has to be adjusted to one of the base classes' «this» values (this also depends on the size of the classes in the inheritance hierarchy).
As long as all objects in the vector have the same overrides, the CPU will be able to predict what's going on. However, if you have a mix of different implementations, the CPU has no clue which function will be called for the next object, and that can cause performance issues.
And don't forget to always profile before and after you make changes.
Modern CPUs know how to optimise data-dependent jump instructions, much as they do data-dependent "branch" instructions: the processor will "learn" that "last time I went through here, I went THIS way", and if it has enough confidence (it has gone through several times with the same result) it will keep going that way.
Of course that doesn't help if the instances are a completely random selection of different classes that each have their own virtual function.
Cache-locality is of course a slightly different matter, and it really depends on whether you are storing the object instances or the pointers/references to instances in the vector.
And of course, an important factor is "what is the alternative?". If you are using virtual functions "correctly", it means that there is (at least) one less conditional check in a code path, because the decision was taken at a much earlier stage. If you solve it by some other method, that condition will carry (assuming the probabilities correspond) the same branch probability as the original decision, which will be at least as bad for performance as virtual functions; chances are it's worse, because we now have an if (x) foo(); else bar(); type scenario, so we first have to evaluate x and then choose the path. obj->vfunc() will just be unpredictable, because fetching from the vtable gives an unpredictable result - but at least the vtable itself is cached.

What's the "price" of using inheritance in an embedded environment with C++?

I'm starting a new embedded project with C++ and I was wondering if it is too much expensive to use a interface oriented design. Something like this:
#include <cstdio>

typedef int data;

class data_provider {
public:
    virtual data get_data() = 0;
};

class specific_data_provider : public data_provider {
public:
    data get_data() override {
        return 7;
    }
};

class my_device {
public:
    data_provider * dp;
    data d;

    my_device (data_provider * adp) {
        dp = adp;
        d = 0;
    }

    void update() {
        d = dp->get_data();
    }
};

int main() {
    specific_data_provider sdp;
    my_device dev(&sdp);

    dev.update();

    printf("d = %d\n", dev.d);

    return 0;
}
Inheritance, on its own, is free. For example, below, B and C are the same from a performance/memory point of view:
struct A { int x; };
struct B : A { int y; };
struct C { int x, y; };
Inheritance only incurs a cost when you have virtual functions.
struct A { virtual ~A(); };
struct B : A { ... };
Here, on virtually all implementations, both A and B will be one pointer size larger due to the virtual function.
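If you want to see that on your own toolchain, a quick illustrative check (names made up) is:
#include <cstdio>

struct A  { int x; };
struct B  : A { int y; };
struct C  { int x, y; };
struct VA { virtual ~VA() {} int x; };   // same data as A, plus a vtable pointer

int main() {
    // B and C typically have the same size; VA pays for a vtable pointer (plus padding).
    std::printf("A: %zu  B: %zu  C: %zu  VA: %zu\n",
                sizeof(A), sizeof(B), sizeof(C), sizeof(VA));
}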
Virtual functions also have other drawbacks (when compared with non-virtual functions)
Virtual functions require that you look up the vtable when called. If that vtable is not in the cache then you will get an L2 miss, which can be incredibly expensive on embedded platforms (over 600 cycles on current gen game consoles for example).
Even if you hit the L2 cache, if you branch to many different implementations then you will likely get a branch misprediction on most calls, causing a pipeline flush, which again costs many cycles.
You also miss out on many optimisation opportunities due to virtual functions being essentially impossible to inline (except in rare cases). If the function you call is small then this could add a serious performance penalty compared to an inlined non-virtual function.
Virtual calls can contribute to code bloat. Every virtual function call adds several bytes' worth of instructions to look up the vtable, and many bytes for the vtable itself.
If you use multiple inheritance then things get worse.
Often people will tell you "don't worry about performance until your profiler tells you to", but this is terrible advice if performance is at all important to you. If you don't worry about performance then what happens is that you end up with virtual functions everywhere, and when you run the profiler, there is no one hotspot that needs optimising -- the whole code base needs optimising.
My advice would be to design for performance if it is important to you. Design to avoid the need for virtual functions if at all possible. Design your data around the cache: prefer arrays to node-based data structures like std::list and std::map. Even if you have a container of a few thousand elements with frequent insertions into the middle, I would still go for an array on certain architectures. The several thousand cycles you lose copying data for the insertions may well be offset by the cache locality you will achieve on each traversal. (Remember the cost of a single L2 cache miss? You can expect a lot of those when traversing a linked list.)
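As a toy illustration of that advice (the Sample type and both containers are made up; only the container choice matters here):
#include <list>
#include <vector>

struct Sample { int value = 0; void update() { ++value; } };

std::vector<Sample> dense;     // elements stored back to back: traversal is
                               // sequential and prefetcher friendly
std::list<Sample> scattered;   // every node allocated separately: traversal
                               // chases pointers and invites cache misses

void tick() {
    for (Sample& s : dense)     s.update();
    for (Sample& s : scattered) s.update();
}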
Inheritance is basically free. However, polymorphism and dynamic dispatch (virtual) have some consequences: each instance of a class with a virtual method contains a pointer to the vtable, which is used to select the right method to call. This adds two memory accesses for each virtual method call.
In most cases it won't be a problem, but it can become a bottleneck in some real time applications.
Really depends on your hardware. Inheritance per se probably doesn't cost you anything. Virtual methods cost you some amount of memory for the vtable of each class. Turning on exception handling probably costs you even more in both memory and performance. I have used all the features of C++ extensively on the NetBurner platform with chips like the MOD5272, which have a couple of Megs of Flash and 8 Megs of RAM. Also, some things may be implementation dependent: on the GCC toolchain I use, when cout gets used instead of printf you take a big memory hit (it appears to link in a bunch of libraries). You might be interested in a blog post I wrote on the cost of type safe code. You would have to run similar tests on your environment to truly answer your question.
The usual advice is to make the code clear and correct, and then think about optimisations only if it proves to be a problem (too slow or too much memory) in practice.

Virtual Function Compared to Pointer Casting

The current version of some code I'm using utilises a slightly odd way of achieving something which I think could be achieved with polymorphism. More concretely, we currently use something like
for (int i = 0; i < CObjList.size(); ++i)
{
    CObj* W = CObjList[i];
    if (W->type == someTypeA)
    {
        // do some things which also involve casts such as
        // ((SomeClassA*) W->objectptr)->someFieldA
    }
    else if (W->type == someTypeB)
    {
        // do some things which also involve casting such as
        // ((SomeClassB*) W->objectptr)->someFieldB
    }
}
To clarify: each object W contains a void *objectptr; that is to say, a pointer to an arbitrary location. The field W->type keeps track of what type of object objectptr points at, so that inside our if/else statements we can cast W->objectptr to the correct type and use its fields.
However, this seems inherently bad from a code design standpoint for several reasons:
We have no guarantee that the object pointed to by W->objectptr actually matches what is said in W->type so the cast is inherently unsafe.
Every time we wish to add another type, we must add another else if branch and ensure W->type is set correctly.
It seems to be this would be much better solved with something like
class CObj
{
public:
    virtual void doSomething(/* some params */) = 0;
};

class SomeClassA : public CObj
{
public:
    virtual void doSomething(/* some params */);
    int someFieldA;
};

class SomeClassB : public CObj
{
public:
    virtual void doSomething(/* some params */);
    int someFieldB;
};

// sometime later...

for (int i = 0; i < CObjList.size(); ++i)
{
    CObj* W = CObjList[i];
    W->doSomething(/* some params */);
}
This having been said, there is the proviso that in this setting performance is important. This code will be called from a (relatively) tight loop.
My question is then: is the added complexity of a few vtable lookups outweighed by the improved code design and extensibility, and is this likely to affect performance a lot?
EDIT: It occurs to me that accessing the fields through a pointer in this way could be as bad as vtable lookups anyway due to cache misses etc. Any thoughts on this?
---- EDIT 2: Also I forgot to mention (and I know it's a bit off the original topic), inside the if statements are many calls to member functions of the surrounding class. How would you design the structure so as to be able to call these from inside doSomething()?
I'm going to answer specifically on the performance angle, because I work in a perf-critical environment and a while ago I happened to run measurements on a similar case to work out the fastest solution.
If you are on an x86, PPC, or ARM processor, you want virtual functions in this situation. The performance cost of calling a virtual function is mostly the pipeline bubble induced by mispredicting an indirect branch. Because the instruction fetch stage of the CPU can't know where the computed jmp goes, it can't start fetching bytes from the target address until the branch executes, and thus you have a stall in the pipeline corresponding to the number of stages between the first fetch stage and the branch retire. (On the PPC I know best, that's something like 25 cycles.)
You also have the latency of loading the vtable pointer, but this is often hidden by instruction reordering (the compiler moves the load instruction so it starts several cycles before you actually need the result and the CPU does other work while the data cache sends you its electrons.)
With the if-cascade approach you instead have some number n of direct, conditional branches — where the target is known at compile time, but whether the jump is taken is determined at runtime. (ie, a jump-on-equal opcode.) In this case the CPU will make a guess (predict) at whether each branch is taken or not, and start fetching instructions accordingly. So, you will only have a bubble if the CPU guesses wrong. Since you are presumably calling this function with different input each time, it's going to mispredict at least one of these branches, and you'll have the exact same bubble that you would with virtuals. In fact, you'll have a whole lot more bubbles — one per if() conditional!
With virtual functions, there's also the risk of an additional data cache miss on loading the vtable, and an icache miss on the jump target. If this function is in a tight loop, then presumably you'll be looking up and calling the same subroutines a lot, and thus the vtable and function bodies will probably still be in cache. You could measure that if you wanted to be really sure.
Use virtual functions, this hypothetical optimization means nothing. What matters is code readability, maintainability and quality.
Optimize later with the aid of a profiler if you really need to tune hot spots. Making your code unmaintainable with that kind of crap is a road to failure.
Also, virtual functions will help you do unit tests, mock interfaces, etc.
Programming is about managing complexity....
My question is then: is the added complexity of a few vtable lookups outweighed by the improved code design and extensibility, and is this likely to affect performance a lot?
C++ compilers should be able to implement virtual functions very efficiently, so I don't think there's a downside in using them. (And certainly a huge maintainability/readability benefit!) But you should measure to make sure.
The way they are typically implemented is that each object has a vtable pointer. (multiple pointers in case of multiple inheritance, but let's forget that for now) This has the following relative costs over non-virtual functions.
data space: one pointer per object
data space: one vtable per class (not per object!)
time: worst case = two memory reads per function call (one to get the vtable address, one to get the function address within the vtable). The offset into the vtable is known at compile time, because you know which function you're calling. There are no extra jumps.
Compare this with the costs of the non-OOP approach your existing software has.
data space: one type ID per object
code space: one if/else tree or switch statement each time you wish to call a function dependent on the object type
time: having to evaluate the if/else tree or switch statement.
I'd vote for the virtual function approach as actually being faster than the non-OOP approach, because it eliminates the need to take the time and figure out what type of object it is.
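To make the comparison concrete, here is roughly what the two dispatch strategies boil down to; this is a hand-written approximation for illustration, not what any particular compiler emits:
// Virtual dispatch, approximately: two dependent loads, then an indirect call.
struct VTable { void (*doSomething)(void* self /*, some params */); };
struct Obj    { const VTable* vptr; /* object data */ };

void call_virtual(Obj* w) {
    w->vptr->doSomething(w);              // load vptr, load the slot, call
}

// Type-switch dispatch, approximately: a compare and branch per supported type.
enum Type { someTypeA, someTypeB };
struct TaggedObj { Type type; void* objectptr; };

void call_switch(TaggedObj* w) {
    if (w->type == someTypeA)      { /* cast objectptr and use SomeClassA fields */ }
    else if (w->type == someTypeB) { /* cast objectptr and use SomeClassB fields */ }
}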
I had some experience with some largish (1M+ line I think) scientific computation code that was using a similar type based switch construct. They refactored to a properly polymorphic based approach and got a significant speedup. Exactly the opposite of what they expected!
Turned out the compiler was better able to optimise some things in that structure.
However this was a long time ago (8 years or so) .. so who knows what modern compilers will do. Don't guess - profile it.
As piotr says the right answer is probably virtual functions. You'll have to test.
But to address your concern about the casts:
Never use C-style casts in a C++ program; use static_cast<>, dynamic_cast<>, etc.
In your specific case, use dynamic_cast<>. At least then you will get a null pointer (or, for reference casts, an exception) if the types are not properly related, which is better than a wild crash.
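For example, assuming the redesigned hierarchy above where SomeClassA and SomeClassB derive from CObj (with the original void* scheme, dynamic_cast cannot help you):
for (int i = 0; i < CObjList.size(); ++i)
{
    CObj* W = CObjList[i];
    if (SomeClassA* a = dynamic_cast<SomeClassA*>(W))
    {
        // safe to use a->someFieldA here
    }
    else if (SomeClassB* b = dynamic_cast<SomeClassB*>(W))
    {
        // safe to use b->someFieldB here
    }
}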
CRTP would be a great idea for this kind of case.
Edit: In your case,
template<class T>
class CObj
{
public:
    void doSomething(/* some params */)
    {
        static_cast<T*>(this)->doSomething(/* some params */);
    }
};

class SomeClassA : public CObj<SomeClassA>
{
public:
    void doSomething(/* some params */);
    int someFieldA;
};

class SomeClassB : public CObj<SomeClassB>
{
public:
    void doSomething(/* some params */);
    int someFieldB;
};
Now you may have to structure your loop code differently to accommodate objects of the different CObj<T> types.
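One hedged sketch of what that restructuring could look like: keep a separate vector per concrete type and a small helper template (process_all, aObjects and bObjects are invented names), since CObj<SomeClassA> and CObj<SomeClassB> no longer share a common base to iterate over.
#include <vector>

template <class T>
void process_all(std::vector<T>& objects /*, some params */)
{
    for (T& obj : objects)
        obj.doSomething(/* some params */);   // resolved statically, can be inlined
}

std::vector<SomeClassA> aObjects;
std::vector<SomeClassB> bObjects;

void processEverything()
{
    process_all(aObjects);
    process_all(bObjects);
}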