Can you cache a virtual function lookup in C++? - c++

Say I have a virtual function call foo() on an abstract base class pointer, mypointer->foo(). When my app starts up, based on the contents of a file, it chooses to instantiate a particular concrete class and assigns mypointer to that instance. For the rest of the app's life, mypointer will always point to objects of that concrete type. I have no way to know what this concrete type is (it may be instantiated by a factory in a dynamically loaded library). I only know that the type will stay the same after the first time an instance of the concrete type is made. The pointer may not always point to the same object, but the object will always be of the same concrete type. Notice that the type is technically determined at 'runtime' because it's based on the contents of a file, but that after 'startup' (file is loaded) the type is fixed.
However, in C++ I pay the virtual function lookup cost every time foo is called for the entire duration of the app. The compiler can't optimize the lookup away because there's no way for it to know that the concrete type won't vary at runtime (even if it was the most amazing compiler ever, it can't speculate on the behavior of dynamically loaded libraries). In a JIT compiled language like Java or .NET the JIT can detect that the same type is being used over and over and do inline caching. I'm basically looking for a way to manually do that for specific pointers in C++.
Is there any way in C++ to cache this lookup? I realize that solutions might be pretty hackish. I'm willing to accept ABI/compiler specific hacks if it's possible to write configure tests that discover the relevant aspects of the ABI/compiler so that it's "practically portable" even if not truly portable.
Update: To the naysayers: If this wasn't worth optimizing, then I doubt modern JITs would do it. Do you think Sun and MS's engineers were wasting their time implementing inline caching, and didn't benchmark it to ensure there was an improvement?

There are two costs to a virtual function call: The vtable lookup and the function call.
The vtable lookup is already taken care of by the hardware. Modern CPUs (assuming you're not working on a very simple embedded CPU) will predict the address of the virtual function in their branch predictor and speculatively execute it in parallel with the array lookup. The fact that the vtable lookup happens in parallel with the speculative execution of the function means that, when executed in a loop in the situations you describe, virtual function calls have next to zero overhead compared to direct, non-inlined function calls.
I've actually tested this in the past, albeit in the D programming language, not C++. When inlining was disabled in the compiler settings and I called the same function in a loop several million times, the timings were within epsilon of each other whether the function was virtual or not.
The second and more important cost of virtual functions is that they prevent inlining of the function in most cases. This is even more important than it sounds because inlining is an optimization that can enable several other optimizations such as constant folding in some cases. There's no way to inline a function without recompiling the code. JITs get around this because they're constantly recompiling code during the execution of your application.
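A rough C++ equivalent of the kind of loop test described above (a sketch only; the types, counts, and flags are illustrative, and the comparison is only meaningful with inlining disabled, e.g. -O2 -fno-inline):
#include <chrono>
#include <cstdio>

struct Base    { virtual int bump(int n) { return n + 1; } virtual ~Base() {} };
struct Derived : Base { int bump(int n) override { return n + 2; } };
struct Plain   { int bump(int n) { return n + 1; } };

int main(int argc, char**) {
    // Pick the dynamic type at runtime so the compiler can't trivially devirtualize.
    Base* pb = (argc > 1) ? static_cast<Base*>(new Derived) : new Base;
    Plain pl;
    int acc = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) acc = pb->bump(acc);   // virtual call
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) acc = pl.bump(acc);    // direct call
    auto t2 = std::chrono::steady_clock::now();

    std::printf("virtual %lld ms, direct %lld ms (acc=%d)\n",
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count(),
        acc);
    delete pb;
}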

Why is a virtual call expensive? Because you simply don't know the branch target until the code executes at runtime. Modern CPUs handle virtual and indirect calls fairly well, but one can't simply say they cost nothing just because we have faster CPUs. No, they do not.
1. How can we make it fast?
You already have a pretty deep understanding of the problem. All I can say is that if the virtual function call is easy to predict, you can perform software-level optimization. If it's not (i.e., you really have no idea what the target of the virtual function will be), then I don't think there is a good solution for now. Even for the CPU, it is hard to predict in such an extreme case.
Actually, compilers such as Visual C++ with PGO (profile-guided optimization) have a virtual call speculation optimization (link). If the profiling result can identify hot virtual function targets, the call is translated into a direct call that can be inlined. This is also called devirtualization, and it can also be found in some Java dynamic optimizers.
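For reference, what that speculation looks like when written by hand is a guarded direct call. A minimal sketch, assuming a hypothetical hot type HotImpl (none of these names come from the question):
#include <typeinfo>

struct MyBase  { virtual void foo() = 0; virtual ~MyBase() {} };
struct HotImpl : MyBase { void foo() override { /* hot-path work */ } };

void call_foo(MyBase* p) {
    // Guard: if the dynamic type matches the expected hot type, make a
    // non-virtual (and therefore inlinable) call; otherwise fall back to
    // ordinary virtual dispatch.
    if (typeid(*p) == typeid(HotImpl))
        static_cast<HotImpl*>(p)->HotImpl::foo();
    else
        p->foo();
}

int main() { HotImpl h; call_foo(&h); }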
2. To those who say it's not necessary
If you're using scripting languages or C# and are mainly concerned with coding efficiency, yes, this is worthless. However, for anyone eager to save a single cycle for better performance, the indirect branch is still an important problem. Even the latest CPUs are not great at handling virtual calls. One good example is a virtual machine or interpreter, which usually has a very large switch-case; its performance is closely tied to the correct prediction of indirect branches. So, you can't simply say it's too low-level or not necessary. There are hundreds of people trying to improve performance at the bottom; that's why you can afford to ignore such details :)
3. Some boring computer architectural facts related to virtual functions
dsimcha has written a good answer on how the CPU can handle virtual calls effectively, but it's not exactly correct. First, all modern CPUs have a branch predictor, which literally predicts the outcome of a branch to increase pipeline throughput (or, more parallelism at the instruction level: ILP. I would even say that single-threaded CPU performance depends entirely on how much ILP you can extract from a single thread, and branch prediction is the most critical factor in obtaining higher ILP).
Branch prediction involves two predictions: (1) direction (i.e., is the branch taken or not? a binary answer), and (2) branch target (i.e., where will it go? not a binary answer). Based on the prediction, the CPU speculatively executes the code. If the speculation is wrong, the CPU rolls back and restarts from the mispredicted branch. This is completely hidden from the programmer's view, so you don't really know what's going on inside the CPU unless you're profiling with a tool like VTune, which reports branch misprediction rates.
In general, branch direction prediction is highly accurate (95%+), but it is still hard to predict branch targets, especially for virtual calls and switch-cases (i.e., jump tables). A virtual call is an indirect branch, which requires an extra memory load, and the CPU also needs branch target prediction. Modern CPUs like Intel's Nehalem and AMD's Phenom have a specialized indirect branch target table.
However, I don't think the vtable lookup itself incurs much overhead. Yes, it requires an extra memory load, which can miss the cache, but once the vtable is in cache it's pretty much always a hit. If you're concerned about that cost, you could add prefetching code to load the vtable in advance. The real difficulty of a virtual function call is that the CPU can't do a great job of predicting its target, which can frequently drain the pipeline due to target mispredictions.

So assuming that this is a fundamental issue you want to solve (to avoid premature optimization arguments), and ignoring platform and compiler specific hackery, you can do one of two things, at opposite ends of complexity:
Provide a function as part of the .dll that internally simply calls the right member function directly. You pay the cost of an indirect jump, but at least you don't pay the cost of a vtable lookup. Your mileage may vary, but on certain platforms, you can optimize the indirect function call.
Restructure your application such that instead of calling a member function per instance, you call a single function that takes a collection of instances. Mike Acton has a wonderful post (with a particular platform and application type bent) on why and how you should do this.
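A sketch of that second option, under the question's assumption that every object in a batch has the same concrete type (the class names here are illustrative):
#include <vector>

struct MyBase {
    virtual void foo() = 0;
    // One virtual call per batch instead of one per object.
    virtual void foo_all(std::vector<MyBase*>& objs) {
        for (MyBase* o : objs) o->foo();            // generic fallback
    }
    virtual ~MyBase() {}
};

struct Concrete : MyBase {
    void foo() override { /* per-object work */ }
    void foo_all(std::vector<MyBase*>& objs) override {
        // Inside the override the concrete type is known, so the per-object
        // call is non-virtual and can be inlined.
        for (MyBase* o : objs) static_cast<Concrete*>(o)->Concrete::foo();
    }
};

int main() {
    Concrete c1, c2;
    std::vector<MyBase*> batch{&c1, &c2};
    c1.foo_all(batch);   // one dispatch covers the whole collection
}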

All answers are dealing with the most simple scenario, where calling a virtual method only requires getting the address of the actual method to call. In the general case, when multiple and virtual inheritance come into play, calling a virtual method requires shifting the this pointer.
The method dispatch mechanism can be implemented in more than one way, but it is common to find that the entry in the virtual table is not the actual method to call, but rather some intermediate 'trampoline' code inserted by the compiler that relocates the this pointer prior to calling the actual method.
When the dispatch is the simplest, just an extra pointer redirection, then trying to optimize it does not make sense. When the problem is more complex, then any solution will be compiler-dependent and hackish. Moreover, you do not even know what scenario you are in: if the objects are loaded from dlls then you don't really know whether the actual instance returned belongs to a simple linear inheritance hierarchy or to a more complex scenario.
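For illustration, here is the kind of this-pointer adjustment being described; the exact offsets are ABI-specific, but on typical implementations the B subobject does not sit at the start of C:
#include <cstdio>

struct A { virtual void fa() {} int a; virtual ~A() {} };
struct B { virtual void fb() {} int b; virtual ~B() {} };
struct C : A, B { void fb() override {} };

int main() {
    C c;
    // The same object viewed through different bases has different addresses,
    // so dispatching C::fb through a B* must shift `this` before the call.
    std::printf("C*: %p  A*: %p  B*: %p\n",
                static_cast<void*>(&c),
                static_cast<void*>(static_cast<A*>(&c)),
                static_cast<void*>(static_cast<B*>(&c)));  // B* is usually offset
}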

I have seen situations where avoiding a virtual function call is beneficial. This does not look to me to be one of those cases because you really are using the function polymorphically. You are just chasing one extra address indirection, not a huge hit, and one that might be partially optimized away in some situations. If it really does matter, you may want to restructure your code so that type-dependent choices such as virtual function calls are made fewer times, pulled outside of loops.
If you really think it's worth giving it a shot, you can set a separate function pointer to a non-virtual function specific to the class. I might (but probably wouldn't) consider doing it this way.
class MyConcrete : public MyBase
{
public:
    static void foo_nonvirtual(MyBase* obj);
    virtual void foo()
    { foo_nonvirtual(this); }
};

void (*f_ptr)(MyBase* obj) = &MyConcrete::foo_nonvirtual;
// Call f_ptr instead of obj->foo() in your code.
// Still not as good a solution as restructuring the algorithm.
Other than making the algorithm itself a bit wiser, I suspect any attempt to manually optimize the virtual function call will cause more problems than it solves.

You can't use a method pointer because pointers to member functions aren't considered covariant return types. See the example below:
#include <iostream>

struct base;
struct der;

typedef void(base::*pt2base)();
typedef void(der::*pt2der)();

struct base {
    virtual pt2base method() = 0;
    virtual void testmethod() = 0;
    virtual ~base() {}
};

struct der : base {
    void testmethod() {
        std::cout << "Hello from der" << std::endl;
    }
    pt2der method() { // this is invalid because pt2der isn't a covariant of pt2base
        return &der::testmethod;
    }
};
The other option would be to declare the method as pt2base method(), but then the return would be invalid because der::testmethod is not of type pt2base.
Also, even if you had a method that received a pointer or reference to the base type, you would have to dynamically cast it to the derived type in that method to do anything particularly polymorphic, which adds back in the cost we're trying to save.

So, what you basically want to do is convert runtime polymorphism into compile time polymorphism. Now you still need to build your app so that it can handle multiple "cases", but once it's decided which case is applicable to a run, that's it for the duration.
Here's a model of the runtime polymorphism case:
struct Base {
    virtual void doit(int&)=0;
};

struct Foo : public Base {
    virtual void doit(int& n) {--n;}
};

struct Bar : public Base {
    virtual void doit(int& n) {++n;}
};

void work(Base* it,int& n) {
    for (unsigned int i=0;i<4000000000u;i++) it->doit(n);
}

int main(int argc,char**) {
    int n=0;
    if (argc>1)
        work(new Foo,n);
    else
        work(new Bar,n);
    return n;
}
This takes ~14s to execute on my Core2, compiled with gcc 4.3.2 (32 bit Debian), -O3 option.
Now suppose we replace the "work" version with a templated version (templated on the concrete type it's going to be working on):
template <typename T> void work(T* it,int& n) {
    for (unsigned int i=0;i<4000000000u;i++) it->T::doit(n);
}
main doesn't actually need to be updated, but note that the 2 calls to work now trigger instantiations of, and calls to, two different type-specific functions (cf. the one polymorphic function previously).
Hey presto: it runs in 0.001s. Not a bad speedup factor for a 2-line change! However, note that the massive speedup is entirely due to the compiler: once the possibility of runtime polymorphism in the work function is eliminated, it just optimizes the loop away and compiles the result directly into the code. But that actually makes an important point: in my experience, the main gains from this sort of trick come from the opportunities for improved inlining and optimisation it gives the compiler when a less polymorphic, more specific function is generated, not from the mere removal of the vtable indirection (which really is very cheap).
But I really don't recommend doing stuff like this unless profiling absolutely indicates runtime polymorphism is really hitting your performance. It'll also bite you as soon as someone subclasses Foo or Bar and tries to pass that into a function actually intended for its base.
You might find this related question interesting too.

I asked a very similar question recently, and got the answer that it's possible as a GCC extension, but not portably:
C++: Pointer to monomorphic version of virtual member function?
In particular, I also tried it with Clang and it doesn't support this extension (even though it supports many other GCC extensions).

Could you use a method pointer?
The objective here is that the compiler would load the pointer with the location of the resolved method or function. This would occur once. After the assignment, the code would access the method in a more direct fashion.
I know that a pointer to an object, and accessing the method via that object pointer, invokes run-time polymorphism. However, there should be a way to load a method pointer with the address of the resolved method, avoiding the polymorphism and directly calling the function.
I've checked the community wiki box to invite more discussion.

Related

C++ Is there any difference in performance by calling a func with code instead of calling the code directly? (From python)

I am from Python and still new at C++.
Now I wonder if calling a function is slower in performance than writing the code of the function itself inline?
Some example.
struct mynum {
public:
    int m_value = 0;

    constexpr int value() { return m_value; }

    // Say we would create a func here
    // that wants to use the value of "m_value".
    // Is it slower to use "value()" instead of "m_value"?
    // Even if the difference is very small.
    // Or is there indeed no difference because everything gets compiled?
    void somefunc() {
        if(value() == 0) {}
    }
};
If the function body is available at the time it is called, there is a good chance the compiler will try to either automatically inline it (the "inline" keyword is just a hint) or leave it as an out-of-line call. In both cases you are probably on the best path, as compilers are pretty good at this kind of decision - usually better than we are.
If only the function prototype (the declaration) is known by the compiler and the body is defined in another compilation unit (*.cpp file) then there are a couple of hits you might take:
The processor pipeline (and speculative execution) might stall, which may cost you a few cycles, although processors have become extremely efficient at these things in the past 10 years or so. Even dynamic branch prediction has become so good that there is no point rearranging the order of if/else like we used to do 20 years ago (it is still necessary on simpler microprocessors, though).
Register allocation gets a clean cut at the call boundary, which primarily affects intensive calculations. Basically, the compiler decides which registers the variables in use will reside in. When you make a call, only a couple of them are guaranteed to be preserved; all the others will need to be reloaded when the function returns. If the number of live variables is large, that load/unload can affect performance, but that is really rare.
If the function is a virtual method, the indirect lookup through the virtual table might add up to ten cycles. A compiler might devirtualize a call if it knows exactly which class will be called, however, so this cost might actually be the same as a normal function. In more complex cases, with several layers of polymorphism, virtual calls might take up to 20 cycles. In my tests with 2 layers the cost is on average 5-7 cycles on an AMD Zen 3 (Threadripper).
But overall, if the function call is not virtual, the cost will be really negligible. There are programmers who swear by inlining everything, but if my experience is worth noting: I have programmatically generated code 100% inlined and the same code compiled separately, and the performance was largely the same.
There is some function call overhead in C++, but a simple function like this that just returns a known variable will probably be compiled out and replaced with a reference to that variable.

C++: When is method redefinition preferred over virtual method override? [duplicate]

I know that virtual functions have an overhead of dereferencing to call a method. But I guess with modern architectural speed it is almost negligible.
Is there any particular reason why all functions in C++ are not virtual as in Java?
From my knowledge, defining a function virtual in a base class is sufficient/necessary. Now when I write a parent class, I might not know which methods will get overridden. So does that mean that while writing a child class someone might have to edit the parent class? This sounds inconvenient and sometimes not possible.
Update:
Summarizing from Jon Skeet's answer below:
It's a trade-off: explicitly making someone realize that they are inheriting functionality (which has potential risks in itself - check Jon's response) and potentially gaining a little performance, versus less flexibility, more code changes, and a steeper learning curve.
Other reasons from different answers:
Virtual functions usually cannot be inlined, because inlining happens at compile time while the call target is only known at runtime. This has a performance impact when you expect your functions to benefit from inlining.
There might be potentially other reasons, and I would love to know and summarize them.
There are good reasons for controlling which methods are virtual beyond performance. While I don't actually make most of my methods final in Java, I probably should... unless a method is designed to be overridden, it probably shouldn't be virtual IMO.
Designing for inheritance can be tricky - in particular it means you need to document far more about what might call it and what it might call. Imagine if you have two virtual methods, and one calls the other - that must be documented, otherwise someone could override the "called" method with an implementation which calls the "calling" method, unwittingly creating a stack overflow (or infinite loop if there's tail call optimization). At that point you've then got less flexibility in your implementation - you can't switch it round at a later date.
Note that C# is a similar language to Java in various ways, but chose to make methods non-virtual by default. Some other people aren't keen on this, but I certainly welcome it - and I'd actually prefer that classes were uninheritable by default too.
Basically, it comes down to this advice from Josh Bloch: design for inheritance or prohibit it.
One of the main C++ principles is: you only pay for what you use ("zero overhead principle"). If you don't need the dynamic dispatch mechanism, you shouldn't pay for its overhead.
As the author of the base class, you should decide which methods should be allowed to be overridden. If you're writing both, go ahead and refactor what you need. But it works this way, because there has to be a way for the author of the base class to control its use.
But I guess with modern architectural speed it is almost negligible.
This assumption is wrong, and, I guess, the main reason for this decision.
Consider the case of inlining. C++'s std::sort performs much faster than C's otherwise similar qsort in some scenarios because it can inline its comparator argument, while C cannot (due to the use of function pointers). In extreme cases, this can mean performance differences of as much as 700% (Scott Meyers, Effective STL).
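A minimal illustration of that comparator point (not a benchmark; the speedup figure above is Meyers's, not something this snippet demonstrates):
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort only ever sees an opaque function pointer, so every comparison is an
// indirect call it cannot inline.
static int cmp_int(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

int main() {
    std::vector<int> v{3, 1, 2};
    // The lambda's type is part of std::sort's instantiation, so the compare
    // can be inlined straight into the sorting loop.
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });

    int raw[] = {3, 1, 2};
    std::qsort(raw, 3, sizeof(int), cmp_int);
}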
The same would be true for virtual functions. We’ve had similar discussions before; for instance, Is there any reason to use C++ instead of C, Perl, Python, etc?
Most answers deal with the overhead of virtual functions, but there are other reasons not to make every function in a class virtual, such as the fact that it will change the class from standard-layout to, well, non-standard-layout, and that can be a problem if you need to serialize binary data. That is solved differently in C#, for example, by making structs a different family of types than classes.
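A quick compile-time illustration of that layout point (a minimal sketch using <type_traits>; the struct names are made up):
#include <type_traits>

struct Plain    { int x; int y; };                 // standard-layout
struct WithVptr { int x; virtual void f() {} };    // compiler adds a vptr

static_assert(std::is_standard_layout<Plain>::value,
              "no virtual functions, still standard-layout");
static_assert(!std::is_standard_layout<WithVptr>::value,
              "one virtual function makes it non-standard-layout");

int main() {}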
From the design point of view, every public function establishes a contract between your type and the users of the type, and every virtual function (public or not) establishes a different contract with the classes that extend your type. The greater the number of such contracts that you sign the less room for changes that you have. As a matter of fact, there are quite a few people, including some well known writers, that defend that the public interface should never contain virtual functions, as your compromise to your clients might be different from the compromises you require from your extensions. That is, the public interfaces shows what you do for your clients, while the virtual interface shows how others might help you in doing it.
Another effect of virtual functions is that they always get dispatched to the final overrider (unless you explicitly qualify the call), and that means that any function that is needed to maintain your invariants (think the state of the private variables) should not be virtual: if a class extends it, it will have to either make an explicit qualified call back to the parent or else would break the invariants at your level.
This is similar to the example of the infinite loop/stack overflow that @Jon Skeet mentioned, just in a different way: you have to document in each function whether it accesses any private attributes so that extensions will ensure that the function is called at the right time. And that in turn means that you are breaking encapsulation and you have a leaking abstraction: your internal details are now part of the interface (documentation + requirements on your extensions), and you cannot modify them as you wish.
Then there is performance... there will be an impact in performance, but in most cases that is overrated, and it could be argued that only in the few cases where performance is critical, you would fall back and declare the functions non-virtual. Then again, that might not be simple on a built product, since the two interfaces (public + extensions) are already bound.
You forget one thing: the overhead is also in memory. That is, you add a virtual table per class and a pointer to that table in each object. Now, if a class is expected to have a significant number of instances, this is not negligible: for example, a million instances times a 4-byte pointer equals 4 megabytes. I agree that for a simple application this is not much, but for real-time devices such as routers this counts.
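The per-object cost is easy to see directly; the exact numbers are ABI-dependent (the 4 bytes above assumes a 32-bit pointer; on a typical 64-bit platform the vptr costs 8 bytes plus alignment padding):
#include <cstdio>

struct NoVirt  { int a; };
struct HasVirt { int a; virtual ~HasVirt() {} };

int main() {
    std::printf("sizeof(NoVirt)=%zu sizeof(HasVirt)=%zu\n",
                sizeof(NoVirt), sizeof(HasVirt));  // e.g. 4 vs 16 on x86-64
}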
I'm rather late to the party here, so I'll add one thing that I haven't noticed covered in other answers, and summarise quickly...
Usability in shared memory: a typical implementation of virtual dispatch has a pointer to a class-specific virtual dispatch table in each object. The addresses in these pointers are specific to the process creating them, which means multi-process systems accessing objects in shared memory can't dispatch using another process's object! That's an unacceptable limitation given shared memory's importance in high-performance multi-process systems.
Encapsulation: the ability of a class designer to control the members accessed by client code, ensuring class semantics and invariants are maintained. For example, if you derive from std::string (I may get a few comments for daring to suggest that ;-P) then you can use all the normal insert / erase / append operations and be sure that - provided you don't do anything that's always undefined behaviour for std::string like pass bad position values to functions - the std::string data will be sound. Someone checking or maintaining your code doesn't have to check if you've changed the meaning of those operations. For a class, encapsulation ensures freedom to later modify the implementation without breaking client code. Another perspective on the same statement: client code can use the class any way it likes without being sensitive to the implementation details. If any function can be changed in a derived class, that whole encapsulation mechanism is simply blown away.
Hidden dependencies: when you know neither what other functions are dependent on the one you're overriding, nor that the function was designed to be overridden, then you can't reason about the impact of your change. For example, you think "I've always wanted this", and change std::string::operator[]() and at() to consider negative values (after a type-cast to signed) to be offsets backwards from the end of the string. But, perhaps some other function was using at() as a kind of assertion that an index was valid - knowing it'll throw otherwise - before attempting an insertion or deletion... that code might go from throwing in a Standard-specified way to having undefined (but likely lethal) behaviour.
Documentation: by making a function virtual, you're documenting that it is an intended point of customisation, and part of the API for client code to use.
Inlining - code side & CPU usage: virtual dispatch complicates the compiler's job of working out when to inline function calls, and could therefore provide worse code in terms of both space/bloat and CPU usage.
Indirection during calls: even if an out-of-line call is being made either way, there's a small performance cost for virtual dispatch that may be significant when calling trivially simple functions repeatedly in performance critical systems. (You have to read the per-object pointer to the virtual dispatch table, then the virtual dispatch table entry itself - means the VDT pages are consuming cache too.)
Memory usage: the per-object pointers to virtual dispatch tables may represent significant wasted memory, especially for arrays of small objects. This means less objects fit in cache, and can have a significant performance impact.
Memory layout: it's essential for performance, and highly convenient for interoperability, that C++ can define classes with the exact memory layout of member data specified by network or data standards of various libraries and protocols. That data often comes from outside your C++ program, and may be generated in another language. Such communications and storage protocols won't have "gaps" for pointers to virtual dispatch tables, and as discussed earlier - even if they did, and the compiler somehow let you efficiently inject the correct pointers for your process over incoming data, that would frustrate multi-process access to the data. Crude-but-practical pointer/size based serialisation/deserialisation/comms code would also be made more complicated and potentially slower.
Pay per use (in Bjarne Stroustrup's words).
Seems like this question might have some answers: Virtual functions should not be used excessively - Why?. In my opinion, the one thing that stands out is that it just adds more complexity in terms of knowing what can be done with inheritance.
Yes, it's because of performance overhead. Virtual methods are called using virtual tables and indirection.
In Java all methods are virtual and the overhead is also present. But, contrary to C++, the JIT compiler profiles the code at run-time and can inline those methods which don't actually need dynamic dispatch. So the JVM knows where it's really needed and where it's not, freeing you from making the decision on your own.
The issue is that while Java compiles to code that runs on a virtual machine, that same guarantee can't be made for C++. It is common to use C++ as a more organized replacement for C, and C has a nearly 1:1 translation to assembly.
If you consider that 9 out of 10 microprocessors in the world are not in a personal computer or a smartphone, you'll see the issue when you further consider that there are a lot of processors that need this low level access.
C++ was designed to avoid that hidden dereferencing if you don't need it, thus keeping that nearly 1:1 nature. Some of the first C++ implementations actually translated to C as an intermediate step before running through a C-to-assembly compiler.
Java method calls are far more efficient than C++ due to runtime optimization.
What we need is to compile C++ into bytecode and run it on JVM.

Virtual Function Compared to Pointer Casting

The current version of some code I'm using utilises a slightly odd way of achieving something which I think could be achieved with polymorphism. More concretely, we currently use something like
for(int i=0; i<CObjList.size(); ++i)
{
    CObj* W = CObjList[i];
    if( W->type == someTypeA )
    {
        // do some things which also involve casts such as
        // ((SomeClassA*) W->objectptr)->someFieldA
    }
    else if( W->type == someTypeB )
    {
        // do some things which also involve casting such as
        // ((SomeClassB*) W->objectptr)->someFieldB
    }
}
To clarify: each object W contains a void *objectptr; that is to say, a pointer to an arbitrary location. The field W->type keeps track of what type of object objectptr points at, so that inside our if/else statements we can cast W->objectptr to the correct type and use its fields.
However, this seems inherently bad from a code design standpoint for several reasons:
We have no guarantee that the object pointed to by W->objectptr actually matches what is said in W->type, so the cast is inherently unsafe.
Every time we wish to add another type we must add another else-if clause and ensure W->type is set correctly.
It seems to me this would be much better solved with something like
class CObj
{
public:
    virtual void doSomething(/* some params */)=0;
};

class SomeClassA : public CObj
{
public:
    virtual void doSomething(/* some params */);
    int someFieldA;
};

class SomeClassB : public CObj
{
public:
    virtual void doSomething(/* some params */);
    int someFieldB;
};

// sometime later...
for(int i=0; i<CObjList.size(); ++i)
{
    CObj* W = CObjList[i];
    W->doSomething(/* some params */);
}
This having been said, there is the proviso that in this setting performance is important. This code will be called from a (relatively) tight loop.
My question is then: is the added complexity of a few vtable lookups outweighed by the improved code design and extensibility, and is this likely to affect performance a lot?
EDIT: It occurs to me that accessing the fields through a pointer in this way could be as bad as vtable lookups anyway due to cache misses etc. Any thoughts on this?
---- EDIT 2: Also I forgot to mention (and I know it's a bit off the original topic), inside the if statements are many calls to member functions of the surrounding class. How would you design the structure so as to be able to call these from inside doSomething()?
I'm going to answer specifically on the performance angle, because I work in a perf-critical environment and a while ago I happened to run measurements on a similar case to work out the fastest solution.
If you are on an x86, PPC, or ARM processor, you want virtual functions in this situation. The performance cost of calling a virtual function is mostly the pipeline bubble induced by mispredicting an indirect branch. Because the instruction fetch stage of the CPU can't know where the computed jmp goes, it can't start fetching bytes from the target address until the branch executes, and thus you have a stall in the pipeline corresponding to the number of stages between the first fetch stage and the branch retire. (On the PPC I know best, that's something like 25 cycles.)
You also have the latency of loading the vtable pointer, but this is often hidden by instruction reordering (the compiler moves the load instruction so it starts several cycles before you actually need the result and the CPU does other work while the data cache sends you its electrons.)
With the if-cascade approach you instead have some number n of direct, conditional branches — where the target is known at compile time, but whether the jump is taken is determined at runtime. (ie, a jump-on-equal opcode.) In this case the CPU will make a guess (predict) at whether each branch is taken or not, and start fetching instructions accordingly. So, you will only have a bubble if the CPU guesses wrong. Since you are presumably calling this function with different input each time, it's going to mispredict at least one of these branches, and you'll have the exact same bubble that you would with virtuals. In fact, you'll have a whole lot more bubbles — one per if() conditional!
With virtual functions, there's also the risk of an additional data cache miss on loading the vtable, and an icache miss on the jump target. If this function is in a tight loop, then presumably you'll be looking up and calling the same subroutines a lot, and thus the vtable and function bodies will probably still be in cache. You could measure that if you wanted to be really sure.
Use virtual functions, this hypothetical optimization means nothing. What matters is code readability, maintainability and quality.
Optimize later with the aid of a profiler if you really need to tune hot spots. Making your code unmaintainable with that kind of crap is a road to failure.
Also, virtual functions will help you do unit tests, mock interfaces, etc.
Programming is about managing complexity....
My question is then: is the added complexity of a few vtable lookups outweighed by the improved code design and extensibility, and is this likely to affect performance a lot?
C++ compilers should be able to implement virtual functions very efficiently, so I don't think there's a downside in using them. (And certainly a huge maintainability/readability benefit!) But you should measure to make sure.
The way they are typically implemented is that each object has a vtable pointer. (multiple pointers in case of multiple inheritance, but let's forget that for now) This has the following relative costs over non-virtual functions.
data space: one pointer per object
data space: one vtable per class (not per object!)
time: worst case = two memory reads per function call (one to get the vtable address, one to get the function address within the vtable). The offset into the vtable is known at compile time, because you know which function you're calling. There are no extra jumps.
Compare this with the costs of the non-OOP approach your existing software has.
data space: one type ID per object
code space: one if/else tree or switch statement each time you wish to call a function dependent on the object type
time: having to evaluate the if/else tree or switch statement.
I'd vote for the virtual function approach as actually being faster than the non-OOP approach, because it eliminates the need to take the time and figure out what type of object it is.
I had some experience with some largish (1M+ line I think) scientific computation code that was using a similar type based switch construct. They refactored to a properly polymorphic based approach and got a significant speedup. Exactly the opposite of what they expected!
Turned out the compiler was better able to optimise some things in that structure.
However this was a long time ago (8 years or so) .. so who knows what modern compilers will do. Don't guess - profile it.
As piotr says the right answer is probably virtual functions. You'll have to test.
But to address your concern about the casts:
Never use C-style casts in a C++ program; use static_cast<>, dynamic_cast<>, etc.
In your specific case, use dynamic_cast<>. At least then you will get a null pointer (or an exception, for references) if the types are not properly related, which is better than a wild crash.
CRTP would be a great idea for such cases.
Edit: In your case,
template<class T>
class CObj
{
public:
    void doSomething(/* some params */)
    {
        static_cast<T*>(this)->doSomething(/* some params */);
    }
};

class SomeClassA : public CObj<SomeClassA>
{
public:
    void doSomething(/* some params */);
    int someFieldA;
};

class SomeClassB : public CObj<SomeClassB>
{
public:
    void doSomething(/* some params */);
    int someFieldB;
};
Now you may have to structure your loop code differently to accommodate objects of the different CObj<T> types, since they no longer share a single common base class.
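With CRTP there is no single common base any more, so the loop itself typically ends up templated and the objects stored in per-type containers. A sketch, continuing the classes above:
#include <vector>

template <class T>
void processAll(std::vector<T>& objs /*, some params */) {
    for (T& obj : objs)
        obj.doSomething(/* some params */);   // resolved at compile time, inlinable
}

// Usage (one homogeneous container per concrete type):
//   std::vector<SomeClassA> as;
//   std::vector<SomeClassB> bs;
//   processAll(as);
//   processAll(bs);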

What is the cost of using a pointer to member function vs. a switch?

I have the following situation:
class A
{
public:
    A(int whichFoo);
    int foo1();
    int foo2();
    int foo3();
    int callFoo(); // calls one of the foo's depending on the value of whichFoo
};
In my current implementation I save the value of whichFoo in a data member in the constructor and use a switch in callFoo() to decide which of the foo's to call. Alternatively, I can use a switch in the constructor to save a pointer to the right fooN() to be called in callFoo().
My question is which way is more efficient if an object of class A is only constructed once, while callFoo() is called a very large number of times. So in the first case we have multiple executions of a switch statement, while in the second there is only one switch, and multiple calls of a member function using the pointer to it. I know that calling a member function using a pointer is slower than just calling it directly. Does anybody know if this overhead is more or less than the cost of a switch?
Clarification: I realize that you never really know which approach gives better performance until you try it and time it. However, in this case I already have approach 1 implemented, and I wanted to find out if approach 2 can be more efficient at least in principle. It appears that it can be, and now it makes sense for me to bother to implement it and try it.
Oh, and I also like approach 2 better for aesthetic reasons. I guess I am looking for a justification to implement it. :)
How sure are you that calling a member function via a pointer is slower than just calling it directly? Can you measure the difference?
In general, you should not rely on your intuition when making performance evaluations. Sit down with your compiler and a timing function, and actually measure the different choices. You may be surprised!
More info: There is an excellent article Member Function Pointers and the Fastest Possible C++ Delegates which goes into very deep detail about the implementation of member function pointers.
You can write this:
#include <cassert>
#include <iostream>
using std::cout;

class Foo;
// FooCall is not defined in the original snippet; a plausible definition:
typedef void (Foo::*FooCall)(int);

class Foo {
public:
    Foo() {
        calls[0] = &Foo::call0;
        calls[1] = &Foo::call1;
        calls[2] = &Foo::call2;
        calls[3] = &Foo::call3;
    }
    void call(int number, int arg) {
        assert(number < 4);
        (this->*(calls[number]))(arg);
    }
    void call0(int arg) {
        cout<<"call0("<<arg<<")\n";
    }
    void call1(int arg) {
        cout<<"call1("<<arg<<")\n";
    }
    void call2(int arg) {
        cout<<"call2("<<arg<<")\n";
    }
    void call3(int arg) {
        cout<<"call3("<<arg<<")\n";
    }
private:
    FooCall calls[4];
};
The computation of the actual function pointer is linear and fast:
(this->*(calls[number]))(arg);
004142E7 mov esi,esp
004142E9 mov eax,dword ptr [arg]
004142EC push eax
004142ED mov edx,dword ptr [number]
004142F0 mov eax,dword ptr [this]
004142F3 mov ecx,dword ptr [this]
004142F6 mov edx,dword ptr [eax+edx*4]
004142F9 call edx
Note that you don't even have to fix the actual function number in the constructor.
I've compared this code to the asm generated by a switch. The switch version doesn't provide any performance increase.
To answer the asked question: at the finest-grained level, the pointer to the member function will perform better.
To address the unasked question: what does "better" mean here? In most cases I would expect the difference to be negligible. Depending on what the class is doing, however, the difference may be significant. Performance testing before worrying about the difference is obviously the right first step.
If you are going to keep using a switch, which is perfectly fine, then you probably should put the logic in a helper method and call it from the constructor. Alternatively, this is a classic case for the Strategy pattern. You could create an interface (or abstract class) named IFoo which has one method with foo's signature. You would have the constructor take in an instance of IFoo (constructor dependency injection) that implements the foo behaviour you want. You would store that IFoo in a private member set by the constructor, and every time you wanted to call foo you would call your IFoo's version.
Note: I haven't worked with C++ since college, so my lingo might be off here, but the general ideas hold for most OO languages.
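A rough C++ rendering of that strategy-pattern suggestion (the names are illustrative, not taken from the question):
#include <memory>

struct IFoo {                                   // the strategy interface
    virtual int foo() = 0;
    virtual ~IFoo() = default;
};

struct Foo1 : IFoo { int foo() override { return 1; } };
struct Foo2 : IFoo { int foo() override { return 2; } };

class A {
public:
    explicit A(std::unique_ptr<IFoo> strategy) : foo_(std::move(strategy)) {}
    int callFoo() { return foo_->foo(); }       // one indirect call, no switch
private:
    std::unique_ptr<IFoo> foo_;
};

int main() {
    A a(std::unique_ptr<IFoo>(new Foo2));       // inject the desired behaviour once
    return a.callFoo();
}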
If your example is real code, then I think you should revisit your class design. Passing in a value to the constructor, and using that to change behaviour is really equivalent to creating a subclass. Consider refactoring to make it more explicit. The effect of doing so is that your code will end up using a function pointer (all virtual methods are, really, are function pointers in jump tables).
If, however your code was just a simplified example to ask whether, in general, jump tables are faster than switch statements, then my intuition would say that jump tables are quicker, but you are dependent on the compiler's optimisation step. But if performance is really such a concern, never rely on intuition - knock up a test program and test it, or look at the generated assembler.
One thing is certain: a switch statement will never be faster than a jump table. The reason is that the best a compiler's optimiser can do is turn a series of conditional tests (i.e. a switch) into a jump table. So if you really want to be certain, take the compiler out of the decision process and use a jump table.
Sounds like you should make callFoo a pure virtual function and create some subclasses of A.
Unless you really need the speed, have done extensive profiling and instrumenting, and determined that the calls to callFoo are really the bottleneck. Have you?
Function pointers are almost always better than chained ifs. They make cleaner code, and are nearly always faster (except perhaps in a case where it's only a choice between two functions and is always correctly predicted).
I should think that the pointer would be faster.
Modern CPUs prefetch instructions; a mis-predicted branch flushes the pipeline, which means the CPU stalls while it refills. A pointer doesn't do that.
Of course, you should measure both.
Optimize only when needed
First: Most of the time you most likely do not care, the difference will be very small. Make sure optimizing this call really makes sense first. Only if your measurements show there is really significant time spent in the call overhead, proceed to optimizing it (shameless plug - Cf. How to optimize an application to make it faster?) If the optimization is not significant, prefer the more readable code.
Indirect call cost depends on target platform
Once you have determined it is worth to apply low-level optimization, then it is a time to understand your target platform. The cost you can avoid here is the branch misprediction penalty. On modern x86/x64 CPU this misprediction is likely to be very small (they can predict indirect calls quite well most of the time), but when targeting PowerPC or other RISC platforms, the indirect calls/jumps are often not predicted at all and avoiding them can cause significant performance gain. See also Virtual call cost depends on platform.
Compiler can implement switch using jump table as well
One gotcha: a switch can sometimes be implemented as an indirect call (using a table) as well, especially when switching between many possible values. Such a switch exhibits the same misprediction problem as a virtual function call. To make this optimization reliable, one would probably prefer using an if instead of a switch for the most common case.
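A sketch of that "handle the common case with an if first" idea (the values and function are made up):
int dispatch(int tag, int x) {
    if (tag == 0)            // overwhelmingly common case: a predictable,
        return x + 1;        // direct conditional branch, no table lookup
    switch (tag) {           // rare cases: may compile to an indirect jump
        case 1:  return x - 1;
        case 2:  return x * 2;
        default: return x;
    }
}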
Use timers to see which is quicker. Although unless this code is going to be run over and over, it's unlikely that you'll notice any difference.
Be sure that if you are running code from the constructor, and the construction fails, you won't leak memory.
This technique is used heavily with Symbian OS:
http://www.titu.jyu.fi/modpa/Patterns/pattern-TwoPhaseConstruction.html
If you are only calling callFoo() once, then most likely the function pointer will be slower by an insignificant amount. If you are calling it many times, then most likely the function pointer will be faster by an insignificant amount (because it doesn't need to keep going through the switch).
Either way look at the assembled code to find out for sure it is doing what you think it is doing.
One often overlooked advantage of a switch (even over sorting and indexing) is when you know that a particular value is used in the vast majority of cases.
It's easy to order the switch so that the most common cases are checked first.
ps. To reinforce greg's answer, if you care about speed - measure.
Looking at the assembler doesn't help much when CPUs have prefetching, predictive branching, pipeline stalls, etc.

AI Applications in C++: How costly are virtual functions? What are the possible optimizations?

In an AI application I am writing in C++,
there is not much numerical computation
there are lot of structures for which run-time polymorphism is needed
very often, several polymorphic structures interact during computation
In such a situation, are there any optimization techniques? While I won't care to optimize the application just now, one aspect of selecting C++ over Java for the project was to enable more leverage to optimize and to be able to use non-object oriented methods (templates, procedures, overloading).
In particular, what are the optimization techniques related to virtual functions? Virtual functions are implemented through virtual tables in memory. Is there some way to pre-fetch these virtual tables onto L2 cache (the cost of fetching from memory/L2 cache is increasing)?
Apart from this, are there good references for data locality techniques in C++? These techniques would reduce the wait time for data fetch into L2 cache needed for computation.
Update: Also see the following related forums: Performance Penalty for Interface, Several Levels of Base Classes
Virtual functions are very efficient. Assuming 32 bit pointers the memory layout is approximately:
classptr -> [vtable:4][classdata:x]
vtable -> [first:4][second:4][third:4][fourth:4][...]
first -> [code:x]
second -> [code:x]
...
The classptr points to memory that is typically on the heap, occasionally on the stack, and starts with a four-byte pointer to the vtable for that class. But the important thing to remember is that the vtable itself is not memory allocated per object. It's a static resource and all objects of the same class type will point to exactly the same memory location for their vtable array. Calling on different instances won't pull different memory locations into L2 cache.
This example from MSDN shows the vtable for class A with virtual func1, func2, and func3. Nothing more than 12 bytes. There is a good chance the vtables of different classes will also be physically adjacent in the compiled library (you'll want to verify this if you're especially concerned), which could increase cache efficiency microscopically.
CONST SEGMENT
??_7A@@6B@
    DD FLAT:?func1@A@@UAEXXZ
    DD FLAT:?func2@A@@UAEXXZ
    DD FLAT:?func3@A@@UAEXXZ
CONST ENDS
The other performance concern would be instruction overhead of calling through a vtable function. This is also very efficient. Nearly identical to calling a non-virtual function. Again from the example from msdn:
; A* pa;
; pa->func3();
mov eax, DWORD PTR _pa$[ebp]
mov edx, DWORD PTR [eax]
mov ecx, DWORD PTR _pa$[ebp]
call DWORD PTR [edx+8]
In this example ebp, the stack frame base pointer, has the variable A* pa at zero offset. The register eax is loaded with the value at location [ebp], so it has the A*, and edx is loaded with the value at location [eax], so it has class A vtable. Then ecx is loaded with [ebp], because ecx represents "this" it now holds the A*, and finally the call is made to the value at location [edx+8] which is the third function address in the vtable.
If this function call was not virtual the mov eax and mov edx would not be needed, but the difference in performance would be immeasurably small.
Section 5.3.3 of the draft Technical Report on C++ Performance is entirely devoted to the overhead of virtual functions.
Have you actually profiled and found where, and what needs optimization?
Work on actually optimizing virtual function calls when you have found they actually are the bottleneck.
The only optimization I can think of is Java's JIT compiler. If I understand it correctly, it monitors the calls as the code runs, and if most calls go to one particular implementation only, it inserts a conditional jump to that implementation for when the class matches. This way, most of the time, there is no vtable lookup. Of course, for the rare case when we pass a different class, the vtable is still used.
I am not aware of any C++ compiler/runtime that uses this technique.
Virtual functions tend to cost a lookup plus an indirect function call. On some platforms, this is fast. On others, e.g., one popular PPC architecture used in consoles, this isn't so fast.
Optimizations usually revolve around expressing variability higher up in the callstack so that you don't need to invoke a virtual function multiple times within hotspots.
You can implement polymorphism at runtime using virtual functions and at compile time by using templates. You can replace virtual functions with templates. Take a look at this article for more information - http://www.codeproject.com/KB/cpp/SimulationofVirtualFunc.aspx
A solution to dynamic polymorphism could be static polymorphism, usable if your types are known at compile time: the CRTP (Curiously recurring template pattern).
http://en.wikipedia.org/wiki/Curiously_recurring_template_pattern
The explanation on Wikipedia is clear enough, and perhaps it could help you if you have really determined that virtual method calls are a source of performance bottlenecks.
As already stated by the other answers, the actual overhead of a virtual function call is fairly small. It may make a difference in a tight loop where it is called millions of times per second, but it's rarely a big deal.
However, it may still have a bigger impact in that it's harder for the compiler to optimize. It can't inline the function call, because it doesn't know at compile-time which function will be called. That also makes some global optimizations harder. And how much performance does this cost you? It depends. It is usually nothing to worry about, but there are cases where it may mean a significant performance hit.
And of course it also depends on the CPU architecture. On some, it can become quite expensive.
But it's worth keeping in mind that any kind of runtime polymorphism carries more or less the same overhead. Implementing the same functionality via switch statements or similar, to select between a number of possible functions may not be cheaper.
The only reliable way to optimize this would be if you could move some of the work to compile-time. If it is possible to implement part of it as static polymorphism, some speedup may be possible.
But first, make sure you have a problem. Is the code actually too slow to be acceptable?
Second, find out what makes it slow through a profiler.
And third, fix it.
I'm reinforcing all answers that say in effect:
If you don't actually know it's a problem, any concern about fixing it is probably misplaced.
What you want to know is:
What fraction of execution time (when it's actually running) is spent in the process of invoking methods, and in particular, which methods are the most costly (by this measure).
Some profilers can give you this information indirectly. They need to summarize at the statement level, but exclusive of the time spent in the method itself.
My favorite technique is to just pause it a number of times under a debugger.
If the time spent in the process of virtual function invocations is significant, like say 20%, then on the average 1 out of 5 samples will show, at the bottom of the call stack, in the disassembly window, the instructions for following the virtual function pointer.
If you don't actually see that, it is not a problem.
In the process, you will probably see other things higher up the call stack, that actually are not needed and could save you a lot of time.
Static polymorphism, as some users answered here. For example, WTL uses this method. A clear explanation of the WTL implementation can be found at http://www.codeproject.com/KB/wtl/wtl4mfc1.aspx#atltemplates
Virtual calls do not present much greater overhead than normal functions. However, the greatest loss is that a virtual function, when called polymorphically, cannot be inlined. And inlining will in a lot of situations represent a real gain in performance.
Something you can do to avoid wasting that facility in some situations is to declare the function inline virtual.
class A {
    inline virtual int foo() {...}
};
And when you are at a point of code you are SURE about the type of the object being called, you may make an inline call that will avoid the polymorphic system and enable inlining by the compiler.
class B : public A {
    inline virtual int foo()
    {
        //...do something different
    }
    void bar()
    {
        //logic...
        B::foo();
        // more logic
    }
};
In this example, the call to foo() will be made non-polymorphic and bound to B implementation of foo(). But do it only when you know for sure what the instance type is, because the automatic polymorphism feature will be gone, and this is not very obvious for later code readers.
You rarely have to worry about cache in regards to such commonly used items, since they're fetched once and kept there.
Cache is only generally an issue when dealing with large data structures that either:
Are large enough and used for a very long time by a single function so that function can push everything else you need out of the cache, or
Are randomly accessed enough that the data structures themselves aren't necessarily in cache when you load from them.
Things like Vtables are generally not going to be a performance/cache/memory issue; usually there's only one Vtable per object type, and the object contains a pointer to the Vtable instead of the Vtable itself. So unless you have a few thousand types of objects, I don't think Vtables are going to thrash your cache.
1), by the way, is why functions like memcpy use cache-bypassing streaming instructions like movnt(dq|q) for extremely large (multi-megabyte) data inputs.
The cost is more or less the same as a normal function nowadays on recent CPUs, but they can't be inlined. If you call the function millions of times, the impact can be significant (try calling the same function millions of times, for example, once with inlining and once without, and you will see it can be twice as slow if the function itself does something simple; this is not a theoretical case: it is quite common in a lot of numerical computation).
With modern, ahead-looking, multiple-dispatching CPUs the overhead for a virtual function might well be zero. Nada. Zip.
If an AI application does not require a great deal of number crunching, I wouldn't worry about the performance disadvantage of virtual functions. There will be a marginal performance hit only if they appear in complex computations which are evaluated repeatedly. I don't think you can force the virtual table to stay in the L2 cache either.
There are a couple of optimizations available for virtual functions:
People have written compilers that resort to code analysis and transformation of the program, but these aren't production-grade compilers.
You could replace all virtual functions with equivalent "switch...case" blocks that call the appropriate function based on the type in the hierarchy. This way you get rid of the compiler-managed virtual table and have your own virtual table in the form of a switch...case block. Now the chances of your own virtual table being in the L2 cache are high, since it is in the code path. Remember, you'll need RTTI or your own "typeof" function to achieve this.
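A sketch of that switch-based dispatch, with made-up types (here a plain enum tag stands in for RTTI):
enum class Kind { Circle, Square };

struct Shape { Kind kind; float dim; };   // a type tag instead of a vptr

float area(const Shape& s) {
    switch (s.kind) {                     // the hand-rolled "virtual table"
        case Kind::Circle: return 3.14159f * s.dim * s.dim;
        case Kind::Square: return s.dim * s.dim;
    }
    return 0.0f;
}

int main() {
    Shape c{Kind::Circle, 2.0f};
    return static_cast<int>(area(c));
}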