The current version of some code I'm using utilises a slightly odd way of achieving something which I think could be achieved with polymorphism. More concretely, we currently use something like
for(int i=0; i<CObjList.size(); ++i)
{
CObj* W = CObjList[i];
if( W->type == someTypeA )
{
// do some things which also involve casts such as
// ((SomeClassA*) W->objectptr)->someFieldA
}
else if( W->type == someTypeB )
{
// do some things which also involve casting such as
// ((SomeClassB*) W->objectptr)->someFieldB
}
}
To clarify: each object W contains a void *objectptr, that is to say a pointer to an arbitrary location. The field W->type keeps track of what type of object objectptr points at, so that inside our if/else statements we can cast W->objectptr to the correct type and use its fields.
However, this seems inherently bad from a code design standpoint for several reasons:
We have no guarantee that the object pointed to by W->objectptr actually matches what is said in W->type so the cast is inherently unsafe.
Every time we wish to add another type we must add another else-if branch and ensure W->type is set correctly.
It seems to me this would be much better solved with something like
class CObj
{
public:
virtual void doSomething(/* some params */)=0;
};
class SomeClassA : public CObj
{
public:
virtual void doSomething(/* some params */);
int someFieldA;
};
class SomeClassB : public CObj
{
public:
virtual void doSomething(/* some params */);
int someFieldB;
};
// sometime later...
for(int i=0; i<CObjList.size(); ++i)
{
CObj* W = CObjList[i];
W->doSomething(/* some params */);
}
That said, there is the proviso that in this setting performance is important. This code will be called from a (relatively) tight loop.
My question is then: is the added cost of a few vtable lookups outweighed by the improved code design and extensibility, and is it likely to affect performance a lot?
EDIT: It occurs to me that accessing the fields through a pointer in this way could be as bad as vtable lookups anyway due to cache misses etc. Any thoughts on this?
EDIT 2: Also, I forgot to mention (and I know it's a bit off the original topic) that inside the if statements are many calls to member functions of the surrounding class. How would you design the structure so as to be able to call these from inside doSomething()?
I'm going to answer specifically on the performance angle, because I work in a perf-critical environment and a while ago I happened to run measurements on a similar case to work out the fastest solution.
If you are on an x86, PPC, or ARM processor, you want virtual functions in this situation. The performance cost of calling a virtual function is mostly the pipeline bubble induced by mispredicting an indirect branch. Because the instruction fetch stage of the CPU can't know where the computed jmp goes, it can't start fetching bytes from the target address until the branch executes, and thus you have a stall in the pipeline corresponding to the number of stages between the first fetch stage and the branch retire. (On the PPC I know best, that's something like 25 cycles.)
You also have the latency of loading the vtable pointer, but this is often hidden by instruction reordering (the compiler moves the load instruction so it starts several cycles before you actually need the result and the CPU does other work while the data cache sends you its electrons.)
With the if-cascade approach you instead have some number n of direct, conditional branches — where the target is known at compile time, but whether the jump is taken is determined at runtime. (ie, a jump-on-equal opcode.) In this case the CPU will make a guess (predict) at whether each branch is taken or not, and start fetching instructions accordingly. So, you will only have a bubble if the CPU guesses wrong. Since you are presumably calling this function with different input each time, it's going to mispredict at least one of these branches, and you'll have the exact same bubble that you would with virtuals. In fact, you'll have a whole lot more bubbles — one per if() conditional!
With virtual functions, there's also the risk of an additional data cache miss on loading the vtable, and an icache miss on the jump target. If this function is in a tight loop, then presumably you'll be looking up and calling the same subroutines a lot, and thus the vtable and function bodies will probably still be in cache. You could measure that if you wanted to be really sure.
Use virtual functions; this hypothetical optimization means nothing. What matters is code readability, maintainability and quality.
Optimize later with the aid of a profiler if you really need to tune hot spots. Making your code unmaintainable with that kind of crap is a road to failure.
Also, virtual functions will help you do unit tests, mock interfaces, etc.
Programming is about managing complexity....
My question is then: is the added cost of a few vtable lookups outweighed by the improved code design and extensibility, and is it likely to affect performance a lot?
C++ compilers should be able to implement virtual functions very efficiently, so I don't think there's a downside in using them. (And certainly a huge maintainability/readability benefit!) But you should measure to make sure.
The way they are typically implemented is that each object has a vtable pointer. (multiple pointers in case of multiple inheritance, but let's forget that for now) This has the following relative costs over non-virtual functions.
data space: one pointer per object
data space: one vtable per class (not per object!)
time: worst case = two memory reads per function call (one to get the vtable address, one to get the function address within the vtable). The offset in the vtable is known at compile time, because you know which function you're calling. There are no extra jumps.
Compare this with the costs of the non-OOP approach your existing software has.
data space: one type ID per object
code space: one if/else tree or switch statement each time you wish to call a function dependent on the object type
time: having to evaluate the if/else tree or switch statement.
I'd vote for the virtual function approach as actually being faster than the non-OOP approach, because it eliminates the need to take the time and figure out what type of object it is.
I had some experience with some largish (1M+ line I think) scientific computation code that was using a similar type based switch construct. They refactored to a properly polymorphic based approach and got a significant speedup. Exactly the opposite of what they expected!
Turned out the compiler was better able to optimise some things in that structure.
However this was a long time ago (8 years or so) .. so who knows what modern compilers will do. Don't guess - profile it.
As piotr says the right answer is probably virtual functions. You'll have to test.
But to address your concern about the casts:
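Never use C-style casts in a C++ program; use static_cast<>, dynamic_cast<>, etc.
In your specific case, use dynamic_cast<> (which needs the objects to share a polymorphic base class rather than being reached through a void*). At least then you will get a null pointer back, or a std::bad_cast exception when casting a reference, if the types are not properly related, which is better than a wild crash.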
CRTP would be a great idea for this kind of case.
Edit: In your case,
template<class T>
class CObj
{
public:
void doSomething(/* some params */)
{
static_cast<T*>(this)->doSomething(/* some params */);
}
};
class SomeClassA : public CObj<SomeClassA>
{
public:
void doSomething(/* some params */);
int someFieldA;
};
class SomeClassB : public CObj<SomeClassB>
{
public:
void doSomething(/* some params */);
int someFieldB;
};
Now you may have to write your loop code in a different way to accommodate objects of the different CObj<T> types.
Related
I come from Python and am still new to C++.
Now I wonder: is calling a function slower than using the function's code directly?
Some example.
struct mynum {
public:
int m_value = 0;
constexpr
int value() { return m_value; }
// Say we would create a func here.
// That wants to use the value of "m_value"
// Is it slower to use "value()" instead of "m_value"?
// Even if the difference is very small.
// Or is there indeed no difference because everything gets compiled.
void somefunc() {
if(value() == 0) {}
}
};
If the function body is available at the time it is called, there is a good chance the compiler will either automatically inline it (the "inline" keyword is just a hint) or deliberately leave it as a separate function. In both cases you are probably on the best path, as compilers are pretty good at this kind of decision, and usually better than us.
If only the function prototype (the declaration) is known by the compiler and the body is defined in another compilation unit (*.cpp file) then there are a couple of hits you might take:
The processor pipeline (and speculative execution) might stall, which may cost you a few cycles, although processors have become extremely efficient at these things in the past 10 years or so. Even dynamic branch optimization has become so good that there is no point rearranging the order of if/else branches like we used to do 20 years ago (still necessary for microprocessors though).
Register allocation will show a clean cut at the call, which primarily affects intensive calculations. Basically, the compiler runs an optimization to decide which registers the variables in use will reside in. When you make a call, only a couple of them are guaranteed to be preserved; all the others need to be reloaded when the function returns. If the number of active variables is large, that load/unload can affect performance, but that is really rare.
If the function is a virtual method, the indirect lookup in the virtual table might add up to ten cycles. Compilers might devirtualize a call if they know exactly which class will be called, however, so this cost might actually be the same as that of a normal function. In more complex cases, with several layers of polymorphism, virtual calls might take up to 20 cycles. In my tests with 2 layers the cost is on average 5-7 cycles on an AMD Zen3 (Threadripper).
But overall, if the function call is not virtual, the cost will be really negligible. There are programmers who swear by inlining everything, but for what my experience is worth, I have programmatically generated code that was 100% inlined and the same code compiled in separate units, and the performance was largely the same.
There is some function call overhead in C++, but a simple function like this that just returns a known variable will probably be compiled out and replaced with a reference to that variable.
I have a C++ program that reads a config file when the binary is executed, creates a number of child class instances based on the config file, and then periodically iterates over these instances and calls their respective virtual functions.
Gprof is telling me that these function calls are taking up a lot of time (the aforementioned iteration happens very frequently), so I want to try to avoid the repeated virtual function calls somehow.
The code is similar to the following. Once the program populates vector v at the start of the program, this vector won't change anymore for the rest of the program, so it seems inefficient to repeatedly have to do a virtual table lookup every time I want to call f(). I would think there must be a way to cache or save the function pointers somehow, but I'm not sure how.
Would love any suggestions you have on speeding things up. Thank you!
Edit: Sorry, I forgot to mention that the function calls f() for the vector of Child instances has to be in order from 0 to v.size() - 1, so I can't group together the elements of v that have the same derived type.
Also, this was built with -O3 -std=c++14
class Parent {
public:
virtual void f() { }
};
class Child1 : public Parent {
public:
void f() { /* do stuff for child1 */ }
};
//...
class Child9 : public Parent {
public:
void f() { /* do stuff for child9 */ }
};
int main() {
vector<Parent*> v;
// read config file and add Child instances to v based on the file contents
while (true) {
// do other stuff
for (size_t i = 0; i != v.size(); ++i) {
v[i]->f(); // expensive to do the same virtual table lookups every loop!
}
}
}
Based on some of the questions and your answers in the comments, here are a couple of considerations.
1) Your problem (if there is one, your solution might already be close to optimal, depending on details you have not mentioned) is most likely somewhere else, not in the overhead of a virtual function call.
If you really run this in a tight loop, and there's not much going on in the implementations of f() that touches a lot of memory, your vtables probably remain in the L1 cache, and the virtual function call overhead will be absolutely minimal, if any, on modern hardware.
2) You say "the functions f() themselves are very simple, for example one of them just multiplies the values at two memory addresses and stores the product in a third address" - this might not be as innocent as you expect. For reference, going to L1 cache will cost you about 3 cycles, going to RAM may cost as much as 60-200, depending on your hardware.
If you have enough of these objects (so that keeping all of the memory they reference in L1 cache is not possible), and the memory locations they reference are basically random (so that prefetching is ineffective), and/or you touch enough things in the rest of your program (so that all the relevant data gets vacated from cache between the loops over your vector), the cost of fetching and storing the values from and to memory/lower levels of cache will outweigh the cost of the virtual function calls by orders of magnitude in the worst case.
3) You iterate over a vector of pointers to objects - not the objects themselves.
Depending on how you allocate the objects and how big they are, this might not be an issue - prefetching will do wonders for you if you allocate them in a tight loop and your allocator packs them nicely. If, however, you allocate/free a lot of other things and mix in the allocations of these objects in between, they may end up located sparsely and in basically random locations in memory; then iterating over them in the order of creation will involve a lot of random reads from memory, which will again be far slower than any virtual function overhead.
4) You say "calls to f() for the vector of children has to be in order" - do they?
If they do, then you are out of luck in some ways. If, however, you can re-architect your system so that they can be called ordered by type, then there is a lot of speed to be gained in various aspects - you could probably allocate an array of each type of object (nice, dense packing in memory), iterate over them in order (prefetcher friendly), and call your f()'s in groups for a single, well known type (inlining friendly, instruction cache friendly).
5) And finally - if none of the above applies and your problem is really in virtual function calls (unlikely), then, yes, you can try storing a pointer to the exact function you need to call for each object in some fashion - either manually or by using one of the type erasure / duck typing methods others have suggested.
My main point is this - there are a lot of performance benefits to be had from changing the architecture of your system in some ways.
Remember: accessing things that are already in L1/L2 cache is good, having to go to L3/RAM for data is worse; accessing memory in a sequential order is good, jumping all over memory is bad; calling the same method in a tight loop, potentially inlining it, is good, calling a lot of different methods in a tight loop is worse.
If this is a part of your program the performance of which really matters, you should consider changing the architecture of your system to allow for some of the previously mentioned optimizations. I know this may seem daunting, but that is the game we are playing. Sometimes you need to sacrifice "clean" OOP and abstractions for performance, if the problem you are solving allows for it.
Edit: For vector of arbitrary child types mixed in together, I recommend going with the virtual call.
If, depending on config, there were a vector of only one child type - or if you can separate the different types into separate containers - then this could be a case where compile-time polymorphism might be an option instead of the runtime kind. For example:
template<class Child, class Range>
void f_for(Range& r) {
for (Parent* p : r) {
Child* c = static_cast<Child*>(p);
c->Child::f(); // use static dispatch to avoid virtual lookup
}
}
...
if (config)
f_for<Child1>(v);
else
f_for<Child2>(v);
An alternative to explicit static dispatch would be to mark the child class or the member function final.
You might even expand the static portion of the program so that you get to use vector<Child1> or vector<Child2> directly, avoiding the extra indirection. At this point the inheritance is not even necessary.
Although I wrote this example in C++, this code refactoring question also applies to any language that endorses OO, such as Java.
Basically I have a class A
class A
{
public:
void f1();
void f2();
//..
private:
m_a;
};
void A::f1()
{
assert(m_a);
m_a->h1()->h2()->GetData();
//..
}
void A::f2()
{
assert(m_a);
m_a->h1()->h2()->GetData();
//..
}
Would you guys create a new private data member m_f holding the pointer m_a->h1()->h2()? The benefit I can see is that it effectively eliminates the multi-level function calls, which does simplify the code a lot.
But from another point of view, it creates an "unnecessary" data member which can be deduced from another existing data member m_a, which is kinda redundant?
I have just come to a dilemma here. So far, I cannot convince myself to use one over the other.
Which do you guys prefer, any reason?
The fancy word for this technique is caching: you calculate a two-away reference once, and cache it in the object. In general, caching lets you "pay" with computer memory for speed-up of your computations.
If a profiler tells you that your code is spending a significant amount of time in the repeated call of m_a->h1()->h2(), this may be a legitimate optimization, provided that the return values of h1 and h2 never change. However, doing an optimization like that without profiling first is nearly always a sign of premature optimization.
If performance is not the issue, a good rule is to stay away from storing members that can be calculated from other members stored in your object. If you would like to improve clarity, you can introduce a nicely named method (a member function) to calculate the two-away reference without storing it. Storing makes sense only in the rare cases when it is critical for the performance.
I would not. I agree it would simplify things in your contrived example, but that's because m_a->h1()->h2() has no inherent meaning. In a well-designed application, the method names used should tell you something qualitative about the calls being made, and that should be a part of self-documenting code. I would argue that in properly designed code, m_a->h1()->h2() should be simpler to read and understand than redirecting to a private method which calls it for you.
Now, if m_a->h1()->h2() is an expensive call which takes a significant time to compute the result, then you might have an argument for caching as @dasblinkenlight suggests. But throwing away the descriptiveness of the method call for the sake of a few keypresses is bad.
Whenever I have something like this I usually store m_a->h1() into a variable with a meaningful name at function scope since it's likely to be used again later in function's body.
Say I have a virtual function call foo() on an abstract base class pointer, mypointer->foo(). When my app starts up, based on the contents of a file, it chooses to instantiate a particular concrete class and assigns mypointer to that instance. For the rest of the app's life, mypointer will always point to objects of that concrete type. I have no way to know what this concrete type is (it may be instantiated by a factory in a dynamically loaded library). I only know that the type will stay the same after the first time an instance of the concrete type is made. The pointer may not always point to the same object, but the object will always be of the same concrete type. Notice that the type is technically determined at 'runtime' because it's based on the contents of a file, but that after 'startup' (file is loaded) the type is fixed.
However, in C++ I pay the virtual function lookup cost every time foo is called for the entire duration of the app. The compiler can't optimize the lookup away because there's no way for it to know that the concrete type won't vary at runtime (even if it was the most amazing compiler ever, it can't speculate on the behavior of dynamically loaded libraries). In a JIT compiled language like Java or .NET the JIT can detect that the same type is being used over and over and do inline caching. I'm basically looking for a way to manually do that for specific pointers in C++.
Is there any way in C++ to cache this lookup? I realize that solutions might be pretty hackish. I'm willing to accept ABI/compiler specific hacks if it's possible to write configure tests that discover the relevant aspects of the ABI/compiler so that it's "practically portable" even if not truly portable.
Update: To the naysayers: If this wasn't worth optimizing, then I doubt modern JITs would do it. Do you think Sun and MS's engineers were wasting their time implementing inline caching, and didn't benchmark it to ensure there was an improvement?
There are two costs to a virtual function call: The vtable lookup and the function call.
The vtable lookup is already taken care of by the hardware. Modern CPUs (assuming you're not working on a very simple embedded CPU) will predict the address of the virtual function in their branch predictor and speculatively execute it in parallel with the array lookup. The fact that the vtable lookup happens in parallel with the speculative execution of the function means that, when executed in a loop in the situations you describe, virtual function calls have next to zero overhead compared to direct, non-inlined function calls.
I've actually tested this in the past, albeit in the D programming language, not C++. When inlining was disabled in the compiler settings and I called the same function in a loop several million times, the timings were within epsilon of each other whether the function was virtual or not.
The second and more important cost of virtual functions is that they prevent inlining of the function in most cases. This is even more important than it sounds because inlining is an optimization that can enable several other optimizations such as constant folding in some cases. There's no way to inline a function without recompiling the code. JITs get around this because they're constantly recompiling code during the execution of your application.
Why is a virtual call expensive? Because you simply don't know the branch target until the code is executed at runtime. Even modern CPUs still do not handle virtual calls and indirect calls perfectly. One can't simply say it costs nothing just because we have faster CPUs. It does cost something.
1. How can we make it fast?
You already have a pretty deep understanding of the problem. The only thing I can say is that if the virtual function call is easy to predict, then you could perform a software-level optimization. But if it's not (i.e., you really have no idea what the target of the virtual function would be), then I don't think there is a good solution for now. Even for the CPU, it is hard to predict in such an extreme case.
Actually, compilers such as Visual C++ with PGO (profile-guided optimization) have a virtual call speculation optimization. If the profiling result can enumerate hot virtual function targets, then the virtual call is translated into a direct call, which can be inlined. This is also called devirtualization. It can also be found in some Java dynamic optimizers.
2. To those one who say it's not necessary
If you're using scripting languages or C# and are concerned about coding efficiency, yes, it's worthless. However, for anyone who is eager to save a single cycle to obtain better performance, the indirect branch is still an important problem. Even the latest CPUs are not good at handling virtual calls. One good example would be a virtual machine or interpreter, which usually has a very large switch-case. Its performance is closely tied to the correct prediction of indirect branches. So you can't simply say it's too low-level or not necessary. There are hundreds of people who are trying to improve performance at the bottom. That's why you can simply ignore such details :)
3. Some boring computer architectural facts related to virtual functions
dsimcha has written a good answer on how the CPU can handle virtual calls effectively. But it's not exactly correct. First, all modern CPUs have a branch predictor, which literally predicts the outcome of a branch to increase pipeline throughput (or, in other words, instruction-level parallelism, ILP). I can even say that single-thread CPU performance depends solely on how much ILP you can extract from a single thread, and branch prediction is the most critical factor for obtaining higher ILP.
In branch prediction, there are two predictions: (1) direction (i.e., is the branch taken or not taken? A binary answer), and (2) branch target (i.e., where will I go? Not a binary answer). Based on the prediction, the CPU speculatively executes the code. If the speculation is not correct, the CPU rolls back and restarts from the mispredicted branch. This is completely hidden from the programmer's view. So you don't really know what's going on inside the CPU unless you're profiling with a tool such as VTune, which gives branch misprediction rates.
In general, branch direction prediction is highly accurate (95%+), but it is still hard to predict branch targets, especially for virtual calls and switch-case (i.e., jump tables). A virtual call is an indirect branch which requires one more memory load, and the CPU also has to predict the branch target. Modern CPUs like Intel's Nehalem and AMD's Phenom have a specialized indirect branch target table.
However, I don't think looking up the vtable incurs a lot of overhead. Yes, it requires one more memory load, which can cause a cache miss. But once the vtable is loaded into cache, it's pretty much always a cache hit. If you're also concerned about that cost, you may add prefetching code to load the vtable in advance. The real difficulty of a virtual function call is that the CPU can't do a great job of predicting the target of the call, which may frequently result in a pipeline drain due to misprediction of the target.
So assuming that this is a fundamental issue you want to solve (to avoid premature optimization arguments), and ignoring platform and compiler specific hackery, you can do one of two things, at opposite ends of complexity:
Provide a function as part of the .dll that internally simply calls the right member function directly. You pay the cost of an indirect jump, but at least you don't pay the cost of a vtable lookup. Your mileage may vary, but on certain platforms, you can optimize the indirect function call.
Restructure your application such that instead of calling a member function per instance, you call a single function that takes a collection of instances. Mike Acton has a wonderful post (with a particular platform and application type bent) on why and how you should do this.
All answers are dealing with the most simple scenario, where calling a virtual method only requires getting the address of the actual method to call. In the general case, when multiple and virtual inheritance come into play, calling a virtual method requires shifting the this pointer.
The method dispatch mechanism can be implemented in more than one way, but it is common to find that the entry in the virtual table is not the actual method to call, but rather some intermediate 'trampoline' code inserted by the compiler that relocates the this pointer prior to calling the actual method.
When the dispatch is the simplest, just an extra pointer redirection, then trying to optimize it does not make sense. When the problem is more complex, then any solution will be compiler dependent and hackerish. Moreover, you do not even know in what scenario you are: if the objects are loaded from dlls then you don't really know whether the actual instance returned belongs to a simple linear inheritance hierarchy or a more complex scenario.
I have seen situations where avoiding a virtual function call is beneficial. This does not look to me to be one of those cases because you really are using the function polymorphically. You are just chasing one extra address indirection, not a huge hit, and one that might be partially optimized away in some situations. If it really does matter, you may want to restructure your code so that type-dependent choices such as virtual function calls are made fewer times, pulled outside of loops.
If you really think it's worth giving it a shot, you can set a separate function pointer to a non-virtual function specific to the class. I might (but probably wouldn't) consider doing it this way.
class MyConcrete : public MyBase
{
public:
static void foo_nonvirtual(MyBase* obj);
virtual void foo()
{ foo_nonvirtual(this); }
};
void (*f_ptr)(MyBase* obj) = &MyConcrete::foo_nonvirtual;
// Call f_ptr instead of obj->foo() in your code.
// Still not as good a solution as restructuring the algorithm.
Other than making the algorithm itself a bit wiser, I suspect any attempt to manually optimize the virtual function call will cause more problems than it solves.
You can't use a method pointer because pointers to member functions aren't considered covariant return types. See the example below:
#include <iostream>
struct base;
struct der;
typedef void(base::*pt2base)();
typedef void(der::*pt2der)();
struct base {
virtual pt2base method() = 0;
virtual void testmethod() = 0;
virtual ~base() {}
};
struct der : base {
void testmethod() {
std::cout << "Hello from der" << std::endl;
}
pt2der method() { // this is invalid because pt2der isn't a covariant of pt2base
return &der::testmethod;
}
};
The other option would be to have the method declared pt2base method() but then the return would be invalid because der::testmethod is not of type pt2base.
Also even if you had a method that received a ptr or reference to the base type you would have to dynamically cast it to the derived type in that method to do anything particularly polymorphic which adds back in the cost we're trying to save.
So, what you basically want to do is convert runtime polymorphism into compile time polymorphism. Now you still need to build your app so that it can handle multiple "cases", but once it's decided which case is applicable to a run, that's it for the duration.
Here's a model of the runtime polymorphism case:
struct Base {
virtual void doit(int&)=0;
};
struct Foo : public Base {
virtual void doit(int& n) {--n;}
};
struct Bar : public Base {
virtual void doit(int& n) {++n;}
};
void work(Base* it,int& n) {
for (unsigned int i=0;i<4000000000u;i++) it->doit(n);
}
int main(int argc,char**) {
int n=0;
if (argc>1)
work(new Foo,n);
else
work(new Bar,n);
return n;
}
This takes ~14s to execute on my Core2, compiled with gcc 4.3.2 (32 bit Debian), -O3 option.
Now suppose we replace the "work" version with a templated version (templated on the concrete type it's going to be working on):
template <typename T> void work(T* it,int& n) {
for (unsigned int i=0;i<4000000000u;i++) it->T::doit(n);
}
main doesn't actually need to be updated, but note that the 2 calls to work now trigger instantiations of and calls to two different and type-specific functions (c.f the one polymorphic function previously).
Hey presto, it runs in 0.001s. Not a bad speed-up factor for a 2 line change! However, note that the massive speed-up is entirely due to the compiler, once the possibility of runtime polymorphism in the work function is eliminated, just optimizing away the loop and compiling the result directly into the code. But that actually makes an important point: in my experience the main gains from using this sort of trick come from the opportunities for improved inlining and optimisation they allow the compiler when a less-polymorphic, more specific function is generated, not from the mere removal of vtable indirection (which really is very cheap).
But I really don't recommend doing stuff like this unless profiling absolutely indicates runtime polymorphism is really hitting your performance. It'll also bite you as soon as someone subclasses Foo or Bar and tries to pass that into a function actually intended for its base.
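The subclassing caveat above can be made concrete with a small sketch. The names FooTimesTen, work_virtual and work_static are hypothetical and only for illustration; the loop count is cut to 3 so the result is easy to check by hand.

```cpp
struct Base {
    virtual ~Base() {}
    virtual void doit(int& n) = 0;
};

struct Foo : Base {
    virtual void doit(int& n) { --n; }
};

// Hypothetical subclass added later by someone else.
struct FooTimesTen : Foo {
    virtual void doit(int& n) { n -= 10; }
};

// Runtime-polymorphic version: the override in FooTimesTen is honoured.
inline void work_virtual(Base* it, int& n) {
    for (int i = 0; i < 3; ++i) it->doit(n);
}

// Templated version: the qualified call T::doit bypasses the vtable,
// so it always runs T's own implementation, even for a subclass of T.
template <typename T>
void work_static(T* it, int& n) {
    for (int i = 0; i < 3; ++i) it->T::doit(n);
}
```

Passing a FooTimesTen through a Foo* to work_static silently runs Foo::doit, whereas work_virtual runs the override. That difference is the bug waiting to happen.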
You might find this related question interesting too.
I asked a very similar question recently, and got the answer that it's possible as a GCC extension, but not portably:
C++: Pointer to monomorphic version of virtual member function?
In particular, I also tried it with Clang and it doesn't support this extension (even though it supports many other GCC extensions).
Could you use a method pointer?
The objective here is that the compiler would load the pointer with the location of the resolved method or function. This would occur once. After the assignment, the code would access the method in a more direct fashion.
I know that accessing the method via a pointer to the object invokes run-time polymorphism. However, there should be a way to load a method pointer to a resolved method, avoiding the polymorphism and directly calling the function.
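One thing worth checking before going down this route: in standard C++, a pointer to a virtual member function still dispatches virtually when it is invoked, so it does not by itself resolve the call ahead of time. A minimal sketch (the names Base, Derived, IdFn and call_via_pointer are made up for illustration):

```cpp
struct Base {
    virtual ~Base() {}
    virtual int id() const { return 0; }
};

struct Derived : Base {
    virtual int id() const { return 1; }
};

// Pointer to a const member function of Base returning int.
typedef int (Base::*IdFn)() const;

// Calls through the member pointer; because id() is virtual, the
// final overrider for the dynamic type of obj is what actually runs.
inline int call_via_pointer(const Base& obj, IdFn fn) {
    return (obj.*fn)();
}
```

Calling call_via_pointer with a Derived object and &Base::id runs Derived::id, so the virtual lookup is still there. Getting a plain address for the resolved function is what the GCC extension mentioned above is for.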
I've checked the community wiki to introduce more discussion.
I have the following situation:
class A
{
public:
A(int whichFoo);
int foo1();
int foo2();
int foo3();
int callFoo(); // calls one of the foo's depending on the value of whichFoo
};
In my current implementation I save the value of whichFoo in a data member in the constructor and use a switch in callFoo() to decide which of the foo's to call. Alternatively, I can use a switch in the constructor to save a pointer to the right fooN() to be called in callFoo().
My question is which way is more efficient if an object of class A is only constructed once, while callFoo() is called a very large number of times. So in the first case we have multiple executions of a switch statement, while in the second there is only one switch, and multiple calls of a member function using the pointer to it. I know that calling a member function using a pointer is slower than just calling it directly. Does anybody know if this overhead is more or less than the cost of a switch?
Clarification: I realize that you never really know which approach gives better performance until you try it and time it. However, in this case I already have approach 1 implemented, and I wanted to find out if approach 2 can be more efficient at least in principle. It appears that it can be, and now it makes sense for me to bother to implement it and try it.
Oh, and I also like approach 2 better for aesthetic reasons. I guess I am looking for a justification to implement it. :)
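For reference, the two approaches described in the question might look roughly like this side by side. This is only a sketch: the foo bodies are placeholders, the names callFooSwitch, callFooPointer, which_ and fn_ are invented here, and falling back to foo1 for an unknown whichFoo is an assumption.

```cpp
class A {
public:
    explicit A(int whichFoo) : which_(whichFoo), fn_(&A::foo1) {
        // Approach 2: a single switch, executed once at construction.
        switch (whichFoo) {
            case 2: fn_ = &A::foo2; break;
            case 3: fn_ = &A::foo3; break;
            default: break;  // assumption: anything else means foo1
        }
    }

    int foo1() { return 1; }  // placeholder bodies
    int foo2() { return 2; }
    int foo3() { return 3; }

    // Approach 1: switch on the stored value on every call.
    int callFooSwitch() {
        switch (which_) {
            case 2: return foo2();
            case 3: return foo3();
            default: return foo1();
        }
    }

    // Approach 2: call through the member function pointer chosen earlier.
    int callFooPointer() { return (this->*fn_)(); }

private:
    typedef int (A::*FooFn)();
    int which_;
    FooFn fn_;
};
```

Both variants return the same result for a given whichFoo, which makes them easy to swap when timing one against the other.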
How sure are you that calling a member function via a pointer is slower than just calling it directly? Can you measure the difference?
In general, you should not rely on your intuition when making performance evaluations. Sit down with your compiler and a timing function, and actually measure the different choices. You may be surprised!
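A rough sketch of what such a measurement could look like, assuming a C++11 compiler for std::chrono. The functions via_switch, add1..add3 and the time_loop helper are hypothetical stand-ins, not anything from the question; substitute the real code paths.

```cpp
#include <chrono>

// Hypothetical stand-in for switch-based dispatch.
inline int via_switch(int which, int x) {
    switch (which) {
        case 0: return x + 1;
        case 1: return x + 2;
        default: return x + 3;
    }
}

// Hypothetical stand-ins for dispatch through a function pointer table.
inline int add1(int x) { return x + 1; }
inline int add2(int x) { return x + 2; }
inline int add3(int x) { return x + 3; }
typedef int (*Fn)(int);

struct Timing {
    long long checksum;   // returned so the loop's work stays observable
    double milliseconds;  // wall-clock time for the whole loop
};

// Runs body(i) for i in [0, iterations) and reports the elapsed time.
template <typename Body>
Timing time_loop(int iterations, Body body) {
    auto start = std::chrono::steady_clock::now();
    long long sum = 0;
    for (int i = 0; i < iterations; ++i) sum += body(i);
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return Timing{sum, elapsed.count()};
}
```

Comparing the checksums of the two variants confirms they do the same work. The timings only mean something with optimisation enabled and enough iterations, and an optimiser may still simplify a loop this trivial, so treat the numbers with suspicion.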
More info: There is an excellent article Member Function Pointers and the Fastest Possible C++ Delegates which goes into very deep detail about the implementation of member function pointers.
You can write this:
#include <cassert>
#include <iostream>
using std::cout;
class Foo {
public:
Foo() {
// fill the dispatch table once, at construction
calls[0] = &Foo::call0;
calls[1] = &Foo::call1;
calls[2] = &Foo::call2;
calls[3] = &Foo::call3;
}
void call(int number, int arg) {
assert(number >= 0 && number < 4);
(this->*(calls[number]))(arg);
}
void call0(int arg) {
cout<<"call0("<<arg<<")\n";
}
void call1(int arg) {
cout<<"call1("<<arg<<")\n";
}
void call2(int arg) {
cout<<"call2("<<arg<<")\n";
}
void call3(int arg) {
cout<<"call3("<<arg<<")\n";
}
private:
typedef void (Foo::*FooCall)(int); // pointer to a Foo member function taking an int
FooCall calls[4];
};
The computation of the actual function pointer is branch-free and fast (one indexed load, then an indirect call):
(this->*(calls[number]))(arg);
004142E7 mov esi,esp
004142E9 mov eax,dword ptr [arg]
004142EC push eax
004142ED mov edx,dword ptr [number]
004142F0 mov eax,dword ptr [this]
004142F3 mov ecx,dword ptr [this]
004142F6 mov edx,dword ptr [eax+edx*4]
004142F9 call edx
Note that you don't even have to fix the actual function number in the constructor.
I've compared this code to the asm generated by a switch. The switch version doesn't provide any performance increase.
To answer the asked question: at the finest-grained level, the pointer to the member function will perform better.
To address the unasked question: what does "better" mean here? In most cases I would expect the difference to be negligible. Depending on what the class is doing, however, the difference may be significant. Performance testing before worrying about the difference is obviously the right first step.
If you are going to keep using a switch, which is perfectly fine, then you should probably put the logic in a helper method and call it from the constructor. Alternatively, this is a classic case of the Strategy Pattern. You could create an interface (or abstract class) named IFoo which has one method with foo's signature. The constructor would take an instance of IFoo (constructor Dependency Injection) that implements the foo method you want. You would store it in a private IFoo member, and every time you wanted to call foo you would call your IFoo's version.
Note: I haven't worked with C++ since college, so my lingo might be off here, but the general ideas hold for most OO languages.
If your example is real code, then I think you should revisit your class design. Passing in a value to the constructor, and using that to change behaviour is really equivalent to creating a subclass. Consider refactoring to make it more explicit. The effect of doing so is that your code will end up using a function pointer (all virtual methods really are function pointers in jump tables).
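In C++ the Strategy idea could be sketched roughly as follows. IFoo comes from the description above; Foo1, Foo2 and the member name strategy_ are invented for the example, and holding a non-owning pointer is just one possible ownership choice.

```cpp
// Interface with a single method matching foo's signature.
struct IFoo {
    virtual ~IFoo() {}
    virtual int foo() = 0;
};

// Hypothetical concrete strategies with placeholder bodies.
struct Foo1 : IFoo {
    virtual int foo() { return 1; }
};
struct Foo2 : IFoo {
    virtual int foo() { return 2; }
};

class A {
public:
    // The strategy is injected through the constructor.
    explicit A(IFoo& strategy) : strategy_(&strategy) {}

    // Forwards to whichever implementation was injected.
    int callFoo() { return strategy_->foo(); }

private:
    IFoo* strategy_;  // not owned; must outlive this A (an assumption)
};
```

The choice of behaviour is made once, where the A is built, and callFoo pays for one virtual call instead of a switch.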
If, however your code was just a simplified example to ask whether, in general, jump tables are faster than switch statements, then my intuition would say that jump tables are quicker, but you are dependent on the compiler's optimisation step. But if performance is really such a concern, never rely on intuition - knock up a test program and test it, or look at the generated assembler.
One thing is certain: a switch statement will never be faster than a jump table. The reason is that the best a compiler's optimiser can do is to turn a series of conditional tests (i.e. a switch) into a jump table. So if you really want to be certain, take the compiler out of the decision process and use a jump table.
Sounds like you should make callFoo a pure virtual function and create some subclasses of A.
Unless you really need the speed, have done extensive profiling and instrumenting, and determined that the calls to callFoo are really the bottleneck. Have you?
Function pointers are almost always better than chained-ifs. They make cleaner code, and are nearly always faster (except perhaps in a case where it's only a choice between two functions and is always correctly predicted).
I should think that the pointer would be faster.
Modern CPUs prefetch instructions; a mispredicted branch flushes the pipeline, which means the CPU stalls while it refills. A call through a pointer avoids the chain of conditional branches.
Of course, you should measure both.
Optimize only when needed
First: Most of the time you most likely do not care, the difference will be very small. Make sure optimizing this call really makes sense first. Only if your measurements show there is really significant time spent in the call overhead, proceed to optimizing it (shameless plug - Cf. How to optimize an application to make it faster?) If the optimization is not significant, prefer the more readable code.
Indirect call cost depends on target platform
Once you have determined that low-level optimization is worthwhile, it is time to understand your target platform. The cost you can avoid here is the branch misprediction penalty. On modern x86/x64 CPUs this penalty is likely to be very small (they can predict indirect calls quite well most of the time), but when targeting PowerPC or other RISC platforms, indirect calls and jumps are often not predicted at all, and avoiding them can yield a significant performance gain. See also Virtual call cost depends on platform.
Compiler can implement switch using jump table as well
One gotcha: Switch can sometimes be implemented as an indirect call (using a table) as well, especially when switching between many possible values. Such switch exhibits the same misprediction as a virtual function. To make this optimization reliable, one would probably prefer using if instead of switch for the most common case.
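A sketch of that idea: test the most common value with a plain if, and only fall back to a table for the rest. The handler names and the assumption that kind 0 is the hot case are hypothetical.

```cpp
inline int common_case(int x) { return x; }      // assumed hot path
inline int rare_a(int x) { return x * 2; }
inline int rare_b(int x) { return x * 3; }

typedef int (*Handler)(int);

// kind 0 is assumed to dominate; kinds 1 and 2 are rare.
inline int dispatch(int kind, int x) {
    // Direct call behind one easily predicted conditional branch.
    if (kind == 0) return common_case(x);

    // Everything else goes through an indirect call via a small table.
    static const Handler rare[] = { rare_a, rare_b };
    if (kind == 1 || kind == 2) return rare[kind - 1](x);

    return -1;  // unknown kind; the error convention is an assumption
}
```

Whether this beats a plain switch depends on how skewed the distribution really is and on the target CPU, so it is only worth keeping if measurement supports it.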
Use timers to see which is quicker. Although unless this code is going to be run over and over, it's unlikely that you'll notice any difference.
Be sure that if you are running code from the constructor, you won't leak memory if the construction fails.
This technique is used heavily with Symbian OS:
http://www.titu.jyu.fi/modpa/Patterns/pattern-TwoPhaseConstruction.html
If you are only calling callFoo() once, then most likely the function pointer will be slower by an insignificant amount. If you are calling it many times, then most likely the function pointer will be faster by an insignificant amount (because it doesn't need to keep going through the switch).
Either way look at the assembled code to find out for sure it is doing what you think it is doing.
One often overlooked advantage to switch (even over sorting and indexing) is if you know that a particular value is used in the vast majority of cases.
It's easy to order the switch so that the most common are checked first.
ps. To reinforce greg's answer, if you care about speed - measure.
Looking at assembler doesn't help when CPUs have prefetch / predictive branching and pipeline stalls etc