I have kept hearing this statement. Switch..Case is Evil for code maintenance, but it provides better performance(since compiler can inline stuffs etc..). Virtual functions are very good for code maintenance, but they incur a performance penalty of two pointer indirections.
Say i have a base class with 2 subclasses(X and Y) and one virtual function, so there will be two virtual tables. The object has a pointer, based on which it will choose a virtual table. So for the compiler, it is more like
switch( object's function ptr )
{
case 0x....:
X->call();
break;
case 0x....:
Y->call();
};
So why should virtual function cost more, if it can get implemented this way, as the compiler can do the same in-lining and other stuff here. Or explain me, why is it decided not to implement the virtual function execution in this way?
Thanks,
Gokul.
The compiler can't do that because of the separate compilation model.
At the time the virtual function call is being compiled, there is no way for the compiler to know for sure how many different subclasses there are.
Consider this code:
// base.h
class base
{
public:
virtual void doit();
};
and this:
// usebase.cpp
#include "base.h"
void foo(base &b)
{
b.doit();
}
When the compiler is generating the virtual call in foo, it has no knowledge of which subclasses of base will exist at runtime.
Your question rests on misunderstandings about the way switches and virtual functions work. Rather than fill up this box with a long treatise on code generation, I'll give a few bullet points:
Switch statements aren't necessarily faster than virtual function calls, or inlined. You can learn more about the way that switch statements are turned into assembly here and here.
The thing that is slow about virtual function calls isn't the pointer lookups, it's the indirect branch. For complicated reasons having to do with the internal electronics of the CPU, for most modern processors it is faster to perform a "direct branch", where the destination address is encoded in the instruction, than an "indirect branch", where the address is computed at runtime. Virtual function calls and large switch statements are usually implemented as indirect branches.
In your example above, the switch is completely redundant. Once an object's member function pointer has been computed, the CPU can branch straight to it. Even if the linker was aware of every possible member object that existed in the executable, it would still be unnecessary to add that table lookup.
Here's some results from concrete tests. These particular results are from VC++ 9.0/x64:
Test Description: Time to test a global using a 10-way if/else if statement
CPU Time: 7.70 nanoseconds plus or minus 0.385
Test Description: Time to test a global using a 10-way switch statement
CPU Time: 2.00 nanoseconds plus or minus 0.0999
Test Description: Time to test a global using a 10-way sparse switch statement
CPU Time: 3.41 nanoseconds plus or minus 0.171
Test Description: Time to test a global using a 10-way virtual function class
CPU Time: 2.20 nanoseconds plus or minus 0.110
With sparse cases, the switch statement is substantially slower. With dense cases, the switch statement might be faster, but the switch and the virtual function dispatch overlap a bit, so while the switch is probably faster, the margin is so small we can't even be sure it is faster, not to mention being enough faster to care much about. If the cases in the switch statement are sparse at all, there's no real question that the virtual function call will be faster.
There is no branching in virtual dispatch. The vptr in your class points to a vtable, with a second pointer for the specific function at a constant offset.
Actually, if you'll have many virtual functions, switch-like branching will be slower than two pointer indirection. Performance of the current implementation doesn't depend on how many virtual functions you have.
Your statement about the branching when calling virtual function is wrong. There is not such thing in generated code. Take a look at the assembly code will give you a better idea.
In a nut shell, one general simplified implementation of C++ virtual function is: each class will have a virtual table (vbtl), and each instance of the class will have a virtual table pointer (vptr). The virtual table is basically a list of function pointers.
When you are calling a virtual function, say it is like:
class Base {};
class Derived {};
Base* pB = new Derived();
pB->someVirtualFunction();
The 'someVirtualFunction()' will have a corresponding index in the vtbl. And the call
pB->someVirtualFunction();
will be converted to something like:
pB->vptr[k](); //k is the index of the 'someVirtualFunction'.
In this way the function is actually called indirectly and it has the polymorphism.
I suggest you to read 'The C++ Object Model' by Stanley Lippman.
Also, the statement that virtual function call is slower than the switch-case is not accruate. It depends. As you can see above, a virtual function call is just 1 extra time of dereference compared to a regular function call. And with switch-case branching you'd have extra comparision logic (which introduce the chance of missing cache for CPU) which also consumes CPU cycles. I would say in most case, if not all, virutal function call should be faster than switch-case.
Saying definitively that switch/case is more or less performant than virtual calls is an over-generalization. The truth is that this will depend on many things, and will vary based on:
what compiler you are using
what optimizations are enabled
the overall characteristics of your program and how they impact those optimizations
If you are optimizing the code in your head as you write it, there is a good chance that you are making the wrong choice. Write the code in a human-readable and/or user-friendly way first, then run the entire executable through profiling tools. If this area of the code shows up as a hotspot, then try it both ways and see which is quantifiably better for your particular case.
These kind of optimizations are possible only by a repatching linker which should run as a part of C++ runtime.
The C++ runtime is more complex that even a new DLL load (with COM) will add new function pointers to the vtable. (think about pure virtual fns?)
Then compiler or linker both cannot do this optimization. switch/case is obviously faster than indirect call since prefetch in CPU is deterministic and pipelining is possible. But it will not work out in C++ because of this runtime extension of object's vtable.
Related
As part of a system design, we need to implement a factory pattern. In combination with the Factory pattern, we are also using CRTP, to provide a base set of functionality which can then be customized by the Derived classes.
Sample code below:
class FactoryInterface{
public:
virtual void doX() = 0;
};
//force all derived classes to implement custom_X_impl
template< typename Derived, typename Base = FactoryInterface>
class CRTP : public Base
{
public:
void doX(){
// do common processing..... then
static_cast<Derived*>(this)->custom_X_impl();
}
};
class Derived: public CRTP<Derived>
{
public:
void custom_X_impl(){
//do custom stuff
}
};
Although this design is convoluted, it does a provide a few benefits. All the calls after the initial virtual function call can be inlined. The derived class custom_X_impl call is also made efficiently.
I wrote a comparison program to compare the behavior for a similar implementation (tight loop, repeated calls) using function pointers and virtual functions. This design came out triumphs for gcc/4.8 with O2 and O3.
A C++ guru however told me yesterday, that any virtual function call in a large executing program can take a variable time, considering cache misses and I can achieve a potentially better performance using C style function table look-ups and gcc hotlisting of functions. However I still see 2x the cost in my sample program mentioned above.
My questions are as below:
1. Is the guru's assertion true? For either answers, are there any links I can refer.
2. Is there any low latency implementation which I can refer, has a base class invoking a custom function in a derived class, using function pointers?
3. Any suggestions on improving the design?
Any other feedback is always welcome.
Your guru refers to the hot attribute of the gcc compiler. The effect of this attribute is:
The function is optimized more aggressively and on many targets it is
placed into a special subsection of the text section so all hot
functions appear close together, improving locality.
So yes, in a very large code base, the hotlisted function may remain in cache ready to be executed without delay, because it avodis cache misses.
You can perfectly use this attribute for member functions:
struct X {
void test() __attribute__ ((hot)) {cout <<"hello, world !\n"; }
};
But...
When you use virtual functions the compiler generally generates a vtable that is shared between all objects of the class. This table is a table of pointers to functions. And indeed -- your guru is right -- nothing garantees that this table remains in cached memory.
But, if you manually create a "C-style" table of function pointers, the problem is EXACTLY THE SAME. While the function may remain in cache, nothing ensures that your function table remains in cache as well.
The main difference between the two approaches is that:
in the case of virtual functions, the compiler knows that the virtual function is a hot spot, and could decide to make sure to keep the vtable in cache as well (I don't know if gcc can do this or if there are plans to do so).
in the case of the manual function pointer table, your compiler will not easily deduce that the table belongs to a hot spot. So this attempt of manual optimization might very well backfire.
My opinion: never try to optimize yourself what a compiler can do much better.
Conclusion
Trust in your benchmarks. And trust your OS: if your function or your data is frequently acessed, there are high chances that a modern OS will take this into account in its virtual memry management, and whatever the compiler will generate.
I know a lot of questions on these topics have been asked before, but I have a specific case where speed is (moderately) important and the speed increase when using function pointers rather than virtual functions is about 25%. I wondered (for mostly academic reasons) why?
To give more detail, I am writing a simulation which consists of a series of Cells. The Cells are connected by Links which dictate the interactions between the Cells. The Link class has a virtual function called update(), which causes the two Cells it is linking to interact. It is virtual so that I can create different types of Links to give different types of interactions. For example, at the moment I am simulating inviscid flow, but I may want a link which has viscosity or applies a boundary condition.
The second way I can achieve the same affect is to pass a function pointer to the Link class and make the target of the function pointer a friend. I can now have a non-virtual update() which uses the function pointer. Derived classes can use pointers to different functions giving polymorphic behaviour.
When I build the two versions and profile with Very Sleepy, I find that the function pointer version is significantly faster than the virtual function version and that it appears that the Link class has been entirely optimised away - I just see calls from my main function to the functions pointed to.
I just wondered what made it easier for my compiler (MSVC++ 2012 Express) to optimise the function pointer case better than the virtual function case?
Some code below if it helps for the function pointer case, I'm sure it is obvious how the equivalent would be done with virtual functions.
void InviscidLinkUpdate( void * linkVoid )
{
InviscidLink * link=(InviscidLink*)linkVoid;
//do some stuff with the link
//e.g.
//link->param1=
}
void ViscousLinkUpdate( void * linkVoid )
{
ViscousLink * link=(ViscousLink*)linkVoid;
//do some stuff with the link
//e.g.
//link->param1=
}
class Link
{
public:
Link(Cell *cell1, Cell*cell2, float area, float timeStep, void (*updateFunc)( void * ))
:m_cell1(cell1), m_cell2(cell2), m_area(area), m_timeStep(timeStep), m_update(updateFunc)
~Link(){};
void update() {m_update( this );}
protected:
void (*const m_update)( void *, UNG_FLT );
Cell *m_cell1;
Cell *m_cell2;
float m_area;
float m_timeStep
//some other parameters I want to modify in update()
float param1;
float param2;
};
class InviscidLink : public Link
{
friend void InviscidLinkUpdate( void * linkVoid )
public:
InviscidLink(Cell *cell1, Cell*cell2, float area, float timeStep)
Link(cell1, cell2, area, timeStep, InvicedLinkUpdate)
{}
};
class ViscousLink : public Link
{
friend void ViscousLinkUpdate( void * linkVoid )
public:
ViscousLink(Cell *cell1, Cell*cell2, float area, float timeStep)
Link(cell1, cell2, area, timeStep, ViscousLinkUpdate)
{}
};
edit
I have now put the full source on GitHub - https://github.com/philrosenberg/ung
Compare commit 5ca899d39aa85fa3a86091c1202b2c4cd7147287 (the function pointer version) with commit aff249dbeb5dfdbdaefc8483becef53053c4646f (the virtual function version). Unfortunately I based the test project initially on a wxWidgets project in case I wanted to play with some graphical display so if you don't have wxWidgets then you will need to hack it into a command line project to compile it.
I used Very Sleepy to benchmark it
further edit:
milianw's comment about profile guided optimization turned out to be the solution, but as a comment I currently cannot mark it as the answer. Using the pro version of Visual Studio with the profile guided optimization gave similar runtimes as using inline functions. I guess this is the Virtual Call Speculation described at http://msdn.microsoft.com/en-us/library/e7k32f4k.aspx. I still find it a bit odd that this code could be more easily optimized using function pointers instead of virtual functions, but I guess that is why everyone advises to TEST, rather than assume certain code is faster than another.
Two things I can think about that differs when using function pointers vs virtual functions :
Your class size will be smaller since it won't have a vftable allocated hence smaller size, more cache friendly
There's one indirection less with function pointer ( With virtual functions : Object Indirection, vftable indirection, virtual function indirection , with functors : Object indirection, functor indirection -> your update function is resolved at compile time, since it's not virtual)
As requested, here my comment again as an answer:
Try to use profile-guided optimizations here. Then, the profiler can potentially apply devirtualization to speed up the code. Also, don't forget to mark your implementations as final, which can further help. See e.g. http://channel9.msdn.com/Shows/C9-GoingNative/C9GoingNative-12-C-at-BUILD-2012-Inside-Profile-Guided-Optimization or the excellent GCC article series over at http://hubicka.blogspot.de/search/label/devirtualization.
The actual cost of a virtual function call is normally insignificant. However, virtual functions may, as you observed, impact the code speed considerably. The main reason is that normally a virtual function call is real function call - with adding a frame to the stack. It is so because virtual function calls are resolved in runtime.
If a function is not virtual, it is much easier for C++ compiler to inline it. The call is resolved in compilation time, so the compiler may replace the call with the body of the called function. This allows for much more aggressive optimisations - like doing some computations once only instead of each loop run, etc.
Based on the information provided here my best guess is that you're operating on a large number of objects, and that the one extra indirection induced by the virtual table is increasing cache misses to the point where the I/O to refetch from memory becomes measurable.
As another alternative have you considered using templates and either CRTP or a policy-based approach for the Link class? Depending on your needs it may be possible to entirely remove the dynamic dispatching.
I plan to partition my computation into a fine-grained framework of functions/classes which encapsulate a certain part.
Something like this, but with even more classes and typically longer parameter lists:
class Point{
Coordinates thisPoint;
Value getPointValue();
Point getPoint(Offset offset);
Point getNumNeighbors();
Point getNeighbor(int i);
// many more
}
class Operator{
void doOperation(Point p){
// calls some of the functions in Point
}
}
Clearly, this would be a good practice in any object oriented language. But it's intended to run on a CUDA GPU. What I don't know: When I qualify all these fine-grained functions as __device__ and call them in a kernel - how will they be implemented? Will I have a significant overhead for the calls of the member functions or will this be inlined or otherwise efficiently optimized? Normally, these functions are extremely short but called many, many times.
The GPU compiler will aggressively inline functions for performance reasonse. In that case, there should be no particular impact to performance.
If a function cannot be inlined, then the usual performance overhead would occur, involving the creation of a stack frame and a call to a function -just as you would observe on a CPU call to a non-inlined function.
If you have concerns about a specific example, you can create a short test code and look at the generated assembly language (SASS) by using cuobjdump -sass myexe and determine whether or not the function was inlined.
There are no general restrictions on inlining of __device__ functions that are class members/methods.
If I have this situation in C++ project:
1 base class 'Base' containing only pure virtual functions
1 class 'Derived', which is the only class which inherits (public) from 'Base'
Will the compiler generate a VTABLE?
It seems there would be no need because the project only contains 1 class to which a Base* pointer could possibly point (Derived), so this could be resolved compile time for all cases.
This is interesting if you want to do dependency injection for unit testing but don't want to incur the VTABLE lookup costs in production code.
I don't have hard data, but I have good reasons to say no, it won't turn virtual calls into static ones.
Usually, the compiler only sees a single compilation unit. It cannot know there's only a single subclass, because five months later you may write another subclass, compile it, get some ancient object files from the backup and link them all together.
While link-time optimizations do see the whole picture, they usually work on a far lower-level representation of the program. Such representation allow e.g. inlining of static calls, but don't represent inheritance information (except perhaps as optional metadata) and already have the virtual calls and vtables spelt out explicitly. I know this is the case for Clang and IIRC gcc's whole-program optimizations also work on some low-level IR (GIMPLE?).
Also note that with dynamic loading, you can still add more subclasses long after compilation and LTO. You may not need it, but if I was a compiler writer, I'd be weary of adding an optimization that allows people royally breaking virtual calls in very specific, hard-to-track-down circumstances.
It's rarely worth the trouble - if you don't need virtual calls (e.g. because you know you won't need any more subclasses), don't make stuff virtual. Review your design. If you need some polymorphism but not the full power of virtual, the curiously recurring template pattern may help.
The compiler doesn't have to use a vtable based implementation of virtual function dispatch at all so the answer to your question will be specific to the implementation that you are using.
The vtable is usually not only used for virtual functions, but it is also used to identify the class type when you do some dynamic_cast or when the program accesses the type_info for the class.
If the compiler detects that no virtual functions ever need a dynamic dispatch and none of the other features are used, it just could remove the vtable pointer as an optimization.
Obviously the compiler writer hasn't found it worth the trouble of doing this. Probably because it wouldn't be used very often.
I've a simple class that looks like Boost.Array. There are two template parameters T and N. One drawback of Boost.Array is, that every method that uses such an array, has to be a template with parameter N (T is OK). The consequence is that the whole program tends to be a template. One idea is to create an interface (abstract class with only pure virtual functions) that only depends on T (something like ArrayInterface). Now every other class only access the interface, and therefore needs only the template parameter T (which in contrast to N, is more or less always known). The drawback here is the overhead of the virtual call (more the missed opportunity to inline calls), if the interface is used. Until here only facts.
template<typename T>
class ArrayInterface {
public:
virtual ~ArrayInterface() {};
virtual T Get(int i) = 0;
};
template<typename T, int N>
class Array : ArrayInterface<T> {
public:
T Get(int i) { ... }
};
template<typename T, int N>
class ArrayWithoutInterface {
public:
T Get() { ... }
};
But my real problem lies somewhere else. When I extend Boost.Array with an Interface, a direct instantiation of Boost.Array gets slow (factor 4 in one case, where it matters). If I remove the interface, Boost.Array is as fast as before. I understand, if a method is called through ArrayInterface there is an overhead, that is OK. But I don't understand why the a call to a method gets slower if there is only an additional interface with only pure virtual methods and the class is called directly.
Array<int, 1000> a;
a.Get(0); // Slow
ArrayWithoutInterface<int, 1000> awi;
awi.Get(0); // Fast
GCC 4.4.3 and Clang 1.1 show the same behavior.
This behaviour is expected: you’re invoking a virtual method. Whether you’re invoking it directly or via a base class pointer isn’t relevant at first: in both cases, the call has got to go through the virtual function table.
For a simple call such as Get (that simply dereferences an array cell, presumably without bounds checking), this can indeed make a factor 4 difference.
Now, a good compiler could see that the added indirection isn’t necessary here since the dynamic type of the object (and hence the method call target) is known at compile time. I’m mildly surprised that GCC apparently doesn’t optimize this (did you compile with -O3?). Then again, it’s only an optimization.
I'd disagree with your conclusion that 'the whole program tends to be a template': it looks to me like you're trying to solve a non-problem.
However, it's unclear what you mean by 'extend Boost.Array with an interface': are you modifying the source of boost::array by introducing your interface? If so, every array instance you are creating has to drag a vtable pointer along, whether or not you use the virtual methods. The existence of virtual methods may also make the compiler wary of using aggressive optimizations on non-virtual methods possible in a purely header-defined class.
Edited: ...and of course, you are using the virtual method. It takes pretty advanced code analysis techniques for the compiler to be certain a virtual call can be optimized away.
Two reasons:
late binding is slow
virtual methods cannot be inlined
If you have a virtual method that never gets extended, it could be quite possible that the compiler is optimizing out the virtual part of the methods. During a normal non-virtual method call, the program flow will go directly from the caller to the method. However, when the method is marked virtual, the cpu must first jump to a virtual table, then find the method you're looking for, and then jump to that method.
Now this normally isn't too noticeable. If the method you're calling takes 100ms to execute, even if your virtual table lookup takes 1ms, it's not going to matter. But if, in the case of an array, your method takes 0.5ms to execute that 1ms performance drop is going to be quite noticeable.
There's not much you can do about it, except don't extend Boost.Array, or rename your methods so they don't override.