optimising function pointers and virtual functions in C++ - c++

I know a lot of questions on these topics have been asked before, but I have a specific case where speed is (moderately) important and the speed increase when using function pointers rather than virtual functions is about 25%. I wondered (for mostly academic reasons) why?
To give more detail, I am writing a simulation which consists of a series of Cells. The Cells are connected by Links which dictate the interactions between the Cells. The Link class has a virtual function called update(), which causes the two Cells it is linking to interact. It is virtual so that I can create different types of Links to give different types of interactions. For example, at the moment I am simulating inviscid flow, but I may want a link which has viscosity or applies a boundary condition.
The second way I can achieve the same effect is to pass a function pointer to the Link class and make the target of the function pointer a friend. I can then have a non-virtual update() which calls through the function pointer. Derived classes can pass pointers to different functions, giving polymorphic behaviour.
When I build the two versions and profile with Very Sleepy, I find that the function pointer version is significantly faster than the virtual function version and that it appears that the Link class has been entirely optimised away - I just see calls from my main function to the functions pointed to.
I just wondered what made it easier for my compiler (MSVC++ 2012 Express) to optimise the function pointer case better than the virtual function case?
Some code below, in case it helps, for the function pointer case; I'm sure it is obvious how the equivalent would be done with virtual functions.
void InviscidLinkUpdate( void * linkVoid )
{
InviscidLink * link=(InviscidLink*)linkVoid;
//do some stuff with the link
//e.g.
//link->param1=
}
void ViscousLinkUpdate( void * linkVoid )
{
ViscousLink * link=(ViscousLink*)linkVoid;
//do some stuff with the link
//e.g.
//link->param1=
}
class Link
{
public:
Link(Cell *cell1, Cell *cell2, float area, float timeStep, void (*updateFunc)( void * ))
:m_cell1(cell1), m_cell2(cell2), m_area(area), m_timeStep(timeStep), m_update(updateFunc)
{}
~Link(){};
void update() {m_update( this );}
protected:
void (*const m_update)( void * );
Cell *m_cell1;
Cell *m_cell2;
float m_area;
float m_timeStep;
//some other parameters I want to modify in update()
float param1;
float param2;
};
class InviscidLink : public Link
{
friend void InviscidLinkUpdate( void * linkVoid );
public:
InviscidLink(Cell *cell1, Cell *cell2, float area, float timeStep)
:Link(cell1, cell2, area, timeStep, InviscidLinkUpdate)
{}
};
class ViscousLink : public Link
{
friend void ViscousLinkUpdate( void * linkVoid );
public:
ViscousLink(Cell *cell1, Cell *cell2, float area, float timeStep)
:Link(cell1, cell2, area, timeStep, ViscousLinkUpdate)
{}
};
edit
I have now put the full source on GitHub - https://github.com/philrosenberg/ung
Compare commit 5ca899d39aa85fa3a86091c1202b2c4cd7147287 (the function pointer version) with commit aff249dbeb5dfdbdaefc8483becef53053c4646f (the virtual function version). Unfortunately I initially based the test project on a wxWidgets project in case I wanted to play with some graphical display, so if you don't have wxWidgets you will need to hack it into a command-line project to compile it.
I used Very Sleepy to benchmark it
further edit:
milianw's comment about profile-guided optimization turned out to be the solution, but as a comment I currently cannot mark it as the answer. Using the Pro version of Visual Studio with profile-guided optimization gave runtimes similar to those with inlined functions. I guess this is the Virtual Call Speculation described at http://msdn.microsoft.com/en-us/library/e7k32f4k.aspx. I still find it a bit odd that this code could be more easily optimized using function pointers instead of virtual functions, but I guess that is why everyone advises you to TEST, rather than assume one piece of code is faster than another.

Two things I can think of that differ when using function pointers vs virtual functions:
Your class can be smaller, since no vtable pointer has to be embedded in each object, making it more cache-friendly.
There is one level of indirection fewer with a function pointer (with virtual functions: object indirection, vtable indirection, virtual function indirection; with function pointers: object indirection, function pointer indirection, since the non-virtual update() wrapper is resolved at compile time).

As requested, here my comment again as an answer:
Try to use profile-guided optimizations here. Then, the profiler can potentially apply devirtualization to speed up the code. Also, don't forget to mark your implementations as final, which can further help. See e.g. http://channel9.msdn.com/Shows/C9-GoingNative/C9GoingNative-12-C-at-BUILD-2012-Inside-Profile-Guided-Optimization or the excellent GCC article series over at http://hubicka.blogspot.de/search/label/devirtualization.

The actual cost of a virtual function call is normally insignificant. However, virtual functions may, as you observed, impact the code speed considerably. The main reason is that a virtual function call is normally a real function call, adding a frame to the stack, because virtual calls are resolved at runtime.
If a function is not virtual, it is much easier for C++ compiler to inline it. The call is resolved in compilation time, so the compiler may replace the call with the body of the called function. This allows for much more aggressive optimisations - like doing some computations once only instead of each loop run, etc.

Based on the information provided here my best guess is that you're operating on a large number of objects, and that the one extra indirection induced by the virtual table is increasing cache misses to the point where the I/O to refetch from memory becomes measurable.
As another alternative have you considered using templates and either CRTP or a policy-based approach for the Link class? Depending on your needs it may be possible to entirely remove the dynamic dispatching.

Related

Compiler Optimizations - Function has no address

I have not used much pointers to member functions but I think that found some dangerous scenarios when using such pointers.
The problem comes when compiler decides not to assign address to function, because of some optimization. It happened with VS 2015 even in Debug, x86 (with disabled Optimization - /Od). I am refactoring one old system, moving some code in a common static library (common.lib) so to be able to be used from several projects. Even if not the best pattern, the old implementation depends heavily from function member pointers and I do not want to change this. For example, I added the interface ModuleBase to one very big old class to something like:
class ModuleBase
{
public:
typedef void (ModuleBase::*Main)() const; // moved from old module
virtual void FunctionMain() const = 0; // Function has no address, possibly due to compiler optimizations.
virtual void FunctionSecondary() const = 0; // Function has no address, possibly due to compiler optimizations.
};
class OldModule : public ModuleBase
{
public:
virtual void FunctionMain() const {};
virtual void FunctionSecondary() const {};
};
The idea was to move ModuleBase into the static library, but have OldModule remain in the main EXE project. While ModuleBase was in the main project it worked fine, but when I moved it into the static Common.lib it started crashing! It took me about 2 days to finally notice that in several places the compiler decided (but only for the static library) not to assign addresses to FunctionMain(), FunctionSecondary(), etc. from ModuleBase. So when pointers to these virtual functions were passed to other routines they were zeroes.
For example, in the code below:
new Manager::ModuleDecription(
"Test Module",
"Secondary Scene",
"Description",
PosX,
PosY,
Proc,
&ModuleBase::FunctionSecondary //contains nullptr when in static library!!!!!
);
The last member in the structure was zero, but only when in the static library. It was quite nasty because I had to check many other things before noticing this. Also, there were other pointers which were not zero, because the structure was not zeroed in the constructor, so one has to notice that the address value is different and that it crashes when trying to call the function.
So my questions are -
1) Am I seeing this right - is this a valid situation (the compiler removing function addresses for the same code when it is moved into a static library)?
2) How to force compiler always to keep the member function addresses?
My apologies - in the end I found no problems with the addresses of pointer-to-member functions in Visual Studio. Pointers to the base interface's virtual functions are resolved OK, even when placed in a static library. The reasons for my problems were:
1) Debugger sometimes shows function addresses of template classes as zeroes
2) The reason for the crashes was that the main project had the /vmg compiler option, but I had missed putting it in the static library project. In such cases one should be careful to use /vmg in all referenced library projects (the complications it causes are another topic).
Anyway, using pointer-to-member functions together with an object pointer is usually a sign of bad underlying design.
I hope this may help someone.

C++ Low latency Design: Function Dispatch v/s CRTP for Factory implementation

As part of a system design, we need to implement a factory pattern. In combination with the Factory pattern, we are also using CRTP, to provide a base set of functionality which can then be customized by the Derived classes.
Sample code below:
class FactoryInterface{
public:
virtual void doX() = 0;
};
//force all derived classes to implement custom_X_impl
template< typename Derived, typename Base = FactoryInterface>
class CRTP : public Base
{
public:
void doX(){
// do common processing..... then
static_cast<Derived*>(this)->custom_X_impl();
}
};
class Derived: public CRTP<Derived>
{
public:
void custom_X_impl(){
//do custom stuff
}
};
Although this design is convoluted, it does provide a few benefits. All the calls after the initial virtual function call can be inlined. The derived class custom_X_impl call is also made efficiently.
I wrote a comparison program to compare the behavior of a similar implementation (tight loop, repeated calls) using function pointers and virtual functions. This design came out on top with gcc/4.8 at O2 and O3.
A C++ guru however told me yesterday, that any virtual function call in a large executing program can take a variable time, considering cache misses and I can achieve a potentially better performance using C style function table look-ups and gcc hotlisting of functions. However I still see 2x the cost in my sample program mentioned above.
My questions are as below:
1. Is the guru's assertion true? For either answers, are there any links I can refer.
2. Is there any low latency implementation which I can refer, has a base class invoking a custom function in a derived class, using function pointers?
3. Any suggestions on improving the design?
Any other feedback is always welcome.
Your guru refers to the hot attribute of the gcc compiler. The effect of this attribute is:
The function is optimized more aggressively and on many targets it is
placed into a special subsection of the text section so all hot
functions appear close together, improving locality.
So yes, in a very large code base, the hotlisted function may remain in cache, ready to be executed without delay, because it avoids cache misses.
You can perfectly use this attribute for member functions:
struct X {
void test() __attribute__ ((hot)) {cout <<"hello, world !\n"; }
};
But...
When you use virtual functions, the compiler generally generates a vtable that is shared between all objects of the class. This table is a table of pointers to functions. And indeed, your guru is right: nothing guarantees that this table remains in cached memory.
But, if you manually create a "C-style" table of function pointers, the problem is EXACTLY THE SAME. While the function may remain in cache, nothing ensures that your function table remains in cache as well.
The main difference between the two approaches is that:
in the case of virtual functions, the compiler knows that the virtual function is a hot spot, and could decide to make sure to keep the vtable in cache as well (I don't know if gcc can do this or if there are plans to do so).
in the case of the manual function pointer table, your compiler will not easily deduce that the table belongs to a hot spot. So this attempt of manual optimization might very well backfire.
My opinion: never try to optimize yourself what a compiler can do much better.
Conclusion
Trust your benchmarks. And trust your OS: if your function or your data is frequently accessed, there is a good chance that a modern OS will take this into account in its virtual memory management, whatever the compiler generates.

interface overhead

I have a simple class that looks like Boost.Array. There are two template parameters, T and N. One drawback of Boost.Array is that every method using such an array has to be a template with parameter N (T is OK). The consequence is that the whole program tends to become a template. One idea is to create an interface (an abstract class with only pure virtual functions) that depends only on T (something like ArrayInterface). Every other class then accesses only the interface and therefore needs only the template parameter T (which, in contrast to N, is more or less always known). The drawback here is the overhead of the virtual call (more the missed opportunity to inline calls) when the interface is used. Up to here, only facts.
template<typename T>
class ArrayInterface {
public:
virtual ~ArrayInterface() {};
virtual T Get(int i) = 0;
};
template<typename T, int N>
class Array : public ArrayInterface<T> {
public:
T Get(int i) { ... }
};
template<typename T, int N>
class ArrayWithoutInterface {
public:
T Get(int i) { ... }
};
But my real problem lies somewhere else. When I extend Boost.Array with an interface, a direct instantiation of Boost.Array gets slow (a factor of 4 in one case where it matters). If I remove the interface, Boost.Array is as fast as before. I understand that if a method is called through ArrayInterface there is an overhead; that is OK. But I don't understand why a call to a method gets slower when there is only an additional interface with only pure virtual methods and the class is called directly.
Array<int, 1000> a;
a.Get(0); // Slow
ArrayWithoutInterface<int, 1000> awi;
awi.Get(0); // Fast
GCC 4.4.3 and Clang 1.1 show the same behavior.
This behaviour is expected: you’re invoking a virtual method. Whether you’re invoking it directly or via a base class pointer isn’t relevant at first: in both cases, the call has got to go through the virtual function table.
For a simple call such as Get (that simply dereferences an array cell, presumably without bounds checking), this can indeed make a factor 4 difference.
Now, a good compiler could see that the added indirection isn’t necessary here since the dynamic type of the object (and hence the method call target) is known at compile time. I’m mildly surprised that GCC apparently doesn’t optimize this (did you compile with -O3?). Then again, it’s only an optimization.
I'd disagree with your conclusion that 'the whole program tends to be a template': it looks to me like you're trying to solve a non-problem.
However, it's unclear what you mean by 'extend Boost.Array with an interface': are you modifying the source of boost::array by introducing your interface? If so, every array instance you are creating has to drag a vtable pointer along, whether or not you use the virtual methods. The existence of virtual methods may also make the compiler wary of using aggressive optimizations on non-virtual methods possible in a purely header-defined class.
Edited: ...and of course, you are using the virtual method. It takes pretty advanced code analysis techniques for the compiler to be certain a virtual call can be optimized away.
Two reasons:
late binding is slow
virtual methods cannot be inlined
If you have a virtual method that never gets overridden, it is quite possible that the compiler optimizes out the virtual part of the call. During a normal non-virtual method call, the program flow goes directly from the caller to the method. However, when the method is marked virtual, the CPU must first jump to a virtual table, then find the method you're looking for, and then jump to that method.
Now this normally isn't too noticeable. If the method you're calling takes 100ms to execute, even if your virtual table lookup takes 1ms, it's not going to matter. But if, in the case of an array, your method takes 0.5ms to execute that 1ms performance drop is going to be quite noticeable.
There's not much you can do about it, except don't extend Boost.Array, or rename your methods so they don't override.

Virtual Function Implementation

I keep hearing this statement: switch..case is evil for code maintenance, but it provides better performance (since the compiler can inline things, etc.). Virtual functions are very good for code maintenance, but they incur a performance penalty of two pointer indirections.
Say i have a base class with 2 subclasses(X and Y) and one virtual function, so there will be two virtual tables. The object has a pointer, based on which it will choose a virtual table. So for the compiler, it is more like
switch( object's function ptr )
{
case 0x....:
X->call();
break;
case 0x....:
Y->call();
break;
};
So why should a virtual function cost more if it can be implemented this way, since the compiler can do the same inlining and other optimizations here? Or explain to me why it was decided not to implement virtual function execution this way.
Thanks,
Gokul.
The compiler can't do that because of the separate compilation model.
At the time the virtual function call is being compiled, there is no way for the compiler to know for sure how many different subclasses there are.
Consider this code:
// base.h
class base
{
public:
virtual void doit();
};
and this:
// usebase.cpp
#include "base.h"
void foo(base &b)
{
b.doit();
}
When the compiler is generating the virtual call in foo, it has no knowledge of which subclasses of base will exist at runtime.
Your question rests on misunderstandings about the way switches and virtual functions work. Rather than fill up this box with a long treatise on code generation, I'll give a few bullet points:
Switch statements aren't necessarily faster than virtual function calls, or inlined. You can learn more about the way that switch statements are turned into assembly here and here.
The thing that is slow about virtual function calls isn't the pointer lookups, it's the indirect branch. For complicated reasons having to do with the internal electronics of the CPU, for most modern processors it is faster to perform a "direct branch", where the destination address is encoded in the instruction, than an "indirect branch", where the address is computed at runtime. Virtual function calls and large switch statements are usually implemented as indirect branches.
In your example above, the switch is completely redundant. Once an object's member function pointer has been computed, the CPU can branch straight to it. Even if the linker was aware of every possible member object that existed in the executable, it would still be unnecessary to add that table lookup.
Here's some results from concrete tests. These particular results are from VC++ 9.0/x64:
Test Description: Time to test a global using a 10-way if/else if statement
CPU Time: 7.70 nanoseconds plus or minus 0.385
Test Description: Time to test a global using a 10-way switch statement
CPU Time: 2.00 nanoseconds plus or minus 0.0999
Test Description: Time to test a global using a 10-way sparse switch statement
CPU Time: 3.41 nanoseconds plus or minus 0.171
Test Description: Time to test a global using a 10-way virtual function class
CPU Time: 2.20 nanoseconds plus or minus 0.110
With sparse cases, the switch statement is substantially slower. With dense cases, the switch statement might be faster, but the switch and the virtual function dispatch overlap a bit, so while the switch is probably faster, the margin is so small we can't even be sure it is faster, not to mention being enough faster to care much about. If the cases in the switch statement are sparse at all, there's no real question that the virtual function call will be faster.
There is no branching in virtual dispatch. The vptr in your class points to a vtable, with a second pointer for the specific function at a constant offset.
Actually, if you have many virtual functions, switch-like branching will be slower than the two pointer indirections. The performance of the current implementation doesn't depend on how many virtual functions you have.
Your statement about branching when calling a virtual function is wrong. There is no such thing in the generated code. Taking a look at the assembly will give you a better idea.
In a nutshell, one general, simplified implementation of C++ virtual functions is: each class has a virtual table (vtbl), and each instance of the class has a virtual table pointer (vptr). The virtual table is basically a list of function pointers.
When you are calling a virtual function, say it is like:
class Base { public: virtual void someVirtualFunction() {} };
class Derived : public Base {};
Base* pB = new Derived();
pB->someVirtualFunction();
The 'someVirtualFunction()' will have a corresponding index in the vtbl. And the call
pB->someVirtualFunction();
will be converted to something like:
pB->vptr[k](); //k is the index of the 'someVirtualFunction'.
In this way the function is actually called indirectly and it has the polymorphism.
I suggest you to read 'The C++ Object Model' by Stanley Lippman.
Also, the statement that a virtual function call is slower than switch-case is not accurate. It depends. As you can see above, a virtual function call is just one extra dereference compared to a regular function call, while switch-case branching adds comparison logic (which introduces the chance of CPU cache misses) that also consumes cycles. I would say in most cases, if not all, a virtual function call should be faster than switch-case.
Saying definitively that switch/case is more or less performant than virtual calls is an over-generalization. The truth is that this will depend on many things, and will vary based on:
what compiler you are using
what optimizations are enabled
the overall characteristics of your program and how they impact those optimizations
If you are optimizing the code in your head as you write it, there is a good chance that you are making the wrong choice. Write the code in a human-readable and/or user-friendly way first, then run the entire executable through profiling tools. If this area of the code shows up as a hotspot, then try it both ways and see which is quantifiably better for your particular case.
These kinds of optimizations are only possible with a repatching linker running as part of the C++ runtime.
The C++ runtime is complex enough that even a newly loaded DLL (with COM) can add new function pointers to the vtable (think about pure virtual functions).
So neither the compiler nor the linker can do this optimization. switch/case is obviously faster than an indirect call, since CPU prefetch is deterministic and pipelining is possible, but it does not work out in C++ because of this runtime extension of the object's vtable.

Am I abusing Policies?

I find myself using policies a lot in my code and usually I'm very happy with that.
But from time to time I find myself confronted with using that pattern in situations where the policies are selected at runtime, and I have developed habits to work around such situations. Usually I start with something like this:
class DrawArrays {
protected:
void sendDraw() const;
};
class DrawElements {
public:
void setIndices( GLenum mode, GLsizei count, GLenum type, const GLvoid *indices);
protected:
void sendDraw() const;
};
template<class Policy>
class Vertices : public Policy {
using Policy::sendDraw;
public:
void render() const;
};
When the policy is picked at runtime I have different choices of working around the situation.
Different code paths:
if(drawElements) {
Vertices<DrawElements> vertices;
} else {
Vertices<DrawArrays> vertices;
}
Inheritance and virtual calls:
class PureVertices {
public:
virtual void render() const = 0;
};
template<class Policy>
class Vertices : public PureVertices, public Policy {
//..
};
Both solutions feel wrong to me. The first creates an unmaintainable mess and the second introduces the overhead of virtual calls that I tried to avoid by using policies in the first place.
Am I missing the proper solutions or do I use the wrong pattern to solve the problem?
Use the second version. Virtual calls are more expensive than static calls because they require an additional pointer lookup, but if "sendDraw" does any real drawing, you won't notice the difference. If you really have a performance problem later, use a profiler to find out where the problem is and fix it. In the (extremely unlikely) case that the virtual method call is actually a performance problem, you could try optimizing it using policies. Until then, write code that's most maintainable so you have development time left to optimize later.
Remember: premature optimization is the root of all evil!
In general, if you need behavior to vary at runtime, you are going to have to pay some overhead cost for that, whether it be a switch/if statement or a virtual call. The question is how much runtime variance you need. If you're very confident you will only ever have a small number of types, then a switch statement may really be appropriate. Virtual calls give more flexibility for extending in the future, but you don't necessarily need that flexibility; it depends on the problem. That said, there are still a lot of ways to implement your 'switch statement' or your 'virtual call'. Instead of a switch/if you could use the Visitor pattern (more maintainable), and instead of virtual calls you could use function pointers (when it doesn't make sense for the class itself to specify the behavior that is invoked at runtime). Also, although I don't agree with everything the author says (I think he artificially makes his idea and OOP mutually exclusive), you might be interested in Data Oriented Programming, especially if you're working on rendering, as your class names suggest.
Why do you oppose virtual calls? Is the overhead really considerable for you? I think the code becomes more readable when you express what you want to do by writing an interface and different implementations instead of some unreadable templates.
Anyway, why do you inherit Vertices from Policy class? You already have it as a template argument. Looks like composition is more appropriate here. If you use inheritance, you can have just one non-template class Vertices and change its behaviour by passing different Policy objects - this is Strategy pattern.
class Policy {
public:
virtual void sendDraw() const = 0;
};
class Vertices {
public:
Vertices(Policy *policy)
: policy(policy)
{
}
void render() {
// Do something with policy->sendDraw();
}
private:
Policy *policy;
};
I don't see anything wrong with the first one - it doesn't look like an unmaintainable mess to me, although there's not enough code here to determine if there might be a better refactoring.
If you aren't putting the draw calls into a display list then the array data will have to be copied out when it's drawn. (Either the caller blocks until the GPU is done, or the driver copies it out of the app memory to somewhere safe.) So the virtual function won't be an issue. And if you ARE putting them in a display list, then the virtual function won't be an issue, because it's only being set up the once.
And in any event, PCs do virtual calls very quickly. They're not free, it's true, but if drawing (say) thousands of sets of vertices per frame then an extra virtual function call per draw is highly unlikely to break the bank. Out of all the things to think about ahead of time, avoiding uses of a virtual function in the very sorts of situation that virtual functions are designed for is probably one of the less important ones. Unnecessarily-virtual functions are worth avoiding; genuinely useful virtual functions are innocent until proven guilty...
(Drawing more vertices per call and changing shader, shader constants, vertex format and render state settings less frequently are likely to pay greater dividends.)