Virtual function performance: one large class vs many smaller subclasses - c++
I am in the process of refactoring a C++ OpenGL app I made (technically, an app that makes heavy use of the thin OpenGL wrapper in Qt's QQuickItem class). My app is performing OK but could be better.
One of the issues I'm curious about relates to the use of virtual functions in very time-sensitive (frame rate) algorithms. My OpenGL drawing code calls many virtual functions on the various objects that need drawing. Since this happens many times per second, I am wondering if the virtual dispatch could bring down the frame rate.
I am thinking about changing to a different structure instead. It avoids inheritance by keeping everything in one class, where the previously virtual functions simply contain switch statements that call the appropriate routine based on the object's "type", which is really just an enum:
Previously:
struct Base {
    virtual void a() = 0;
    virtual void b() = 0;
};
struct One : public Base {
    void a() { ... }
    void b() { ... }
};
Considering:
struct Combined {
    MyEnumTypeDef t; // essentially holds what "type" of object this is
    void a() {
        switch (t) {
        case One:
            ....
            break;
        case Two:
            ....
            break;
        }
    }
};
When function a() is called very often in OpenGL drawing routines, I am tempted to think that the Combined class will be considerably more efficient since it does not require the dynamic dispatch on virtual tables.
I'd appreciate some advice on this issue, if this is wise or not.
In your case, it probably does not matter. I say probably because, and I mean this constructively, you did not specify performance requirements or how often the function in question is called, which suggests you may not have enough information to make a judgment yet. The blanket "don't speculate: profile" response really only intends to make sure you have all the necessary information, because premature micro-optimizations are very common and our real goal is to help you out in the big picture.
Jeremy Friesner really hit the nail on the head with his comment on another answer here:
If you don't understand why it is slow you won't be able to speed it up.
So, with all that in mind, assume that one of the following holds: A) your performance requirements are already being met (e.g. you are pulling 4000 FPS, far higher than any display refresh rate); B) you are struggling to meet performance requirements and this function is only called a few (say < 1000-ish) times per frame; or C) you are struggling to meet performance requirements and this function is called often but does a lot of other significant work (and thus function call overhead is negligible). Then:
Using a virtual function may, at most, end up in one extra lookup in a table somewhere (and possibly some cache misses - but not so much if it is accessed repeatedly in, say, an inner loop), which is a few CPU clock cycles worst case (and most likely still less than a switch, although that is really moot here), and this is completely insignificant compared to your target frame rate, the effort required to render a frame, and any other algorithms and logic you are performing. If you'd like to prove it to yourself, profile.
What you should do is use whatever technique leads to the clearest, cleanest, most maintainable and readable code. Micro-optimizations such as this are not going to have an effect, and the cost in code maintainability, even if it is minor, is not worth the benefit, which is essentially zero.
What you should also do is sit down and get a handle on your actual situation. Do you need to improve performance? Is this function called enough to actually have a significant impact or should you be concentrating on other techniques (e.g. higher level algorithms, other design strategies, off-loading computation to the GPU or using machine-specific optimizations e.g. bulk operations with SSE, etc.)?
One thing you could do, in the absence of concrete information, is try both approaches. While performance will differ from machine to machine, you could at least get a rough idea of the impact of this particular bit of code on your overall performance (e.g. if you're shooting for 60 FPS, and these two options give you 23.2 FPS vs. 23.6 FPS, then this isn't where you want to focus, and possible sacrifices made by choosing one of these strategies over the other may not be worth it).
Also consider using display lists, vertex and index buffers, etc. OpenGL provides many facilities for optimizing drawing of objects where certain aspects remain constant. For example, if you have a huge surface model with small parts whose vertex coordinates change often, divide the model into sections, using display lists, and only update the display list for a section when it has changed since the last redraw. Leave e.g. coloring and texturing out of the display list (or use coordinate arrays) if they change often. That way you can avoid calling your functions altogether.
If you're curious, here is a test program (which probably does not represent your actual usage, again, this is not possible to answer with the information given - this test is the one requested in comments below). This does not mean these results will be reflected in your program and, again, you need to have concrete information about your actual requirements. But, here this is just for giggles:
This test program compares a switch-based operation vs. a virtual-function based operation vs. pointer-to-member (where member is called from another class member function) vs. pointer-to-member (where member is called directly from test loop). It also performs three types of tests: A run on a dataset with just one operator, a run that alternates back and forth between two operators, and a run that uses a random mix of two operators.
Output when compiled with gcc -O0, for 1,000,000,000 iterations:
$ g++ -O0 tester.cpp
$ ./a.out
--------------------
Test: time=6.34 sec (switch add) [-358977076]
Test: time=6.44 sec (switch subtract) [358977076]
Test: time=6.96 sec (switch alternating) [-281087476]
Test: time=18.98 sec (switch mixed) [-314721196]
Test: time=6.11 sec (virtual add) [-358977076]
Test: time=6.19 sec (virtual subtract) [358977076]
Test: time=7.88 sec (virtual alternating) [-281087476]
Test: time=19.80 sec (virtual mixed) [-314721196]
Test: time=10.96 sec (ptm add) [-358977076]
Test: time=10.83 sec (ptm subtract) [358977076]
Test: time=12.53 sec (ptm alternating) [-281087476]
Test: time=24.24 sec (ptm mixed) [-314721196]
Test: time=6.94 sec (ptm add (direct)) [-358977076]
Test: time=6.89 sec (ptm subtract (direct)) [358977076]
Test: time=9.12 sec (ptm alternating (direct)) [-281087476]
Test: time=21.19 sec (ptm mixed (direct)) [-314721196]
Output when compiled with gcc -O3, for 1,000,000,000 iterations:
$ g++ -O3 tester.cpp ; ./a.out
--------------------
Test: time=0.87 sec (switch add) [372023620]
Test: time=1.28 sec (switch subtract) [-372023620]
Test: time=1.29 sec (switch alternating) [101645020]
Test: time=7.71 sec (switch mixed) [855607628]
Test: time=2.95 sec (virtual add) [372023620]
Test: time=2.95 sec (virtual subtract) [-372023620]
Test: time=14.74 sec (virtual alternating) [101645020]
Test: time=9.39 sec (virtual mixed) [855607628]
Test: time=4.20 sec (ptm add) [372023620]
Test: time=4.21 sec (ptm subtract) [-372023620]
Test: time=13.11 sec (ptm alternating) [101645020]
Test: time=9.32 sec (ptm mixed) [855607628]
Test: time=3.37 sec (ptm add (direct)) [372023620]
Test: time=3.37 sec (ptm subtract (direct)) [-372023620]
Test: time=13.08 sec (ptm alternating (direct)) [101645020]
Test: time=9.74 sec (ptm mixed (direct)) [855607628]
Note that -O3 does a lot, and without looking at the assembler, we cannot use this as a 100% accurate representation of the problem at hand.
In the unoptimized case, we notice:
Virtual outperforms switch in runs of a single operator.
Switch outperforms virtual in cases where multiple operators used.
Pointer-to-member when the member is called directly (object->*ptm_) is similar to, but slower than, virtual.
Pointer-to-member when the member is called through another method (object->doit() where doit() calls this->*ptm_) takes a little under twice the time.
As expected, "mixed" case performance suffers due to branch prediction failures.
In the optimized case:
Switch outperforms virtual in all cases.
Similar characteristics for pointer-to-member as unoptimized case.
All "alternating" cases that involve function pointers at some point are slower than with -O0 and slower than "mixed" for reasons I do not understand. This does not occur on my PC at home.
What is especially significant here is how much the effects of e.g. branch prediction outweigh any choice of "virtual" vs. "switch". Again, be sure you understand your code and are optimizing the right thing.
The other significant thing here is this represents time differences on the order of 1-14 nanoseconds per operation. This difference could be significant for large numbers of operations, but is likely negligible compared to other things you are doing (note that these functions perform only a single arithmetic operation, any more than that will quickly dwarf the effects of virtual vs. switch).
Note also that while calling the pointer-to-member directly showed an "improvement" over calling it through another class member, this has potentially large impacts on overall design as such an implementation (at least in this case, where something outside the class calls the member directly) could not be dropped in as a direct replacement for another implementation due to different syntax for calling pointer-to-member functions (-> vs. ->*). I had to create a whole separate set of test cases just to handle that, for example.
Conclusion
The performance difference will easily be dwarfed by even a couple of extra arithmetic operations. Note also that branch prediction has a far more significant impact in all but the "virtual alternating" case with -O3. However, the test is also unlikely to be representative of the actual application (which the OP has kept a secret), and -O3 introduces even more variables, so the results must be taken with a grain of salt and are unlikely to be applicable to other scenarios (in other words, the test may be interesting, but is not particularly meaningful).
Source:
// === begin timing ===
#ifdef __linux__
# include <sys/time.h>
typedef struct timeval Time;
static void tick (Time &t) {
    gettimeofday(&t, 0);
}
static double delta (const Time &a, const Time &b) {
    return
        (double)(b.tv_sec - a.tv_sec) +
        (double)(b.tv_usec - a.tv_usec) / 1000000.0;
}
#else // windows; untested, working from memory; sorry for compile errors
# include <windows.h>
typedef LARGE_INTEGER Time;
static void tick (Time &t) {
    QueryPerformanceCounter(&t);
}
static double delta (const Time &a, const Time &b) {
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    return (double)(b.QuadPart - a.QuadPart) / (double)freq.QuadPart;
}
#endif
// === end timing
#include <cstdio>
#include <cstdlib>
#include <ctime>
using namespace std;
// Size of dataset.
static const size_t DATASET_SIZE = 10000000;
// Repetitions per test.
static const unsigned REPETITIONS = 100;
// Class performs operations with a switch statement.
class OperatorSwitch {
public:
    enum Op { Add, Subtract };
    explicit OperatorSwitch (Op op) : op_(op) { }
    int perform (int a, int b) const {
        switch (op_) {
        case Add: return a + b;
        case Subtract: return a - b;
        }
        return 0; // not reached for valid Op values; avoids falling off the end
    }
private:
    Op op_;
};
// Class performs operations with pointer-to-member.
class OperatorPTM {
public:
    enum Op { Add, Subtract };
    explicit OperatorPTM (Op op) {
        perform_ = (op == Add) ?
            &OperatorPTM::performAdd :
            &OperatorPTM::performSubtract;
    }
    int perform (int a, int b) const { return (this->*perform_)(a, b); }
    int performAdd (int a, int b) const { return a + b; }
    int performSubtract (int a, int b) const { return a - b; }
//private: (left public so the "direct" tests can call perform_ themselves)
    int (OperatorPTM::*perform_) (int, int) const;
};
// Base class for virtual-function test operator.
class OperatorBase {
public:
    virtual ~OperatorBase () { }
    virtual int perform (int a, int b) const = 0;
};
// Addition
class OperatorAdd : public OperatorBase {
public:
    int perform (int a, int b) const { return a + b; }
};
// Subtraction
class OperatorSubtract : public OperatorBase {
public:
    int perform (int a, int b) const { return a - b; }
};
// No base
// Addition
class OperatorAddNoBase {
public:
    int perform (int a, int b) const { return a + b; }
};
// Subtraction
class OperatorSubtractNoBase {
public:
    int perform (int a, int b) const { return a - b; }
};
// Processes the dataset a number of times, using 'oper'.
template <typename T>
static void test (const int *dataset, const T *oper, const char *name) {
    int result = 0;
    Time start, stop;
    tick(start);
    for (unsigned n = 0; n < REPETITIONS; ++ n)
        for (size_t i = 0; i < DATASET_SIZE; ++ i)
            result = oper->perform(result, dataset[i]);
    tick(stop);
    // result is computed and printed so optimizations do not discard it.
    printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
    fflush(stdout);
}
// Processes the dataset a number of times, alternating between 'oper[0]'
// and 'oper[1]' per element.
template <typename T>
static void testalt (const int *dataset, const T * const *oper, const char *name) {
    int result = 0;
    Time start, stop;
    tick(start);
    for (unsigned n = 0; n < REPETITIONS; ++ n)
        for (size_t i = 0; i < DATASET_SIZE; ++ i)
            result = oper[i&1]->perform(result, dataset[i]);
    tick(stop);
    // result is computed and printed so optimizations do not discard it.
    printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
    fflush(stdout);
}
// Processes the dataset a number of times, choosing between 'oper[0]'
// and 'oper[1]' randomly (based on value in dataset).
template <typename T>
static void testmix (const int *dataset, const T * const *oper, const char *name) {
    int result = 0;
    Time start, stop;
    tick(start);
    for (unsigned n = 0; n < REPETITIONS; ++ n)
        for (size_t i = 0; i < DATASET_SIZE; ++ i) {
            int d = dataset[i];
            result = oper[d&1]->perform(result, d);
        }
    tick(stop);
    // result is computed and printed so optimizations do not discard it.
    printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
    fflush(stdout);
}
// Same as test() but calls perform_() pointer directly.
static void test_ptm (const int *dataset, const OperatorPTM *oper, const char *name) {
    int result = 0;
    Time start, stop;
    tick(start);
    for (unsigned n = 0; n < REPETITIONS; ++ n)
        for (size_t i = 0; i < DATASET_SIZE; ++ i)
            result = (oper->*(oper->perform_))(result, dataset[i]);
    tick(stop);
    // result is computed and printed so optimizations do not discard it.
    printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
    fflush(stdout);
}
// Same as testalt() but calls perform_() pointer directly.
static void testalt_ptm (const int *dataset, const OperatorPTM * const *oper, const char *name) {
    int result = 0;
    Time start, stop;
    tick(start);
    for (unsigned n = 0; n < REPETITIONS; ++ n)
        for (size_t i = 0; i < DATASET_SIZE; ++ i) {
            const OperatorPTM *op = oper[i&1];
            result = (op->*(op->perform_))(result, dataset[i]);
        }
    tick(stop);
    // result is computed and printed so optimizations do not discard it.
    printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
    fflush(stdout);
}
// Same as testmix() but calls perform_() pointer directly.
static void testmix_ptm (const int *dataset, const OperatorPTM * const *oper, const char *name) {
    int result = 0;
    Time start, stop;
    tick(start);
    for (unsigned n = 0; n < REPETITIONS; ++ n)
        for (size_t i = 0; i < DATASET_SIZE; ++ i) {
            int d = dataset[i];
            const OperatorPTM *op = oper[d&1];
            result = (op->*(op->perform_))(result, d);
        }
    tick(stop);
    // result is computed and printed so optimizations do not discard it.
    printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
    fflush(stdout);
}
int main () {
    int *dataset = new int[DATASET_SIZE];
    srand((unsigned)time(NULL));
    for (size_t n = 0; n < DATASET_SIZE; ++ n)
        dataset[n] = rand();
    OperatorSwitch *switchAdd = new OperatorSwitch(OperatorSwitch::Add);
    OperatorSwitch *switchSub = new OperatorSwitch(OperatorSwitch::Subtract);
    OperatorSwitch *switchAlt[2] = { switchAdd, switchSub };
    OperatorBase *virtAdd = new OperatorAdd();
    OperatorBase *virtSub = new OperatorSubtract();
    OperatorBase *virtAlt[2] = { virtAdd, virtSub };
    OperatorPTM *ptmAdd = new OperatorPTM(OperatorPTM::Add);
    OperatorPTM *ptmSub = new OperatorPTM(OperatorPTM::Subtract);
    OperatorPTM *ptmAlt[2] = { ptmAdd, ptmSub };
    // Repeats the whole suite until the program is interrupted.
    while (true) {
        printf("--------------------\n");
        test(dataset, switchAdd, "switch add");
        test(dataset, switchSub, "switch subtract");
        testalt(dataset, switchAlt, "switch alternating");
        testmix(dataset, switchAlt, "switch mixed");
        test(dataset, virtAdd, "virtual add");
        test(dataset, virtSub, "virtual subtract");
        testalt(dataset, virtAlt, "virtual alternating");
        testmix(dataset, virtAlt, "virtual mixed");
        test(dataset, ptmAdd, "ptm add");
        test(dataset, ptmSub, "ptm subtract");
        testalt(dataset, ptmAlt, "ptm alternating");
        testmix(dataset, ptmAlt, "ptm mixed");
        test_ptm(dataset, ptmAdd, "ptm add (direct)");
        test_ptm(dataset, ptmSub, "ptm subtract (direct)");
        testalt_ptm(dataset, ptmAlt, "ptm alternating (direct)");
        testmix_ptm(dataset, ptmAlt, "ptm mixed (direct)");
    }
}
The model of "lots of objects that draw themselves" is attractive, but bad in a sneaky way. The problem is not the virtual function call overhead (which exists, but is small); it is that the model encourages a rendering anti-pattern: letting every object draw itself in isolation. It sounds like one of those things touted in "software engineering best practices", but it isn't; it's very bad. Every object would make a lot of expensive API calls (such as binding shaders and textures). Now, I don't really know what your code looks like, and maybe it doesn't work like this. The objects aren't necessarily bad; it's how they're used.
Anyway, here are some suggestions.
Sort your objects by the state (shader, texture, vertex buffer, in that order) they want (actually, don't sort: put them in buckets and iterate over those). This is easy, everyone does it, and it may be enough.
Merge states, so there's nothing to switch between. Use übershaders. Use texture arrays, or better, bindless textures (which don't have the problem that all slices have to be the same format/size/etc). Use a huge vertex buffer that you put everything into. Use uniform buffers. Use persistent mapping for the dynamic buffers.
And finally, glMultiDrawElementsIndirect. If you've put everything into buffers anyway, as per the previous suggestion, then you will need only very few calls to glMultiDrawElementsIndirect. Very few, you can do a lot with just one call. What you'd probably use otherwise is a bunch of glDrawArrays with no binding in between, which isn't bad either, but it's not much effort to make it even better.
The end result is that the actual drawing code has almost disappeared. Almost all API calls are gone, replaced by writes to buffers.
It would be faster to not use virtual functions, but whether the difference is significant is hard to say. You should run your program through a profiler to see where it's spending its time. You might discover that the CPU cycles are spent on something entirely different and you'd be wasting your time (and degrading your design) by messing with the virtuals.
Unix: How can I profile C++ code running in Linux?
Windows: What's the best free C++ profiler for Windows?
Another option to consider might be to use the Curiously Recurring Template Pattern: http://en.wikipedia.org/wiki/Curiously_recurring_template_pattern to get similar polymorphism without using virtuals.
Related
How can I only allow a function A to be called only after function B is called?
Consider the following class:
class A {
public:
    int number;
    vector<int> powers;
    A () {
        number = 5;
        powers.resize(100);
    }
    long long getPower(int x) {
        return powers[x];
    }
    void precompute() {
        powers[0] = 1;
        for (int i = 1; i < 100; i++) {
            powers[i] = powers[i - 1] * number;
        }
    }
};
In the class A, we have a vector called powers and an integer number, with the property that powers[k] stores the quantity number^k after the precompute() function has been called. If we want to answer several queries of the form "Compute number^x for some integer 0 <= x < 100", it makes sense to precompute all of these powers and return them when we need them as a constant-time operation (note: this is not a problem that I am actually facing. I have made this problem up for the sake of an example. Please ignore the fact that number^x would exceed the maximum value of a long long).
However, there is one issue: the user must have called the precompute() function before calling the getPower() function. This leads me to the following question: Is there some nice way to enforce the constraint that some function A can only be called after function B is called? Of course, one could just use a flag variable, but I am wondering if there is a more elegant way to do this so that it becomes a compile-time error.
Another option would be to always call the precompute() function in the constructor, but this may not be an optimal solution if we weren't always going to call precompute() in the first place. If calling precompute() is sufficiently expensive (computationally), then this method would not be preferable.
I would prefer getting a compile-time error over a runtime error, but I am open to all approaches. Does anyone have any ideas?
One solution to your problem would be to call the precompute function in the constructor of class A.
Alternatively, as has already been suggested in the comments section, you could make the function getPower check a flag which specifies whether precompute has already been called, and if not, either perform the call itself or print an error message.
I can't think of a way to force this check to be done at compile time. However, if you want to eliminate this run-time check from release builds, you could use conditional compilation so that these checks are only included in debug builds, for example by using the assert macro or by using preprocessor directives, like this:
// note that NDEBUG is normally only defined in release builds, not debug builds
#ifndef NDEBUG
//check for flag here and print error message if flag has unexpected value
#endif
As an alternative, to enforce the timing dependency, you might make the dependency explicit. For example:
#include <vector>
class A;
class PowerGetter {
    friend class A;
    const A& a;
    // private constructor: only A (a friend) can create a PowerGetter
    explicit PowerGetter(const A& a_) : a(a_) {}
public:
    long long getPower(int x) const;
};
class A {
public:
    int number = 5;
    std::vector<int> powers = std::vector<int>(100);
    A() = default;
    PowerGetter precompute() {
        powers[0] = 1;
        for (int i = 1; i < 100; i++) {
            powers[i] = powers[i - 1] * number;
        }
        return PowerGetter(*this);
    }
};
// defined after A so that A is a complete type here
inline long long PowerGetter::getPower(int x) const { return a.powers[x]; }
Then to call getPower we need a PowerGetter, which can only be obtained by calling precompute first.
For that contrived example, it would be simpler to place the initialization in A itself, though.
Am I missing something or are Virtual calls not as bad performance as people make of them
I have been developing a simple framework for embedded environments. I came to a design decision on whether to use virtual calls, CRTP, or maybe a switch statement. I have been told that vtables perform poorly in embedded.
Following up from this question, vftable performance penalty vs. switch statement, I decided to run my own test. I ran three different ways to call a member function:
using the etl library's etl::function (the etl is a library meant to mimic the STL, but for embedded environments, with no dynamic allocations)
using a master switch statement that will call an object's function based on the object's int ID
using a pure virtual call to a base class
I never tried this with a basic CRTP pattern, but etl::function was supposed to be a variation on that, where that was the mechanism used for the pattern.
The times I got on MSVC, with similar performance on an ARM Cortex M4, were:
etl : 400 million nanoseconds
switch : 420 million nanoseconds
virtual: 290 million nanoseconds
The pure virtual calls are significantly faster. Am I missing something, or are virtual calls just not as bad as people make them out to be? Here is the code used for the tests.
class testetlFunc {
public:
    uint32_t a;
    testetlFunc() { a = 0; };
    void foo();
};
class testetlFunc2 {
public:
    uint32_t a;
    testetlFunc2() { a = 0; };
    virtual void foo() = 0;
};
void testetlFunc::foo() { a++; }
class testetlFuncDerived : public testetlFunc2 {
public:
    testetlFuncDerived();
    void foo() override;
};
testetlFuncDerived::testetlFuncDerived() { }
void testetlFuncDerived::foo() { a++; }
etl::ifunction<void>* timer1_callback1;
etl::ifunction<void>* timer1_callback2;
etl::ifunction<void>* timer1_callback3;
etl::ifunction<void>* timer1_callback4;
etl::ifunction<void>* etlcallbacks[4];
testetlFunc ttt;
testetlFunc ttt2;
testetlFunc ttt3;
testetlFunc ttt4;
testetlFuncDerived tttd1;
testetlFuncDerived tttd2;
testetlFuncDerived tttd3;
testetlFuncDerived tttd4;
testetlFunc2* tttarr[4];
static void MasterCallingFunction(uint16_t ID) {
    switch (ID) {
    case 1: ttt.foo(); break;
    case 2: ttt2.foo(); break;
    case 3: ttt3.foo(); break;
    case 4: ttt4.foo(); break;
    default: break;
    }
};
int main() {
    tttarr[0] = (testetlFunc2*)&tttd1;
    tttarr[1] = (testetlFunc2*)&tttd2;
    tttarr[2] = (testetlFunc2*)&tttd3;
    tttarr[3] = (testetlFunc2*)&tttd4;
    etl::function_imv<testetlFunc, ttt, &testetlFunc::foo> k;
    timer1_callback1 = &k;
    etl::function_imv<testetlFunc, ttt2, &testetlFunc::foo> k2;
    timer1_callback2 = &k2;
    etl::function_imv<testetlFunc, ttt3, &testetlFunc::foo> k3;
    timer1_callback3 = &k3;
    etl::function_imv<testetlFunc, ttt4, &testetlFunc::foo> k4;
    timer1_callback4 = &k4;
    etlcallbacks[0] = timer1_callback1;
    etlcallbacks[1] = timer1_callback2;
    etlcallbacks[2] = timer1_callback3;
    etlcallbacks[3] = timer1_callback4;
    //results for etl::function --------------
    int rng;
    srand(time(0));
    StartTimer(1)
    for (uint32_t i = 0; i < 2000000; i++) {
        rng = rand() % 4 + 0;
        for (uint16_t j= 0; j < 4; j++) {
            (*etlcallbacks[rng])();
        }
    }
    StopTimer(1)
    //results for switch --------------
    StartTimer(2)
    for (uint32_t i = 0; i < 2000000; i++) {
        rng = rand() % 4 + 0;
        for (uint16_t j = 0; j < 4; j++) {
            MasterCallingFunction(rng);
        }
    }
    StopTimer(2)
    //results for virtual vtable --------------
    StartTimer(3)
    for (uint32_t i = 0; i < 2000000; i++) {
        rng = rand() % 4 + 0;
        for (uint16_t j = 0; j < 4; j++) {
            tttarr[rng]->foo();
            //ttt.foo();
        }
    }
    StopTimer(3)
    PrintAllTimerDuration
}
If what you really need is virtual dispatch, C++'s virtual calls are probably the most performant implementation you can get, and you should use them. Scores of compiler engineers have worked on optimizing them to the best performance they could get.
In my experience, the reason people say to avoid virtual methods applies to the cases where you do not need them. Avoid the virtual keyword on methods that can be statically dispatched, and in hot spots in your code.
Every time you call an object's virtual method, the object's v-table is accessed (possibly hurting memory locality and causing a cache miss or two), then a pointer is dereferenced to get at the actual function address, and then the actual function call happens. Each call is only a tiny fraction of a second slower, but if you pay that cost enough times in a loop, it suddenly makes a difference.
When you call a statically dispatched method, none of the earlier operations happen. The actual function call just happens. If the function that calls and the one that is called are close to each other in memory, all caches can stay the way they are.
So, avoid virtual dispatch in tight loops in high-performance or low-CPU-power situations (you can, for example, switch on a member variable and call a method that contains the entire loop instead).
But there is the saying "premature optimization is the root of all evil". Measure performance beforehand. "Embedded" CPUs have become much faster and more powerful than those of a few years ago. Compilers for popular CPUs are better optimized than ones only just adapted to a new or exotic CPU. It may simply be that your compiler has an optimizer that alleviates any problems, or that your CPU is similar enough to a common desktop CPU to reap the benefits of work done for more popular CPUs. Or you may have more RAM etc. than the people who told you to avoid virtual calls.
So, profile, and if the profiler says it's fine, it's fine. Also make sure your tests are representative. Your test code may just be written in a way that a network request coming in pre-empted the switch statement and made it seem slower than it really was, or that the virtual method calls were benefiting from the cache loaded by the non-virtual calls.
Virtual function calling cost is 1.5x of normal function call (with test case)
I have to decide whether to use templates or virtual inheritance. In my situation, the trade-off makes it really hard to choose. Finally, it boiled down to "How much does virtual calling really cost (CPU)?"
I found very few resources that dare to measure the vtable cost in actual numbers, e.g. https://stackoverflow.com/a/158644, which points to page 26 of http://www.open-std.org/jtc1/sc22/wg21/docs/TR18015.pdf. Here is an excerpt from it:
However, this overhead (of virtual) is on the order of 20% and 12% – far less than the variability between compilers.
Before relying on that fact, I decided to test it myself. My test code is a little long (~ 40 lines); you can also see it in action in the links. The number is the ratio of the time that virtual calling used divided by normal calling. Unexpectedly, the result contradicts what open-std stated.
http://coliru.stacked-crooked.com/a/d4d161464e83933f : 1.58
http://rextester.com/GEZMC77067 (with custom -O2): 1.89
http://ideone.com/nmblnK : 2.79
My own desktop computer (Visual C++, -O2) : around 1.5
Here it is:
#include <iostream>
#include <chrono>
#include <vector>
#include <cstdlib>
using namespace std;
class B2{
public:
    int randomNumber=((double) rand() / (RAND_MAX))*10;
    virtual ~B2() = default;
    virtual int f(int n){return -n+randomNumber;}
    int g(int n){return -n+randomNumber;}
};
class C : public B2{
public:
    int f(int n) override {return n-randomNumber;}
};
int main() {
    std::vector<B2*> bs;
    const int numTest=1000000;
    for(int n=0;n<numTest;n++){
        if(((double) rand() / (RAND_MAX))>0.5){
            bs.push_back(new B2());
        }else{
            bs.push_back(new C());
        }
    };
    auto t1 = std::chrono::system_clock::now();
    int s=0;
    for(int n=0;n<numTest;n++){
        s+=bs[n]->f(n);
    };
    auto t2= std::chrono::system_clock::now();
    for(int n=0;n<numTest;n++){
        s+=bs[n]->g(n);
    };
    auto t3= std::chrono::system_clock::now();
    auto t21=t2-t1;
    auto t32=t3-t2;
    std::cout<<t21.count()<<" "<<t32.count()<<" ratio="<< (((float)t21.count())/t32.count()) << std::endl;
    std::cout<<s<<std::endl;
    for(int n=0;n<numTest;n++){
        delete bs[n];
    };
}
Question
Is it to be expected that virtual calling is at least 50% slower than normal calling? Did I test it in a wrong way?
I have also read:
AI Applications in C++: How costly are virtual functions? What are the possible optimizations?
Virtual functions and performance - C++
What's the cost of typeid?
I'm considering a type erasure setup that uses typeid to resolve the type like so...
#include <iostream>
#include <typeinfo>
struct BaseThing {
    virtual ~BaseThing() = 0;
};
// a pure virtual destructor still needs a definition; standard C++ requires it out of class
inline BaseThing::~BaseThing() {}
template<typename T>
struct Thing : public BaseThing {
    T x;
};
struct A{};
struct B{};
int main() {
    BaseThing* pThing = new Thing<B>();
    const std::type_info& x = typeid(*pThing);
    if( x == typeid(Thing<B>)) {
        std::cout << "pThing is a Thing<B>!\n";
        Thing<B>* pB = static_cast<Thing<B>*>(pThing);
    } else if( x == typeid(Thing<A>)) {
        std::cout << "pThing is a Thing<A>!\n";
        Thing<A>* pA = static_cast<Thing<A>*>(pThing);
    }
}
I've never seen anyone else do this. The alternative would be for BaseThing to have a pure virtual GetID() which would be used to deduce the type instead of using typeid. In this situation, with only 1 level of inheritance, what's the cost of typeid vs the cost of a virtual function call? I know typeid uses the vtable somehow, but how exactly does it work?
This would be desirable instead of GetID() because it takes quite a bit of hackery to try to make sure the IDs are unique and deterministic.
The alternative would be for BaseThing to have a pure virtual GetID() which would be used to deduce the type instead of using typeid. In this situation, with only 1 level of inheritance, what's the cost of typeid vs the cost of a virtual function call? I know typeid uses the vtable somehow, but how exactly does it work?
On Linux and Mac, or anything else using the Itanium C++ ABI, typeid(x) compiles into two load instructions: it simply loads the vptr (that is, the address of some vtable) from the first 8 bytes of object x, and then loads the -1th pointer from that vtable. That pointer is &typeid(x). This is one function call less expensive than calling a virtual method.
On Windows, it involves on the order of four load instructions and a couple of (negligible) ALU ops, because the Microsoft C++ ABI is a bit more enterprisey. (source) This might end up being on par with a virtual method call, honestly. But that's still dirt cheap compared to a dynamic_cast.
A dynamic_cast involves a function call into the C++ runtime, which has a lot of loads and conditional branches and such.
So yes, exploiting typeid will be much much faster than dynamic_cast. Will it be correct for your use-case? That's questionable. (See the other answers about Liskov substitutability and such.) But will it be fast? Yes.
Here, I took the toy benchmark code from Vaughn's highly-rated answer and made it into an actual benchmark, avoiding the obvious loop-hoisting optimization that borked all his timings.
Result, for libc++abi on my Macbook:
$ g++ test.cc -lbenchmark -std=c++14; ./a.out
Run on (4 X 2400 MHz CPU s)
2017-06-27 20:44:12
Benchmark                 Time           CPU Iterations
---------------------------------------------------------
bench_dynamic_cast    70407 ns      70355 ns       9712
bench_typeid          31205 ns      31185 ns      21877
bench_id_method       30453 ns      29956 ns      25039
$ g++ test.cc -lbenchmark -std=c++14 -O3; ./a.out
Run on (4 X 2400 MHz CPU s)
2017-06-27 20:44:27
Benchmark                 Time           CPU Iterations
---------------------------------------------------------
bench_dynamic_cast    57613 ns      57591 ns      11441
bench_typeid          12930 ns      12844 ns      56370
bench_id_method       20942 ns      20585 ns      33965
(Lower ns is better. You can ignore the latter two columns: "CPU" just shows that it's spending all its time running and no time waiting, and "Iterations" is just the number of runs it took to get a good margin of error.)
You can see that typeid thrashes dynamic_cast even at -O0, but when you turn on optimizations, it does even better, because the compiler can optimize any code that you write. All that ugly code hidden inside libc++abi's __dynamic_cast function can't be optimized by the compiler any more than it already has been, so turning on -O3 didn't help much.
Typically, you don't just want to know the type, but also do something with the object as that type. In that case, dynamic_cast is more useful:
int main() {
    BaseThing* pThing = new Thing<B>();
    if(Thing<B>* pThingB = dynamic_cast<Thing<B>*>(pThing)) {
        // Do something with pThingB
    }
    else if(Thing<A>* pThingA = dynamic_cast<Thing<A>*>(pThing)) {
        // Do something with pThingA
    }
}
I think this is why you rarely see typeid used in practice.
Update: Since this question concerns performance, I ran some benchmarks on g++ 4.5.1 with this code:
#include <typeinfo>   // std::type_info, required for typeid
struct Base {
    virtual ~Base() { }
    virtual int id() const = 0;
};
template <class T> struct Id;
template<> struct Id<int> { static const int value = 1; };
template<> struct Id<float> { static const int value = 2; };
template<> struct Id<char> { static const int value = 3; };
template<> struct Id<unsigned long> { static const int value = 4; };
template <class T>
struct Derived : Base {
    virtual int id() const { return Id<T>::value; }
};
static const int count = 100000000;
// test1: resolve the type with a chain of dynamic_casts
static int test1(Base *bp) {
    int total = 0;
    for (int iter=0; iter!=count; ++iter) {
        if (Derived<int>* dp = dynamic_cast<Derived<int>*>(bp)) {
            total += 5;
        }
        else if (Derived<float> *dp = dynamic_cast<Derived<float>*>(bp)) {
            total += 7;
        }
        else if (Derived<char> *dp = dynamic_cast<Derived<char>*>(bp)) {
            total += 2;
        }
        else if (
            Derived<unsigned long> *dp = dynamic_cast<Derived<unsigned long>*>(bp)
        ) {
            total += 9;
        }
    }
    return total;
}
// test2: resolve the type by comparing typeid results
static int test2(Base *bp) {
    int total = 0;
    for (int iter=0; iter!=count; ++iter) {
        const std::type_info& type = typeid(*bp);
        if (type==typeid(Derived<int>)) {
            total += 5;
        }
        else if (type==typeid(Derived<float>)) {
            total += 7;
        }
        else if (type==typeid(Derived<char>)) {
            total += 2;
        }
        else if (type==typeid(Derived<unsigned long>)) {
            total += 9;
        }
    }
    return total;
}
// test3: resolve the type through a virtual id() method
static int test3(Base *bp) {
    int total = 0;
    for (int iter=0; iter!=count; ++iter) {
        int id = bp->id();
        switch (id) {
            case 1: total += 5; break;
            case 2: total += 7; break;
            case 3: total += 2; break;
            case 4: total += 9; break;
        }
    }
    return total;
}
Without optimization, I got these runtimes:
test1: 2.277s
test2: 0.629s
test3: 0.469s
With optimization -O2, I got these runtimes:
test1: 0.118s
test2: 0.220s
test3: 0.290s
So it appears that dynamic_cast is the fastest method when using optimization with this compiler.
In almost all cases you don't want the exact type, but you want to make sure that it's of the given type or any type derived from it. If an object of a type derived from it cannot be substituted for an object of the type in question, then you are violating the Liskov Substitution Principle which is one of the most fundamental rules of proper OO design.
optimize output value using a class and public member
Suppose you have a function that you call many times, and every time it returns a big object. I've optimized the problem using a functor that returns void and stores the return value in a public member:
#include <vector>
const int N = 100;
std::vector<double> fun(const std::vector<double> & v, const int n) {
    std::vector<double> output = v;
    output[n] *= output[n];
    return output;
}
class F {
public:
    F() : output(N) {}
    std::vector<double> output;
    void operator()(const std::vector<double> & v, const int n) {
        output = v;
        output[n] *= n;
    }
};
int main() {
    std::vector<double> start(N,10.);
    std::vector<double> end(N);
    double a;
    // first solution
    for (unsigned long int i = 0; i != 10000000; ++i)
        a = fun(start, 2)[3];
    // second solution
    F f;
    for (unsigned long int i = 0; i != 10000000; ++i) {
        f(start, 2);
        a = f.output[3];
    }
}
Yes, I could inline the function or optimize this problem in some other way, but here I want to stress one point: with the functor I declare and construct the output variable only once, whereas with the function I do that every time it is called. The second solution is twice as fast as the first with g++ -O1 or g++ -O2. What do you think: is it an ugly optimization?
Edit: to clarify my aim. I have to evaluate the function more than 10M times, but I need the output only a few times, at random. It's important that the input is not changed, which is why I declared it as a const reference. In this example the input is always the same, but in the real world the input changes, and it is a function of the previous output of the function.
A more common approach is to create the object outside the function, with enough capacity reserved, and pass it to the function by pointer or by reference. You can then reuse the same object across several calls to your function, which avoids repeated memory allocation.
In both cases you are allocating a new vector many, many times. What you should do is pass both the input and the output objects to your class/function:
void fun(const std::vector<double> & in, const int n, std::vector<double> & out) {
    out[n] *= in[n];
}
This way you separate your logic from the algorithm. You create the std::vector once and pass it to the function as many times as you want. Notice that no unnecessary copy/allocation is made.
P.S. It's been a while since I did C++, so it may not compile right away.
It's not an ugly optimization. It's actually a fairly decent one.
I would, however, hide output and make an operator[] member to access its members. Why? Because you just might be able to perform a lazy evaluation optimization by moving all the math to that function, thus only doing that math when the client requests that value. Until the user asks for it, why do it if you don't need to?
Edit: Just checked the standard. Behavior of the assignment operator is based on insert(). Notes for that function state that an allocation occurs if the new size exceeds the current capacity. Of course this does not seem to explicitly disallow an implementation from reallocating even otherwise... I'm pretty sure you'll find none that do, and I'm sure the standard says something about it somewhere else. Thus you've improved speed by removing allocation calls.
You should still hide the internal vector. You'll have more chance to change the implementation if you use encapsulation. You could also return a reference (maybe const) to the vector from the function and retain the original syntax.
I played with this a bit, and came up with the code below. I keep thinking there's a better way to do this, but it's escaping me for now.
The key differences:
I'm allergic to public member variables, so I made output private, and put getters around it.
Having the operator return void isn't necessary for the optimization, so I have it return the value as a const reference so we can preserve return value semantics.
I took a stab at generalizing the approach into a templated base class, so you can then define derived classes for a particular return type, and not re-define the plumbing. This assumes the object you want to create takes a one-arg constructor, and the function you want to call takes in one additional argument. I think you'd have to define other templates if this varies.
Enjoy...
#include <vector>
template<typename T, typename ConstructArg, typename FuncArg>
class ReturnT {
public:
    ReturnT(ConstructArg arg): output(arg){}
    virtual ~ReturnT() {}
    const T& operator()(const T& in, FuncArg arg) {
        output = in;
        this->doOp(arg);
        return this->getOutput();
    }
    const T& getOutput() const {return output;}
protected:
    T& getOutput() {return output;}
private:
    virtual void doOp(FuncArg arg) = 0;
    T output;
};
class F : public ReturnT<std::vector<double>, std::size_t, const int> {
public:
    F(std::size_t size) : ReturnT<std::vector<double>, std::size_t, const int>(size) {}
private:
    virtual void doOp(const int n) {
        this->getOutput()[n] *= n;
    }
};
int main() {
    const int N = 100;
    std::vector<double> start(N,10.);
    double a;
    // second solution
    F f(N);
    for (unsigned long int i = 0; i != 10000000; ++i) {
        a = f(start, 2)[3];
    }
}
It seems quite strange (I mean the need for optimization at all). I think a decent compiler should perform return value optimization in such cases. Maybe all you need is to enable it.