I am trying to optimize the run time of my code and I was told that removing unnecessary virtual functions was the way to go. With that in mind I would still like to use inheritance to avoid unnecessary code bloat. I thought that if I simply redefined the functions I wanted and initialized different variable values I could get by with just downcasting to my derived class whenever I needed derived class specific behavior.
So I need a variable that identifies the type of class that I am dealing with so I can use a switch statement to downcast properly. I am using the following code to test this approach:
Classes.h
#pragma once
class A {
public:
int type;
static const int GetType() { return 0; }
A() : type(0) {}
};
class B : public A {
public:
int type;
static const int GetType() { return 1; }
B() { type = 1; }
};
Main.cpp
#include "Classes.h"
#include <iostream>
using std::cout;
using std::endl;
using std::getchar;
int main() {
A *a = new B();
cout << a->GetType() << endl;
cout << a->type;
getchar();
return 0;
}
I get the output expected: 0 1
Question 1: Is there a better way to store type so that I do not need to waste memory for each instance of the object created (like the static keyword would allow)?
Question 2: Would it be more effective to put the switch statement in the function to decide that it should do based on the type value, or switch statement -> downcast then use a derived class specific function.
Question 3: Is there a better way to handle this that I am entirely overlooking that does not use virtual functions? For Example, should I just create an entirely new class that has many of the same variables
Question 1: Is there a better way to store type so that I do not need to waste memory for each instance of the object created (like the static keyword would allow)?
typeid() is already available through RTTI, so there's no need to implement that yourself in an error-prone and unreliable way.
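One caveat worth seeing in code: typeid only reports the dynamic type for polymorphic classes, i.e. classes with at least one virtual function. A minimal sketch, where all class and function names are made up for illustration:

```cpp
#include <cassert>
#include <typeinfo>

// Hypothetical classes for illustration only.
struct Plain {};                      // no virtual functions
struct PlainChild : Plain {};

struct Poly { virtual ~Poly() {} };   // polymorphic: has a virtual destructor
struct PolyChild : Poly {};

// Without any virtual function, typeid uses the static type of the reference.
bool sees_dynamic_type_plain() {
    PlainChild c;
    Plain &ref = c;
    return typeid(ref) == typeid(PlainChild);
}

// With a virtual function, typeid looks up the dynamic type at run time.
bool sees_dynamic_type_poly() {
    PolyChild c;
    Poly &ref = c;
    return typeid(ref) == typeid(PolyChild);
}
```

So for the classes in the question, which have no virtual functions, typeid on an A* pointing at a B would still report A.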
Question 2: Would it be more effective to put the switch statement in the function to decide that it should do based on the type value, or switch statement -> downcast then use a derived class specific function.
Certainly not! That's a strong indicator of a badly designed class inheritance hierarchy.
Question 3: Is there a better way to handle this that I am entirely overlooking that does not use virtual functions? For Example, should I just create an entirely new class that has many of the same variables
The typical way to realize polymorphism without usage of virtual functions is the CRTP (aka Static Polymorphism).
That's a widely used technique to avoid the overhead of virtual function tables when you don't really need them and just want to adapt the code to your specific needs (e.g. on small targets, where low memory overhead is crucial).
Given your example¹, that would be something like this:
template<class Derived>
class A {
protected:
int InternalGetType() { return 0; }
public:
int GetType() { return static_cast<Derived*>(this)->InternalGetType(); }
};
class B : public A<B> {
friend class A<B>;
protected:
int InternalGetType() { return 1; }
};
All binding will be done at compile time, and there's zero runtime overhead.
Binding is also checked at compile time: the static_cast will produce a compiler error if B doesn't actually inherit from A<B>.
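To show how client code would use this, here is a small self-contained sketch. It repeats the pattern with made-up names (Base, TypeOne, TypeTwo and describe are not from the question), so treat it as an illustration rather than a drop-in design:

```cpp
#include <cassert>

// CRTP base: the derived class is known at compile time via the template
// parameter, so GetType() needs no vtable and no indirect call.
template <class Derived>
class Base {
public:
    int GetType() { return static_cast<Derived*>(this)->InternalGetType(); }
};

// Hypothetical derived classes for illustration only.
class TypeOne : public Base<TypeOne> {
    friend class Base<TypeOne>;       // lets the base reach the protected hook
protected:
    int InternalGetType() { return 1; }
};

class TypeTwo : public Base<TypeTwo> {
    friend class Base<TypeTwo>;
protected:
    int InternalGetType() { return 2; }
};

// Generic code is written as a template over the derived type.
template <class T>
int describe(Base<T> &obj) {
    return obj.GetType();
}
```

Note the trade-off: Base<TypeOne> and Base<TypeTwo> are unrelated types, so you cannot hold both behind a single base pointer or put them in one container the way you can with virtual functions.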
Note (almost disclaimer):
Don't use that pattern as a golden hammer! It has its drawbacks too:
It's harder to provide abstract interfaces, and without prior type trait checks or concepts, you'll confuse your clients with hard-to-read compiler error messages at template instantiation.
It's not applicable to plugin-like architecture models, where you really want late binding and modules loaded at runtime.
If you don't have really heavy restrictions on executable code size and performance, it's not worth the extra work. For most systems you can simply neglect the dispatch overhead of virtual function definitions.
1) The semantics of GetType() aren't necessarily the best, but well ...
Go ahead and use virtual functions, but make sure each of those functions is doing enough work that the overhead of an indirect call is insignificant. That shouldn't be very hard to do, a virtual call is pretty fast - it wouldn't be part of C++ if it wasn't.
Doing your own pointer casting is likely to be even slower, unless you can use that pointer a significant number of times.
To make this a little more concrete, here's some code:
#include <cstdio>   // getchar

class A {
public:
    int type;
    int buffer[1000000];
    A() : type(0), buffer() {}   // buffer() zero-initializes the array
    virtual void VirtualIncrease(int n) { buffer[n] += 1; }
    void NonVirtualIncrease(int n) { buffer[n] += 1; }
    virtual void IncreaseAll() { for (int i = 0; i < 1000000; ++i) buffer[i] += 1; }
};

class B : public A {
public:
    B() { type = 1; }
    virtual void VirtualIncrease(int n) { buffer[n] += 2; }
    void NonVirtualIncrease(int n) { buffer[n] += 2; }
    virtual void IncreaseAll() { for (int i = 0; i < 1000000; ++i) buffer[i] += 2; }
};

int main() {
    A *a = new B();
    // easy way with virtual
    for (int i = 0; i < 1000000; ++i)
        a->VirtualIncrease(i);
    // hard way with switch
    for (int i = 0; i < 1000000; ++i) {
        switch (a->type) {
        case 0:
            a->NonVirtualIncrease(i);
            break;
        case 1:
            static_cast<B*>(a)->NonVirtualIncrease(i);
            break;
        }
    }
    // fast way
    a->IncreaseAll();
    getchar();
    return 0;
}
The code that switches using a type code is not only much harder to read, it's probably slower as well. Doing more work inside a virtual function ends up being both cleanest and fastest.
I am in the process of refactoring a C++ OpenGL app I made (technically, an app that makes heavy use of the thin OpenGL wrapper in Qt's QQuickItem class). My app is performing OK but could be better.
One of the issues I'm curious about relates to the use of virtual functions in very time-sensitive (frame rate) algorithms. My OpenGL drawing code calls many virtual functions on the various objects that need drawing. Since this happens many times per second, I am wondering if the virtual dispatch could bring down the frame rate.
I am thinking about changing to this structure instead which avoids inheritance by keeping everything in one base class, but where the previously virtual functions now simply contain switch statements to call the appropriate routine based on the class's "type" which is really just a typedef enum:
Previously:
struct Base {
    virtual void a() = 0;
    virtual void b() = 0;
};
struct One : public Base {
    void a() {...}
    void b() {...}
};
Considering:
struct Combined {
    MyEnumTypeDef t; // essentially holds what "type" of object this is
    void a() {
        switch (t) {
        case One:
            ....
            break;
        case Two:
            ....
            break;
        }
    }
};
When function a() is called very often in OpenGL drawing routines, I am tempted to think that the Combined class will be considerably more efficient since it does not require the dynamic dispatch on virtual tables.
I'd appreciate some advice on this issue, if this is wise or not.
In your case, it probably does not matter. I say probably because, and I mean this constructively, you did not specify your performance requirements or how often the function in question is called, which suggests you may not yet have enough information to make a judgment. The blanket "don't speculate: profile" response is really only meant to make sure you have all the necessary information, because premature micro-optimizations are very common and our real goal is to help you with the big picture.
Jeremy Friesner really hit the nail on the head with his comment on another answer here:
If you don't understand why it is slow you won't be able to speed it up.
So, with all that in mind, assuming that either A) your performance requirements are already being met (e.g. you are pulling 4000 FPS, far higher than any display refresh rate), or B) you are struggling to meet performance requirements and this function is only called a few (say < 1000-ish) times per frame, or C) you are struggling to meet performance requirements and this function is called often but does a lot of other significant work (and thus function call overhead is negligible), then:
Using a virtual function may, at most, end up in one extra lookup in a table somewhere (and possibly some cache misses - but not so much if it is accessed repeatedly in, say, an inner loop), which is a few CPU clock cycles worst case (and most likely still less than a switch, although that is really moot here), and this is completely insignificant compared to your target frame rate, the effort required to render a frame, and any other algorithms and logic you are performing. If you'd like to prove it to yourself, profile.
What you should do is use whatever technique leads to the clearest, cleanest, most maintainable and readable code. Micro-optimizations such as this are not going to have an effect, and the cost in code maintainability, even if it is minor, is not worth the benefit, which is essentially zero.
What you should also do is sit down and get a handle on your actual situation. Do you need to improve performance? Is this function called enough to actually have a significant impact or should you be concentrating on other techniques (e.g. higher level algorithms, other design strategies, off-loading computation to the GPU or using machine-specific optimizations e.g. bulk operations with SSE, etc.)?
One thing you could do, in the absence of concrete information, is try both approaches. While performance will differ from machine to machine, you could at least get a rough idea of the impact of this particular bit of code on your overall performance (e.g. if you're shooting for 60 FPS, and these two options give you 23.2 FPS vs. 23.6 FPS, then this isn't where you want to focus, and possible sacrifices made by choosing one of these strategies over the other may not be worth it).
Also consider using display lists, vertex index buffers, etc. OpenGL provides many facilities for optimizing drawing of objects where certain aspects remain constant. For example, if you have a huge surface model with small parts whose vertex coordinates change often, divide the model into sections, using display lists, and only update the display list for a section when it has changed since the last redraw. Leave e.g. coloring and texturing out of the display list (or use coordinate arrays) if they change often. That way you can avoid calling your functions altogether.
If you're curious, here is a test program (the one requested in the comments below). It probably does not represent your actual usage, which, again, cannot be determined from the information given, so these results will not necessarily be reflected in your program; you still need concrete information about your actual requirements. But here it is, just for giggles:
This test program compares a switch-based operation vs. a virtual-function based operation vs. pointer-to-member (where member is called from another class member function) vs. pointer-to-member (where member is called directly from test loop). It also performs three types of tests: A run on a dataset with just one operator, a run that alternates back and forth between two operators, and a run that uses a random mix of two operators.
Output when compiled with gcc -O0, for 1,000,000,000 iterations:
$ g++ -O0 tester.cpp
$ ./a.out
--------------------
Test: time=6.34 sec (switch add) [-358977076]
Test: time=6.44 sec (switch subtract) [358977076]
Test: time=6.96 sec (switch alternating) [-281087476]
Test: time=18.98 sec (switch mixed) [-314721196]
Test: time=6.11 sec (virtual add) [-358977076]
Test: time=6.19 sec (virtual subtract) [358977076]
Test: time=7.88 sec (virtual alternating) [-281087476]
Test: time=19.80 sec (virtual mixed) [-314721196]
Test: time=10.96 sec (ptm add) [-358977076]
Test: time=10.83 sec (ptm subtract) [358977076]
Test: time=12.53 sec (ptm alternating) [-281087476]
Test: time=24.24 sec (ptm mixed) [-314721196]
Test: time=6.94 sec (ptm add (direct)) [-358977076]
Test: time=6.89 sec (ptm subtract (direct)) [358977076]
Test: time=9.12 sec (ptm alternating (direct)) [-281087476]
Test: time=21.19 sec (ptm mixed (direct)) [-314721196]
Output when compiled with gcc -O3, for 1,000,000,000 iterations:
$ g++ -O3 tester.cpp ; ./a.out
--------------------
Test: time=0.87 sec (switch add) [372023620]
Test: time=1.28 sec (switch subtract) [-372023620]
Test: time=1.29 sec (switch alternating) [101645020]
Test: time=7.71 sec (switch mixed) [855607628]
Test: time=2.95 sec (virtual add) [372023620]
Test: time=2.95 sec (virtual subtract) [-372023620]
Test: time=14.74 sec (virtual alternating) [101645020]
Test: time=9.39 sec (virtual mixed) [855607628]
Test: time=4.20 sec (ptm add) [372023620]
Test: time=4.21 sec (ptm subtract) [-372023620]
Test: time=13.11 sec (ptm alternating) [101645020]
Test: time=9.32 sec (ptm mixed) [855607628]
Test: time=3.37 sec (ptm add (direct)) [372023620]
Test: time=3.37 sec (ptm subtract (direct)) [-372023620]
Test: time=13.08 sec (ptm alternating (direct)) [101645020]
Test: time=9.74 sec (ptm mixed (direct)) [855607628]
Note that -O3 does a lot, and without looking at the assembler, we cannot use this as a 100% accurate representation of the problem at hand.
In the unoptimized case, we notice:
Virtual outperforms switch in runs of a single operator.
Switch outperforms virtual in cases where multiple operators are used.
Pointer-to-member when the member is called directly (object->*ptm_) is similar to, but slower than, virtual.
Pointer-to-member when the member is called through another method (object->doit() where doit() calls this->*ptm_) takes a little under twice the time.
As expected, "mixed" case performance suffers due to branch prediction failures.
In the optimized case:
Switch outperforms virtual in all cases.
Similar characteristics for pointer-to-member as unoptimized case.
All "alternating" cases that involve function pointers at some point are slower than with -O0 and slower than "mixed" for reasons I do not understand. This does not occur on my PC at home.
What is especially significant here is how much the effects of e.g. branch prediction outweigh any choice of "virtual" vs. "switch". Again, be sure you understand your code and are optimizing the right thing.
The other significant thing here is this represents time differences on the order of 1-14 nanoseconds per operation. This difference could be significant for large numbers of operations, but is likely negligible compared to other things you are doing (note that these functions perform only a single arithmetic operation, any more than that will quickly dwarf the effects of virtual vs. switch).
Note also that while calling the pointer-to-member directly showed an "improvement" over calling it through another class member, this has potentially large impacts on overall design as such an implementation (at least in this case, where something outside the class calls the member directly) could not be dropped in as a direct replacement for another implementation due to different syntax for calling pointer-to-member functions (-> vs. ->*). I had to create a whole separate set of test cases just to handle that, for example.
Conclusion
The performance difference will easily be dwarfed by even a couple of extra arithmetic operations. Note also that branch prediction has a far more significant impact in all but the "virtual alternating" case with -O3. However, the test is also unlikely to be representative of the actual application (which the OP has kept a secret), and -O3 introduces even more variables, so the results must be taken with a grain of salt and are unlikely to be applicable to other scenarios (in other words, the test may be interesting, but it is not particularly meaningful).
Source:
// === begin timing ===
#ifdef __linux__
# include <sys/time.h>
typedef struct timeval Time;
static void tick (Time &t) {
gettimeofday(&t, 0);
}
static double delta (const Time &a, const Time &b) {
return
(double)(b.tv_sec - a.tv_sec) +
(double)(b.tv_usec - a.tv_usec) / 1000000.0;
}
#else // windows; untested, working from memory; sorry for compile errors
# include <windows.h>
typedef LARGE_INTEGER Time;
static void tick (Time &t) {
QueryPerformanceCounter(&t);
}
static double delta (const Time &a, const Time &b) {
LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);
return (double)(b.QuadPart - a.QuadPart) / (double)freq.QuadPart;
}
#endif
// === end timing
#include <cstdio>
#include <cstdlib>
#include <ctime>
using namespace std;
// Size of dataset.
static const size_t DATASET_SIZE = 10000000;
// Repetitions per test.
static const unsigned REPETITIONS = 100;
// Class performs operations with a switch statement.
class OperatorSwitch {
public:
enum Op { Add, Subtract };
explicit OperatorSwitch (Op op) : op_(op) { }
int perform (int a, int b) const {
switch (op_) {
case Add: return a + b;
case Subtract: return a - b;
}
}
private:
Op op_;
};
// Class performs operations with pointer-to-member.
class OperatorPTM {
public:
enum Op { Add, Subtract };
explicit OperatorPTM (Op op) {
perform_ = (op == Add) ?
&OperatorPTM::performAdd :
&OperatorPTM::performSubtract;
}
int perform (int a, int b) const { return (this->*perform_)(a, b); }
int performAdd (int a, int b) const { return a + b; }
int performSubtract (int a, int b) const { return a - b; }
//private:
int (OperatorPTM::*perform_) (int, int) const;
};
// Base class for virtual-function test operator.
class OperatorBase {
public:
virtual ~OperatorBase () { }
virtual int perform (int a, int b) const = 0;
};
// Addition
class OperatorAdd : public OperatorBase {
public:
int perform (int a, int b) const { return a + b; }
};
// Subtraction
class OperatorSubtract : public OperatorBase {
public:
int perform (int a, int b) const { return a - b; }
};
// No base
// Addition
class OperatorAddNoBase {
public:
int perform (int a, int b) const { return a + b; }
};
// Subtraction
class OperatorSubtractNoBase {
public:
int perform (int a, int b) const { return a - b; }
};
// Processes the dataset a number of times, using 'oper'.
template <typename T>
static void test (const int *dataset, const T *oper, const char *name) {
int result = 0;
Time start, stop;
tick(start);
for (unsigned n = 0; n < REPETITIONS; ++ n)
for (size_t i = 0; i < DATASET_SIZE; ++ i)
result = oper->perform(result, dataset[i]);
tick(stop);
// result is computed and printed so optimizations do not discard it.
printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
fflush(stdout);
}
// Processes the dataset a number of times, alternating between 'oper[0]'
// and 'oper[1]' per element.
template <typename T>
static void testalt (const int *dataset, const T * const *oper, const char *name) {
int result = 0;
Time start, stop;
tick(start);
for (unsigned n = 0; n < REPETITIONS; ++ n)
for (size_t i = 0; i < DATASET_SIZE; ++ i)
result = oper[i&1]->perform(result, dataset[i]);
tick(stop);
// result is computed and printed so optimizations do not discard it.
printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
fflush(stdout);
}
// Processes the dataset a number of times, choosing between 'oper[0]'
// and 'oper[1]' randomly (based on value in dataset).
template <typename T>
static void testmix (const int *dataset, const T * const *oper, const char *name) {
int result = 0;
Time start, stop;
tick(start);
for (unsigned n = 0; n < REPETITIONS; ++ n)
for (size_t i = 0; i < DATASET_SIZE; ++ i) {
int d = dataset[i];
result = oper[d&1]->perform(result, d);
}
tick(stop);
// result is computed and printed so optimizations do not discard it.
printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
fflush(stdout);
}
// Same as test() but calls perform_() pointer directly.
static void test_ptm (const int *dataset, const OperatorPTM *oper, const char *name) {
int result = 0;
Time start, stop;
tick(start);
for (unsigned n = 0; n < REPETITIONS; ++ n)
for (size_t i = 0; i < DATASET_SIZE; ++ i)
result = (oper->*(oper->perform_))(result, dataset[i]);
tick(stop);
// result is computed and printed so optimizations do not discard it.
printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
fflush(stdout);
}
// Same as testalt() but calls perform_() pointer directly.
static void testalt_ptm (const int *dataset, const OperatorPTM * const *oper, const char *name) {
int result = 0;
Time start, stop;
tick(start);
for (unsigned n = 0; n < REPETITIONS; ++ n)
for (size_t i = 0; i < DATASET_SIZE; ++ i) {
const OperatorPTM *op = oper[i&1];
result = (op->*(op->perform_))(result, dataset[i]);
}
tick(stop);
// result is computed and printed so optimizations do not discard it.
printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
fflush(stdout);
}
// Same as testmix() but calls perform_() pointer directly.
static void testmix_ptm (const int *dataset, const OperatorPTM * const *oper, const char *name) {
int result = 0;
Time start, stop;
tick(start);
for (unsigned n = 0; n < REPETITIONS; ++ n)
for (size_t i = 0; i < DATASET_SIZE; ++ i) {
int d = dataset[i];
const OperatorPTM *op = oper[d&1];
result = (op->*(op->perform_))(result, d);
}
tick(stop);
// result is computed and printed so optimizations do not discard it.
printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
fflush(stdout);
}
int main () {
int *dataset = new int[DATASET_SIZE];
srand(time(NULL));
for (size_t n = 0; n < DATASET_SIZE; ++ n)
dataset[n] = rand();
OperatorSwitch *switchAdd = new OperatorSwitch(OperatorSwitch::Add);
OperatorSwitch *switchSub = new OperatorSwitch(OperatorSwitch::Subtract);
OperatorSwitch *switchAlt[2] = { switchAdd, switchSub };
OperatorBase *virtAdd = new OperatorAdd();
OperatorBase *virtSub = new OperatorSubtract();
OperatorBase *virtAlt[2] = { virtAdd, virtSub };
OperatorPTM *ptmAdd = new OperatorPTM(OperatorPTM::Add);
OperatorPTM *ptmSub = new OperatorPTM(OperatorPTM::Subtract);
OperatorPTM *ptmAlt[2] = { ptmAdd, ptmSub };
while (true) {
printf("--------------------\n");
test(dataset, switchAdd, "switch add");
test(dataset, switchSub, "switch subtract");
testalt(dataset, switchAlt, "switch alternating");
testmix(dataset, switchAlt, "switch mixed");
test(dataset, virtAdd, "virtual add");
test(dataset, virtSub, "virtual subtract");
testalt(dataset, virtAlt, "virtual alternating");
testmix(dataset, virtAlt, "virtual mixed");
test(dataset, ptmAdd, "ptm add");
test(dataset, ptmSub, "ptm subtract");
testalt(dataset, ptmAlt, "ptm alternating");
testmix(dataset, ptmAlt, "ptm mixed");
test_ptm(dataset, ptmAdd, "ptm add (direct)");
test_ptm(dataset, ptmSub, "ptm subtract (direct)");
testalt_ptm(dataset, ptmAlt, "ptm alternating (direct)");
testmix_ptm(dataset, ptmAlt, "ptm mixed (direct)");
}
}
The model of "lots of objects that draw themselves" is attractive, but bad in a sneaky way. The problem is not the virtual function call overhead (which exists, but is small); it is that the model encourages a rendering anti-pattern: letting every object draw itself in isolation. It sounds like one of those things touted as "software engineering best practices", but it's not; it's very bad. Every object would make a lot of expensive API calls (such as binding shaders and textures). Now, I don't really know what your code looks like, and maybe it doesn't work like this. The objects aren't necessarily bad; it's how they're used.
Anyway, here are some suggestions.
Sort your objects by the state (shader, texture, vertex buffer, in that order) they want (actually, don't sort: put them in buckets and iterate over those). This is easy, everyone does it, and it may be enough.
Merge states, so there's nothing to switch between. Use übershaders. Use texture arrays, or better, bindless textures (which don't have the problem that all slices have to be the same format/size/etc). Use a huge vertex buffer that you put everything into. Use uniform buffers. Use persistent mapping for the dynamic buffers.
And finally, glMultiDrawElementsIndirect. If you've put everything into buffers anyway, as per the previous suggestion, then you will need only very few calls to glMultiDrawElementsIndirect. Very few, you can do a lot with just one call. What you'd probably use otherwise is a bunch of glDrawArrays with no binding in between, which isn't bad either, but it's not much effort to make it even better.
The end result is that the actual drawing code has almost disappeared. Almost all API calls are gone, replaced by writes to buffers.
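The bucketing idea from the first suggestion can be sketched roughly as follows. DrawItem, StateKey and bucketByState are names I made up for illustration, the handles are plain integers standing in for GL object names, and the actual bind and draw calls are deliberately left out:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical per-object draw record; the fields are assumptions.
struct DrawItem {
    unsigned shader;
    unsigned texture;
    unsigned vertexBuffer;
    int firstVertex;
    int vertexCount;
};

// Key ordered as suggested above: shader, then texture, then vertex buffer.
typedef std::tuple<unsigned, unsigned, unsigned> StateKey;
typedef std::map<StateKey, std::vector<DrawItem> > Buckets;

// Groups items that want identical state, so state is bound once per bucket
// instead of once per object.
Buckets bucketByState(const std::vector<DrawItem> &items) {
    Buckets buckets;
    for (std::size_t i = 0; i < items.size(); ++i) {
        const DrawItem &it = items[i];
        buckets[StateKey(it.shader, it.texture, it.vertexBuffer)].push_back(it);
    }
    return buckets;
}
```

A render loop would then walk the buckets, bind the state named by each key once, and issue the draws for that bucket's items back to back.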
It would be faster to not use virtual functions, but whether the difference is significant is hard to say. You should run your program through a profiler to see where it's spending its time. You might discover that the cpu cycles are spent on something entirely different and you'd be wasting your time (and degrading your design) by messing with the virtuals.
Unix: How can I profile C++ code running in Linux?
Windows: What's the best free C++ profiler for Windows?
Another option to consider might be to use the Curiously Recurring Template Pattern: http://en.wikipedia.org/wiki/Curiously_recurring_template_pattern to get similar polymorphism without using virtuals.
So suppose I want to make a series of classes that each have a member function with the same name. Let's call the function
void doYourJob();
I want to eventually put all these classes into the same container so that I can loop through them and have each perform 'doYourJob()'
The obvious solution is to make an abstract class with the function
virtual void doYourJob();
but I'm hesitant to do so. This is a time-expensive program and a virtual function would slime it up considerably. Also, this function is the only thing the classes have in common with each other, and doYourJob is implemented completely differently for each class.
Is there a way to avoid using an abstract class with a virtual function or am I going to have to suck it up?
If you need the speed, consider embedding a "type(-identifying) number" in the objects, and using a switch statement to select the type-specific code. This can avoid function call overhead completely - just doing a local jump. You won't get faster than that. A cost (in terms of maintainability, recompilation dependencies etc) is in forcing localisation (in the switch) of the type-specific functionality.
IMPLEMENTATION
#include <iostream>
#include <vector>
#include <time.h>   // clock_gettime, CLOCK_MONOTONIC (POSIX)
// virtual dispatch model...
struct Base
{
virtual int f() const { return 1; }
};
struct Derived : Base
{
virtual int f() const { return 2; }
};
// alternative: member variable encodes runtime type...
struct Type
{
Type(int type) : type_(type) { }
int type_;
};
struct A : Type
{
A() : Type(1) { }
int f() const { return 1; }
};
struct B : Type
{
B() : Type(2) { }
int f() const { return 2; }
};
struct Timer
{
Timer() { clock_gettime(CLOCK_MONOTONIC, &from); }
struct timespec from;
double elapsed() const
{
struct timespec to;
clock_gettime(CLOCK_MONOTONIC, &to);
return to.tv_sec - from.tv_sec + 1E-9 * (to.tv_nsec - from.tv_nsec);
}
};
int main()
{
for (int j = 0; j < 3; ++j)
{
typedef std::vector<Base*> V;
V v;
for (int i = 0; i < 1000; ++i)
v.push_back(i % 2 ? new Base : (Base*)new Derived);
int total = 0;
Timer tv;
for (int i = 0; i < 100000; ++i)
for (V::const_iterator i = v.begin(); i != v.end(); ++i)
total += (*i)->f();
double tve = tv.elapsed();
std::cout << "virtual dispatch: " << total << ' ' << tve << '\n';
// ----------------------------
typedef std::vector<Type*> W;
W w;
for (int i = 0; i < 1000; ++i)
w.push_back(i % 2 ? (Type*)new A : (Type*)new B);
total = 0;
Timer tw;
for (int i = 0; i < 100000; ++i)
for (W::const_iterator i = w.begin(); i != w.end(); ++i)
{
if ((*i)->type_ == 1)
total += ((A*)(*i))->f();
else
total += ((B*)(*i))->f();
}
double twe = tw.elapsed();
std::cout << "switched: " << total << ' ' << twe << '\n';
// ----------------------------
total = 0;
Timer tw2;
for (int i = 0; i < 100000; ++i)
for (W::const_iterator i = w.begin(); i != w.end(); ++i)
total += (*i)->type_;
double tw2e = tw2.elapsed();
std::cout << "overheads: " << total << ' ' << tw2e << '\n';
}
}
PERFORMANCE RESULTS
On my Linux system:
~/dev g++ -O2 -o vdt vdt.cc -lrt
~/dev ./vdt
virtual dispatch: 150000000 1.28025
switched: 150000000 0.344314
overhead: 150000000 0.229018
virtual dispatch: 150000000 1.285
switched: 150000000 0.345367
overhead: 150000000 0.231051
virtual dispatch: 150000000 1.28969
switched: 150000000 0.345876
overhead: 150000000 0.230726
This suggests an inline type-number-switched approach is about (1.28 - 0.23) / (0.344 - 0.23) = 9.2 times as fast. Of course, that's specific to the exact system tested / compiler flags & version etc., but generally indicative.
COMMENTS RE VIRTUAL DISPATCH
It must be said, though, that virtual function call overheads are rarely significant, and then only for oft-called trivial functions (like getters and setters). Even then, you might be able to provide a single function to get and set a whole lot of things at once, minimising the cost. People worry about virtual dispatch way too much, so do the profiling before reaching for awkward alternatives. The main issue with virtual functions is that they perform an out-of-line function call, though they also delocalise the code executed, which changes the cache utilisation patterns (for better or (more often) worse).
Virtual functions don't cost much. They are an indirect call, basically like a function pointer.
What is the performance cost of having a virtual method in a C++ class?
If you're in a situation where every cycle per call counts, that is you're doing very little work in the function call and you're calling it from your inner loop in a performance critical application you probably need a different approach altogether.
I'm afraid that a series of dynamic_cast checks in a loop would slime up performance worse than a virtual function. If you're going to throw them all in one container, they need to have some type in common, so you may as well make it a pure-virtual base class with that method in it.
There's not all that much to the virtual function dispatch in that context: a vtable lookup, an adjustment of the supplied this pointer, and an indirect call.
If performance is that critical, you might be able to use a separate container for each subtype and process each container independently. If order matters, you'd be doing so many backflips that the virtual dispatch is probably faster.
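A rough sketch of the separate-container idea, assuming two unrelated classes that share nothing but a doYourJob() member (Adder, Doubler and Workers are invented names, not from the question):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, unrelated classes: no common base, no virtual functions.
struct Adder   { int value; void doYourJob() { value += 1; } };
struct Doubler { int value; void doYourJob() { value *= 2; } };

// One homogeneous container per type. Each loop makes direct calls that the
// compiler is free to inline. The relative order between the types is lost.
struct Workers {
    std::vector<Adder> adders;
    std::vector<Doubler> doublers;

    void runAll() {
        for (std::size_t i = 0; i < adders.size(); ++i)
            adders[i].doYourJob();
        for (std::size_t i = 0; i < doublers.size(); ++i)
            doublers[i].doYourJob();
    }
};
```

This also stores the objects by value, contiguously, which tends to be kinder to the cache than a container of pointers. The price is one container and one loop per type.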
If you're going to store all of these objects in the same container, then either you're going to have to write a heterogeneous container type (slow and expensive), you're going to have to store a container of void *s (yuck!), or the classes are going to have to be related to each other via inheritance. If you opt to go with either of the first two options, you'll have to have some logic in place to look at each element in the container, figure out what type it is, and then call the appropriate doYourJob() implementation, which essentially boils down to inheritance.
I strongly suggest trying out the simple, straightforward approach of using inheritance first. If this is fast enough, that's great! You're done. If it isn't, then try using some other scheme. Never avoid a useful language feature because of the cost unless you have some good hard proof to suggest that the cost is too great.