In my class design, I use abstract classes and virtual functions extensively. I had a feeling that virtual functions affect performance. Is this true? But I think the difference is not noticeable, so it looks like I am doing premature optimization. Right?
Your question made me curious, so I went ahead and ran some timings on the 3GHz in-order PowerPC CPU we work with. The test I ran was to make a simple 4d vector class with get/set functions
class TestVec
{
float x, y, z, w;
public:
float GetX() { return x; }
float SetX(float to) { return x = to; } // and so on for the other three
};
Then I set up three arrays each containing 1024 of these vectors (small enough to fit in L1) and ran a loop that added them to one another (A.x = B.x + C.x) 1000 times. I ran this with the functions defined as inline, virtual, and regular function calls. Here are the results:
inline: 8ms (0.65ns per call)
direct: 68ms (5.53ns per call)
virtual: 160ms (13ns per call)
So, in this case (where everything fits in cache) the virtual function calls were about 20x slower than the inline calls. But what does this really mean? Each trip through the loop caused exactly 3 * 4 * 1024 = 12,288 function calls (1024 vectors times four components times three calls per add), so these times represent 1000 * 12,288 = 12,288,000 function calls. The virtual loop took 92ms longer than the direct loop, so the additional overhead per call was 7 nanoseconds per function.
From this I conclude: yes, virtual functions are much slower than direct functions, and no, unless you're planning on calling them ten million times per second, it doesn't matter.
See also: comparison of the generated assembly.
A good rule of thumb is:
It's not a performance problem until you can prove it.
The use of virtual functions will have a very slight effect on performance, but it's unlikely to affect the overall performance of your application. Better places to look for performance improvements are in algorithms and I/O.
An excellent article that talks about virtual functions (and more) is Member Function Pointers and the Fastest Possible C++ Delegates.
When Objective-C (where all methods are virtual) is the primary language for the iPhone and freakin' Java is the main language for Android, I think it's pretty safe to use C++ virtual functions on our 3 GHz dual-core towers.
In very performance critical applications (like video games) a virtual function call can be too slow. With modern hardware, the biggest performance concern is the cache miss. If data isn't in the cache, it may be hundreds of cycles before it's available.
A normal function call can generate an instruction cache miss when the CPU fetches the first instruction of the new function and it's not in the cache.
A virtual function call first needs to load the vtable pointer from the object. This can result in a data cache miss. Then it loads the function pointer from the vtable which can result in another data cache miss. Then it calls the function which can result in an instruction cache miss like a non-virtual function.
In many cases, two extra cache misses are not a concern, but in a tight loop on performance critical code it can dramatically reduce performance.
From page 44 of Agner Fog's "Optimizing Software in C++" manual:
The time it takes to call a virtual member function is a few clock cycles more than it takes to call a non-virtual member function, provided that the function call statement always calls the same version of the virtual function. If the version changes then you will get a misprediction penalty of 10 - 30 clock cycles. The rules for prediction and misprediction of virtual function calls is the same as for switch statements...
Absolutely. It was a problem way back when computers ran at 100 MHz, as every method call required a vtable lookup before the function could be called. But today, on a 3 GHz CPU whose first-level cache holds more memory than my first computer had? Not at all. Allocating memory from main RAM will cost you more time than having all your functions be virtual.
It's like the old, old days when people said structured programming was slow because all the code was split into functions, and each function required a stack allocation and a function call!
The only time I would even consider the performance impact of a virtual function is if it were very heavily used and instantiated in templated code that ended up throughout everything. Even then, I wouldn't spend too much effort on it!
PS: think of other 'easy to use' languages - all their methods are virtual under the covers, and they don't exactly crawl nowadays.
There's another performance criterion besides execution time: a vtable takes up memory as well, and in some cases it can be avoided. ATL uses compile-time "simulated dynamic binding" with templates to get the effect of "static polymorphism": you pass the derived class as a template parameter to its base class, so at compile time the base class "knows" what its derived class is in each instance. This won't let you store several different derived classes in a collection of base types (that's run-time polymorphism), but in a static sense, if you want to make a class Y that behaves like a preexisting template class X which has the hooks for this kind of overriding, you only need to override the methods you care about, and you get the base methods of class X without needing a vtable.
In classes with large memory footprints, the cost of a single vtable pointer is not much, but some of the ATL classes in COM are very small, and it's worth the vtable savings if the run-time polymorphism case is never going to occur.
See also this other SO question.
By the way here's a posting I found that talks about the CPU-time performance aspects.
Yes, you're right, and if you're curious about the cost of a virtual function call you might find this post interesting.
The only way I can see a virtual function becoming a performance problem is if many virtual functions are called within a tight loop, and only if they cause a page fault or some other "heavy" memory operation to occur.
Though, as other people have said, it's pretty much never going to be a problem for you in real life. If you think it is, run a profiler and do some tests; verify that this really is a problem before trying to "undesign" your code for a performance benefit.
When a class method is not virtual, the compiler can usually inline the call. By contrast, when you call through a pointer to a class with virtual functions, the real address is known only at runtime.
This is well illustrated by the following test; the time difference is roughly 700% (!):
#include <time.h>
#include <stdio.h> // printf
class Direct
{
public:
int Perform(int &ia) { return ++ia; }
};
class AbstrBase
{
public:
virtual int Perform(int &ia)=0;
};
class Derived: public AbstrBase
{
public:
virtual int Perform(int &ia) { return ++ia; }
};
int main(int argc, char* argv[])
{
Direct *pdir, dir;
pdir = &dir;
int ia=0;
double start = clock();
while( pdir->Perform(ia) );
double end = clock();
printf( "Direct %.3f, ia=%d\n", (end-start)/CLOCKS_PER_SEC, ia );
Derived drv;
AbstrBase *ab = &drv;
ia=0;
start = clock();
while( ab->Perform(ia) );
end = clock();
printf( "Virtual: %.3f, ia=%d\n", (end-start)/CLOCKS_PER_SEC, ia );
return 0;
}
The impact of a virtual function call depends heavily on the situation.
If there are few calls and a significant amount of work inside the function, it can be negligible.
But when a virtual call is made repeatedly, many times, to do some simple operation, it can be really significant.
I've gone back and forth on this at least 20 times on my particular project. Although there can be some great gains in terms of code reuse, clarity, maintainability, and readability, on the other hand, performance hits still do exist with virtual functions.
Is the performance hit going to be noticeable on a modern laptop/desktop/tablet... probably not! However, in certain cases with embedded systems, the performance hit may be the driving factor in your code's inefficiency, especially if the virtual function is called over and over again in a loop.
Here's a somewhat dated paper that analyzes best practices for C/C++ in the embedded-systems context: http://www.open-std.org/jtc1/sc22/wg21/docs/ESC_Boston_01_304_paper.pdf
To conclude: it's up to the programmer to understand the pros/cons of using a certain construct over another. Unless you're super performance driven, you probably don't care about the performance hit and should use all the neat OO stuff in C++ to help make your code as usable as possible.
In my experience, the main relevant thing is the ability to inline a function. If you have performance/optimization needs that dictate a function needs to be inlined, then you can't make the function virtual because it would prevent that. Otherwise, you probably won't notice the difference.
One thing to note is that this:
boolean contains(A element) {
for (A current : this)
if (element.equals(current))
return true;
return false;
}
may be faster than this:
boolean contains(A element) {
for (A current : this)
if (current.equals(element))
return true;
return false;
}
This is because the first method is only calling one function while the second may be calling many different functions. This applies to any virtual function in any language.
I say "may" because this depends on the compiler, the cache etc.
The performance penalty of using virtual functions can never outweigh the advantages you get at the design level. Supposedly a call to a virtual function would be 25% less efficient than a direct call to a non-virtual function, because of the level of indirection through the vtable. However, the time taken to make the call is normally very small compared to the time taken in the actual execution of your function, so the total performance cost will be negligible, especially with current hardware.
Furthermore, the compiler can sometimes optimise: if it sees that no virtual call is needed, it compiles it into a static call. So don't worry; use virtual functions and abstract classes as much as you need.
I have always asked myself this, especially since, quite a few years ago, I also did such a test comparing the timings of a standard member method call with a virtual one, and was really angry about the results at the time: empty virtual calls were 8 times slower than non-virtual ones.
Today I had to decide whether or not to use a virtual function for allocating more memory in my buffer class, in a very performance critical app, so I googled (and found you), and in the end, did the test again.
// g++ -std=c++0x -o perf perf.cpp -lrt
#include <typeinfo> // typeid
#include <cstdio> // printf
#include <cstdlib> // atoll
#include <ctime> // clock_gettime
struct Virtual { virtual int call() { return 42; } };
struct Inline { inline int call() { return 42; } };
struct Normal { int call(); };
int Normal::call() { return 42; }
template<typename T>
void test(unsigned long long count) {
std::printf("Timing function calls of '%s' %llu times ...\n", typeid(T).name(), count);
timespec t0, t1;
clock_gettime(CLOCK_REALTIME, &t0);
T test;
while (count--) test.call();
clock_gettime(CLOCK_REALTIME, &t1);
t1.tv_sec -= t0.tv_sec;
if (t1.tv_nsec >= t0.tv_nsec) {
t1.tv_nsec -= t0.tv_nsec;
} else { // borrow a second from tv_sec
--t1.tv_sec;
t1.tv_nsec += 1000000000L - t0.tv_nsec;
}
std::printf(" -- result: %ld sec %ld nsec\n", (long)t1.tv_sec, t1.tv_nsec);
}
template<typename T, typename Ua, typename... Un>
void test(unsigned long long count) {
test<T>(count);
test<Ua, Un...>(count);
}
int main(int argc, const char* argv[]) {
test<Inline, Normal, Virtual>(argc == 2 ? atoll(argv[1]) : 10000000000llu);
return 0;
}
And I was really surprised that, in fact, it really does not matter at all anymore.
While it makes sense for inlines to be faster than non-virtuals, and for those to be faster than virtuals, the deciding factor is usually the overall load on the machine and whether your cache holds the necessary data. And while you might be able to optimize at the cache level, I think that should be done by the compiler developers more than by application devs.
Let's say you have a call to a method that calculates a value and returns it :
double calculate1(const double& someArg);
You implement another calculate method that has the same profile as the first one, but works differently :
double calculate2(const double& someArg);
You want to be able to switch from one to the other based on a boolean setting, so you end up with something like this :
double calculate(const double& someArg)
{
if (useFirstVersion) // <-- this is a boolean
return calculate1(someArg); // actual first implementation
else
return calculate2(someArg); // second implementation
}
The boolean might change during runtime, but only rarely.
I notice a small but noticeable performance hit, which I suppose is due to either branch misprediction or cache-unfriendly code.
How can I optimize this to get the best runtime performance?
My thoughts and attempts on this issue :
I tried using a pointer to function to make sure to avoid branch mispredictions :
The idea was when the boolean changes, I update the pointer to function. This way, there is no if/else, we use the pointer directly :
The pointer is defined like this :
double (ClassWeAreIn::*pCalculate)(const double& someArg);
... and the new calculate method becomes like this :
double calculate(const double& someArg)
{
return (this->*pCalculate)(someArg);
}
I tried using it in combination with __forceinline and it did make a difference (though I'm unsure whether that should be expected, as the compiler should have done it already?). Without __forceinline it was the worst performer, and with __forceinline it seemed to be much better.
I thought of making calculate a virtual method with two overrides but I read that virtual methods are not a good way to optimize code as we still have to find the right method to call at runtime. I did not try it though.
However, whichever modifications I did, I never seemed to be able to restore the original performances (maybe it is not possible ?). Is there a design pattern to deal with this in the most optimal way (and possibly the cleaner/easier to maintain the better) ?
A complete example for VS :
main.cpp
#include "stdafx.h"
#include "SomeClass.h"
#include <time.h>
#include <stdlib.h>
#include <chrono>
#include <iostream>
int main()
{
srand(time(NULL));
auto start = std::chrono::steady_clock::now();
SomeClass someClass;
double result;
for (long long i = 0; i < 1000000000; ++i)
result = someClass.calculate(0.784542);
auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> diff = end - start;
std::cout << diff.count() << std::endl;
return 0;
}
SomeClass.cpp
#include "stdafx.h"
#include "SomeClass.h"
#include <math.h>
#include <stdlib.h>
double SomeClass::calculate(const double& someArg)
{
if (useFirstVersion)
return calculate1(someArg);
else
return calculate2(someArg);
}
double SomeClass::calculate1(const double& someArg)
{
return asinf((rand() % 10 + someArg)/10);
}
double SomeClass::calculate2(const double& someArg)
{
return acosf((rand() % 10 + someArg) / 10);
}
SomeClass.h
#pragma once
class SomeClass
{
public:
bool useFirstVersion = true;
double calculate(const double& someArg);
double calculate1(const double& someArg);
double calculate2(const double& someArg);
};
(I did not include the ptr to function in the example since it only seems to make things worse).
Using the example above, I get an average of 14.61 s when calling calculate1 directly in main, versus an average of 15.00 s when calling calculate (with __forceinline, which seems to make the gap smaller).
Since useFirstVersion rarely changes, the execution path of calculate is very easy for most branch-prediction techniques to predict. Performance degrades a bit because of the extra code needed to implement the if/else logic. It also depends on whether the compiler inlines calculate, calculate1, or calculate2. Ideally all of them would be inlined, though this is less likely to happen than when calling calculate1 or calculate2 directly, because the code size is larger. Note that I have not tried to reproduce your results, but there is nothing particularly suspicious about a 3% performance degradation. If you can guarantee that useFirstVersion never changes at runtime, you can turn it into a macro. Otherwise, the idea of calling calculate through a function pointer would eliminate most of the performance overhead. By the way, I don't think MSVC can inline calls through function pointers, yet these functions are good candidates for inlining.
In the end, if you are in the same situation as I was, I would advise the following :
Don't worry about branch misprediction if the right prediction rarely ever changes.
The cost seems to be marginal although I can't really provide exact figures to back it up.
The cost of the overhead of the new intermediary methods can be mitigated by __forceinline in VC++.
I was able to notice the difference, and in the end it was the best way to avoid degrading performance. Only go this way if the methods you are inlining are small, like simple getters and such. I don't know why my compiler wouldn't choose to inline the methods by itself, but __forceinline actually did the trick (even though you cannot be sure the compiler will inline the methods, as __forceinline is only a suggestion to the compiler).
I have a class called WorkerA that works on one image format (let's just call it A, it's rather non-standard). The class has been working well:
class WorkerA
{
public:
void Setup()
{
//some stuff specific to format A
}
void MainTask()
{
//some algorithm that calls GetPixel() a lot
}
//...
protected:
int GetPixel(int x, int y)
{
int value;
//value = ... (gets pixel value in format A)
return value;
}
unsigned char * pBitmapA;
//...
};
Now I need another class that works on image format B. MainTask and a few other functions are the same as WorkerA, but the remaining function needs different implementations. Unsure of the best practice in this scenario, I hacked together something like the following:
class WorkerB : public WorkerA
{
public:
void Setup()
{
//some stuff specific to format B
}
//... (other functions. MainTask not re-implemented.)
protected:
virtual int GetPixel(int x, int y)
{
int value;
//value = ... (gets pixel value in format B)
return value;
}
unsigned char ** pBitmapB; //different format than pBitmapA
};
By this point, I also made WorkerA::GetPixel virtual, to get the correct polymorphic behavior when I call WorkerB::MainTask. However, this one change caused WorkerA::MainTask to run 50% longer than before -- something I really need to avoid.
My question is: how should I rearrange these two classes so that there's as little duplicated code as possible, without the speed penalty? I can completely rewrite WorkerA and WorkerB if necessary (although preferably I'd keep WorkerA's existing interface), but I can't change the image formats.
Generally any imaging code that makes a call for each pixel is going to be slow. If you can, refactor the code so it works with a much larger block, perhaps a raster line at a time.
If you can determine which class to use at compile time rather than run time, you can use the Curiously Recurring Template Pattern (CRTP) to eliminate the overhead of a virtual call.
Not really an answer, but there are themes here worth following.
GetPixel() is a notoriously bad thing from a performance point of view. Seriously consider using algorithms that do not require a heavy dependence on this. Maybe convert it to inline or template or macro if it really has to be.
Are you sure about the benchmarking? The inherent overhead from a virtual function call is a couple of machine instructions, which would not ordinarily cause such a severe impact. Are you sure there isn't something else going on here?
Do you really need virtual? Inheritance is a solution to many problems, not all of which require virtual functions and dynamic binding. Perhaps you can reorganise your code to use static inheritance or templating, at least for a good part of what you need, and avoid the virtual call to GetPixel() entirely.
If you have more info please edit your question accordingly.
I have a class with an enum member variable. One of the member functions bases its behavior on this enum so as a "possible" optimization, I have the two different behaviors as two different functions and I give the class a member function pointer which is set at construction. I simulated this situation like this:
#include <stdint.h> // uint64_t
uint64_t funcA(); // assumed: defined elsewhere, return a timestamp
uint64_t funcB();
enum catMode {MODE_A, MODE_B};
struct cat
{
cat(catMode mode) : stamp_(0), mode_(mode) {}
void
update()
{
stamp_ = (mode_ == MODE_A) ? funcA() : funcB();
}
uint64_t stamp_;
catMode mode_;
};
struct cat2
{
cat2(catMode mode) : stamp_(0), mode_(mode)
{
if (mode_ == MODE_A) // '==', not '=': with '=' the test assigns MODE_A (0) and always picks funcB
func_ = funcA;
else
func_ = funcB;
}
void
update()
{
stamp_ = func_();
}
uint64_t stamp_;
catMode mode_;
uint64_t (*func_)(void);
};
And then I create a cat object and an array of length 32. I traverse the array to bring it into cache, then I call the cat's update method 32 times and store the latencies, measured with rdtsc, in the array...
Then I call a function which loops several hundred times using rand(), usleep(), and some arbitrary strcmp()... then I come back and do the 32-call measurement again.
The result is that the method with the branch seems to always be around 44 +/- 10 cycles whereas the one with the function pointer tends to be around 130. I'm curious as to why this would be the case?
If anything, I would have expected similar performance. Also, templating is hardly an option because full specialization of the real cat class for that one function would be overkill.
Without a complete SSCCE I can't approach this the same way that I usually do with such questions. So the best I can do is speculate:
The core difference between your two cases is that you have a branch vs. a function pointer. The fact that you are seeing a difference at all strongly hints that funcA() and funcB() are very small functions.
Possibility #1:
In the branch version of the code, funcA() and funcB() are being inlined by the compiler. Not only does that skip the function call overhead, but if the functions are trivial enough, the branch could also be completely optimized out as well.
Function pointers, on the other hand, cannot be inlined unless the compiler can resolve them at compile-time.
Possibility #2:
By comparing a branch against a function-pointer, you are putting the branch-predictor against the branch target predictor.
Branch target prediction is not the same as branch prediction. In the branch case, the processor needs to predict which way to branch. In the function pointer case, it needs to predict where to branch to.
It's very likely that your processor's branch predictor is much more accurate than its branch target predictor. But then again, this is all guesswork...
I think it is easier to explain using an example. Let's take a class that models the speed of a Formula 1 car; the interface may look something like:
class SpeedF1
{
public:
explicit SpeedF1(double speed);
double getSpeed() const;
void setSpeed(double newSpeed);
//other stuff as unit
private:
double speed_;
};
Now, a negative speed is not meaningful in this particular case, and neither is a value greater than 500 km/h. In this case the constructor and the setSpeed function may throw exceptions if the value provided is not within a logical range.
I can introduce an extra layer of abstraction and use an extra object instead of a raw double.
The new object will be a wrapper around the double; it is constructed once and never modified.
The interface of the class will be changed to:
class ReasonableSpeed
{
public:
explicit ReasonableSpeed(double speed); //may throw a logic error
double getSpeed() const;
//no setter are provide
private:
double speed_;
};
class SpeedF1
{
public:
explicit SpeedF1(const ReasonableSpeed& speed);
ReasonableSpeed getSpeed() const;
void setSpeed(const ReasonableSpeed& newSpeed);
//other stuff as unit
private:
ReasonableSpeed speed_;
};
With this design SpeedF1 cannot throw; however, I have to pay for an extra object construction every time I want to reset the speed.
For classes where only a limited set of values is reasonable (for example, the months in a calendar) I usually make the constructor private and provide a complete set of static factory functions. In this case that is impossible; another possibility is to implement the null object pattern, but I am not sure it is superior to simply throwing an exception.
Finally, my question is:
What is the best practise to solve this kind of problem?
First off, don’t overestimate the cost of the extra constructor. In fact, this cost should be exactly the cost of initialising a double plus the cost for the validity check. In other words, it is likely equal to using a raw double.
Secondly, lose the setter. Setters – and, to a lesser degree, getters – are almost always anti-patterns. If you need to set a new (maximum) speed, chances are you actually want a new car.
Now, about the actual problem: a throwing constructor is completely fine in principle. Don’t write convoluted code to avoid such a construct.
On the other hand, I also like the idea of self-checking types. This makes the best use of C++'s type system and I'm all in favour of that.
Both alternatives have their advantages. Which one is best really depends on the exact situation. In general, I try to exploit the type system and static type checking as much as possible. In your case, this would mean having an extra type for the speed.
I strongly vote for the second option. This is only my personal opinion without a lot of academic backing. My experience is that setting up a "pure" system that operates on only valid data makes your code a lot cleaner. This can be achieved by using your second approach which ensures that only valid data enters the system.
If your system grows, you may find that ReasonableSpeed gets used in a lot of places (use your discretion, but chances are things actually get reused quite a lot). The second approach will save you a lot of error checking codes in the long term.
If only one class uses ReasonableSpeed, then it seems a bit of overkill.
If many classes use ReasonableSpeed, then it's smart.
Both of your designs yield the same result when an invalid value is used as speed, i.e. they both throw an exception. Applying Occam's razor principle, or the Unix Rule of Parsimony:
Rule of Parsimony: Write a big program only when it is clear by demonstration that nothing else will do.
‘Big’ here has the sense both of large in volume of code and of internal complexity. Allowing programs to get large hurts maintainability. Because people are reluctant to throw away the visible product of lots of work, large programs invite overinvestment in approaches that are failed or suboptimal.
you may like to pick the first simpler approach. Unless you'd like to reuse ReasonableSpeed class.
I would recommend doing this instead:
class SpeedF1
{
public:
explicit SpeedF1(double maxSpeed);
double getSpeed() const;
void accelerate();
void decelerate();
protected:
void setSpeed(double speed);
//other stuff as unit
private:
double maxSpeed_;
double curSpeed_;
};
SpeedF1::SpeedF1(double maxSpeed) : maxSpeed_(maxSpeed), curSpeed_(0.0) { }
double SpeedF1::getSpeed() const { return curSpeed_; }
void SpeedF1::setSpeed(double speed) {
if(speed < 0.0) speed = 0.0;
if(speed > maxSpeed_) speed = maxSpeed_;
curSpeed_ = speed;
}
void SpeedF1::accelerate() {
setSpeed(curSpeed_ + SOME_CONSTANT_VELOCITY);
}
void SpeedF1::decelerate() {
setSpeed(curSpeed_ - SOME_CONSTANT_VELOCITY);
}
Before you cringe at the duplicate title, the other question wasn't suited to what I ask here (IMO). So.
I really want to use virtual functions in my application to make things a hundred times easier (isn't that what OOP is all about? ;)). But I had read somewhere that they come at a performance cost, and seeing nothing but the same old contrived hype of premature optimization, I decided to give it a quick whirl in a small benchmark test using:
CProfiler.cpp
#include "CProfiler.h"
CProfiler::CProfiler(void (*func)(void), unsigned int iterations) {
gettimeofday(&a, 0);
for (;iterations > 0; iterations --) {
func();
}
gettimeofday(&b, 0);
result = (b.tv_sec * (unsigned int)1e6 + b.tv_usec) - (a.tv_sec * (unsigned int)1e6 + a.tv_usec);
}
main.cpp
#include "CProfiler.h"
#include <iostream>
class CC {
protected:
int width, height, area;
};
class VCC {
protected:
int width, height, area;
public:
virtual void set_area () {}
};
class CS: public CC {
public:
void set_area () { area = width * height; }
};
class VCS: public VCC {
public:
void set_area () { area = width * height; }
};
void profileNonVirtual() {
CS *abc = new CS;
abc->set_area();
delete abc;
}
void profileVirtual() {
VCS *abc = new VCS;
abc->set_area();
delete abc;
}
int main() {
int iterations = 5000;
CProfiler prf2(&profileNonVirtual, iterations);
CProfiler prf(&profileVirtual, iterations);
std::cout << prf.result;
std::cout << "\n";
std::cout << prf2.result;
return 0;
}
At first I only did 100 and 10,000 iterations, and the results were worrying: 4 ms for non-virtualised, and 250 ms for the virtualised! I almost went "nooooooo" inside, but then I upped the iterations to around 500,000, and saw the results become almost completely identical (maybe 5% slower without optimization flags enabled).
My question is: why was there such a significant change with a low number of iterations compared to a high number? Was it purely because the virtual functions were hot in the cache at that many iterations?
Disclaimer
I understand that my 'profiling' code is not perfect, but it, as it has, gives an estimate of things, which is all that matters at this level. Also I am asking these questions to learn, not to solely optimize my application.
I believe that your test case is too artificial to be of any great value.
First, inside your profiled function you dynamically allocate and deallocate an object as well as call a function; if you want to profile just the function call, then you should do just that.
Second, you are not profiling a case where a virtual function call represents a viable alternative to a given problem. A virtual function call provides dynamic dispatch. You should try profiling a case such as where a virtual function call is used as an alternative to something using a switch-on-type anti-pattern.
Extending Charles' answer.
The problem here is that your loop is doing more than just testing the virtual call itself (the memory allocation probably dwarfs the virtual call overhead anyway), so his suggestion is to change the code so that only the virtual call is tested.
Here the benchmark function is a template, because templates can be inlined, while calls through function pointers are unlikely to be.
#include <cstdlib> // atoi
#include <iostream>
#include <memory> // std::auto_ptr (prefer std::unique_ptr since C++11)
#include <sys/time.h> // gettimeofday, timeval
template <typename Type>
double benchmark(Type const& t, size_t iterations)
{
timeval a, b;
gettimeofday(&a, 0);
for (;iterations > 0; --iterations) {
t.getArea();
}
gettimeofday(&b, 0);
return (b.tv_sec * (unsigned int)1e6 + b.tv_usec) -
(a.tv_sec * (unsigned int)1e6 + a.tv_usec);
}
Classes:
struct Regular
{
Regular(size_t w, size_t h): _width(w), _height(h) {}
size_t getArea() const;
size_t _width;
size_t _height;
};
// The following line in another translation unit
// to avoid inlining
size_t Regular::getArea() const { return _width * _height; }
struct Base
{
Base(size_t w, size_t h): _width(w), _height(h) {}
virtual size_t getArea() const = 0;
size_t _width;
size_t _height;
};
struct Derived: Base
{
Derived(size_t w, size_t h): Base(w, h) {}
virtual size_t getArea() const;
};
// The following two functions in another translation unit
// to avoid inlining
size_t Derived::getArea() const { return _width * _height; }
std::auto_ptr<Base> generateDerived()
{
return std::auto_ptr<Base>(new Derived(3,7));
}
And the measuring:
int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "Usage: %prog iterations\n";
        return 1;
    }

    Regular regular(3, 7);
    std::auto_ptr<Base> derived = generateDerived();

    double regTime = benchmark<Regular>(regular, atoi(argv[1]));
    double derTime = benchmark<Base>(*derived, atoi(argv[1]));

    std::cout << "Regular: " << regTime << "\nDerived: " << derTime << "\n";
    return 0;
}
Note: this measures the overhead of a virtual call compared to a regular function call. The functionality is different (there is no runtime dispatch in the regular case), but for that very reason it is a worst-case figure for the overhead.
EDIT:
Results of the run (gcc 3.4.2, -O2, on a SLES 10 quad-core server), with the function definitions in another translation unit to prevent inlining:
> ./test 5000000
Regular: 17041
Derived: 17194
Not really convincing.
With a small number of iterations there's a chance that your code gets preempted by some other program running in parallel, that swapping occurs, or that anything else the operating system normally isolates your program from happens, and the time your program spent suspended ends up included in your benchmark results. This is the number one reason why you should run your code something like a dozen million times to measure anything more or less reliably.
There is a performance impact to calling a virtual function, because it does slightly more than calling a regular function. However, the impact is likely to be completely negligible in a real-world application -- even smaller than it appears in the most finely crafted benchmarks.
In a real world application, the alternative to a virtual function is usually going to involve you hand-writing some similar system anyhow, because the behavior of calling a virtual function and calling a non-virtual function differs -- the former changes based on the runtime type of the invoking object. Your benchmark, even disregarding whatever flaws it has, doesn't measure equivalent behavior, only equivalent-ish syntax. If you were to institute a coding policy banning virtual functions you'd either have to write some potentially very roundabout or confusing code (which might be slower) or re-implement a similar kind of runtime dispatch system that the compiler is using to implement virtual function behavior (which is certainly going to be no faster than what the compiler does, in most cases).
I think that this kind of testing is pretty useless, in fact:
1) you are spending time on the profiling itself by invoking gettimeofday();
2) you are not really testing virtual functions, and IMHO this is the worst part.
Why? Because you use virtual functions to avoid writing things such as:
<pseudocode>
switch typeof(object) {
case ClassA: functionA(object);
case ClassB: functionB(object);
case ClassC: functionC(object);
}
</pseudocode>
In your benchmark this "switch"/"if...else" dispatch block is missing, so you never measure the cost of the alternative and don't really get the advantage of virtual functions. This is a scenario where they will always "lose" against non-virtual calls.
To profile this properly, I think you should add something like the code I've posted.
There could be several reasons for the difference in time.
your timing function isn't precise enough
the heap manager may influence the result, because sizeof(VCS) > sizeof(VS). What happens if you move the new / delete out of the loop?
Again, due to size differences, memory cache may indeed be part of the difference in time.
BUT: you should really compare similar functionality. When using virtual functions, you do so for a reason, which is calling a different member function dependent on the object's identity. If you need this functionality, and don't want to use virtual functions, you would have to implement it manually, be it using a function table or even a switch statement. This comes at a cost, too, and that's what you should compare against virtual functions.
When using too few iterations, there is a lot of noise in the measurement. The gettimeofday function is not accurate enough to give you good measurements over only a handful of iterations, not to mention that it records total wall-clock time (which includes time spent suspended while preempted by other threads).
Bottom line, though, you shouldn't come up with some ridiculously convoluted design to avoid virtual functions. They really don't add much overhead. If you have incredibly performance critical code and you know that virtual functions make up most of the time, then perhaps it's something to worry about. In any practical application, though, virtual functions won't be what's making your application slow.
In my opinion, when the number of loop iterations was small, there may have been no context switch; but when you increased the number of iterations, the chances of a context switch occurring became very strong, and that can dominate the reading. For example, if the first program takes 1 second and the second 3 seconds, but a context switch adds 10 seconds to each, then the measured ratio is 13/11 instead of 3/1.