I wanted to know how fast a single-inheritance virtual function call is compared to an equivalent boost::function call. Are they almost the same in performance, or is boost::function slower?
I'm aware that performance may vary from case to case, but as a general rule, which is faster, and to how large a degree?
Thanks,
Guilherme
-- edit
KennyTM's test was sufficiently convincing for me. boost::function doesn't seem to be that much slower than a vcall for my own purposes. Thanks.
As a very special case, consider calling an empty function 10⁸ times.
Code A:
struct X {
    virtual ~X() {}
    virtual void do_x() {};
};

struct Y : public X {}; // for the paranoid.

int main () {
    Y* x = new Y;
    for (int i = 100000000; i >= 0; -- i)
        x->do_x();
    delete x;
    return 0;
}
Code B: (with boost 1.41):
#include <boost/function.hpp>

struct X {
    void do_x() {};
};

int main () {
    X* x = new X;
    boost::function<void (X*)> f;
    f = &X::do_x;
    for (int i = 100000000; i >= 0; -- i)
        f(x);
    delete x;
    return 0;
}
Compile with g++ -O3, then measure with the time command:
Code A takes 0.30 seconds.
Code B takes 0.54 seconds.
Inspecting the assembly code, it seems that the slowness may be due to exception handling and to checking for the possibility that f is empty (NULL). But given that the price of one boost::function call is only 2.4 nanoseconds (on my 2 GHz machine), the actual work in your do_x() would most likely overshadow it. I would say it's not a reason to avoid boost::function.
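For what it's worth, a third variant one could time the same way (a sketch of mine, not part of the original test) calls the same empty function through a plain pointer-to-member, which sidesteps boost::function's empty check and type erasure. Note that at -O3 the compiler may inline this away entirely, so any numbers need care:

struct X {
    void do_x() {}
};

int main () {
    X* x = new X;
    void (X::*pmf)() = &X::do_x;            // raw pointer-to-member, no wrapper object
    for (int i = 100000000; i >= 0; -- i)
        (x->*pmf)();                        // direct call through the member pointer
    delete x;
    return 0;
}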
Related
It is fairly common knowledge that the most powerful tool in a compiler's tool-belt is the inlining of functions into their call sites. But what about doing the reverse? Is it ever done, and if so, when? For example, given:
void foo(int x)
{
    auto y = bar(x);
    baz(y);
}

void bop()
{
    int x;
    auto y = bar(x);
    baz(y);
}
Does it ever make sense for the compiler to abstract this out to
void qux(int x)
{
    auto y = bar(x);
    baz(y);
}

void foo(int x)
{
    qux(x);
}

void bop()
{
    int x;
    qux(x);
}
Yes, for example LLVM has a MachineOutliner optimization pass.
Outlining can make sense even without repeated code, when the outlined section is marked [[unlikely]]. The extra function call is a cost, but one that is rarely paid, while the more likely code that remains fits better in the instruction cache.
Compilers might also assume that an exception is unlikely, and outline the catch.
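As a rough illustration of that point (my own sketch, not from the answer above; handle_error and the GCC/Clang cold attribute are assumptions), the kind of source that benefits looks like this:

#include <cstdio>

// Cold path: discourage inlining so the rare code stays out of the hot region.
__attribute__((cold, noinline))
static void handle_error(int code) {
    std::fprintf(stderr, "error %d\n", code);
}

int process(int value) {
    if (value < 0) [[unlikely]] {   // C++20 attribute: hint that this branch is rare
        handle_error(value);        // rare path becomes an out-of-line call
        return -1;
    }
    return value * 2;               // hot path stays small and cache-friendly
}

Whether the compiler actually outlines or keeps the cold code out of line depends on the target and optimization level; the attributes only make the intent explicit.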
I have to decide between templates and virtual inheritance.
In my situation the trade-offs make it really hard to choose.
Finally, it boiled down to: "How much does a virtual call really cost (in CPU)?"
I found very few resources that dare to measure the vtable cost in actual numbers, e.g. https://stackoverflow.com/a/158644, which points to page 26 of http://www.open-std.org/jtc1/sc22/wg21/docs/TR18015.pdf.
Here is an excerpt from it:
However, this overhead (of virtual) is on the order of 20% and 12% – far less than
the variability between compilers.
Before relying on that figure, I decided to test it myself.
My test code is a little long (~40 lines); you can also see it in action at the links below.
The number is the ratio of the time spent on virtual calls to the time spent on normal calls.
Unexpectedly, the result contradicts what open-std stated.
http://coliru.stacked-crooked.com/a/d4d161464e83933f : 1.58
http://rextester.com/GEZMC77067 (with custom -O2): 1.89
http://ideone.com/nmblnK : 2.79
My own desktop computer (Visual C++, -O2) : around 1.5
Here it is:
#include <iostream>
#include <chrono>
#include <cstdlib>   // rand, RAND_MAX
#include <vector>
using namespace std;

class B2{
public:
    int randomNumber = ((double) rand() / (RAND_MAX)) * 10;
    virtual ~B2() = default;
    virtual int f(int n){ return -n + randomNumber; }
    int g(int n){ return -n + randomNumber; }
};
class C : public B2{
public:
    int f(int n) override { return n - randomNumber; }
};

int main() {
    std::vector<B2*> bs;
    const int numTest = 1000000;
    // build a roughly 50/50 mix of B2 and C so the call targets are unpredictable
    for(int n = 0; n < numTest; n++){
        if(((double) rand() / (RAND_MAX)) > 0.5){
            bs.push_back(new B2());
        }else{
            bs.push_back(new C());
        }
    }
    // time the virtual calls
    auto t1 = std::chrono::system_clock::now();
    int s = 0;
    for(int n = 0; n < numTest; n++){
        s += bs[n]->f(n);
    }
    // time the non-virtual calls
    auto t2 = std::chrono::system_clock::now();
    for(int n = 0; n < numTest; n++){
        s += bs[n]->g(n);
    }
    auto t3 = std::chrono::system_clock::now();
    auto t21 = t2 - t1;
    auto t32 = t3 - t2;
    std::cout << t21.count() << " " << t32.count() << " ratio=" << (((float)t21.count()) / t32.count()) << std::endl;
    std::cout << s << std::endl;   // print s so the calls cannot be optimized away
    for(int n = 0; n < numTest; n++){
        delete bs[n];
    }
}
Question
Is it to be expected that a virtual call is at least 50% slower than a normal call?
Did I test it the wrong way?
I have also read:
AI Applications in C++: How costly are virtual functions? What are the possible optimizations?
Virtual functions and performance - C++
I wrote the following benchmark to estimate the overhead of virtual functions:
struct A{
    int i = 0 ;
    virtual void inc() __attribute__((noinline));
};

#ifdef VIRT
struct B : public A{
    void inc() override __attribute__((noinline));
};
void A::inc() { }
void B::inc() { i++; }
#else
void A::inc() { i++; }
#endif

int main(){
#ifdef VIRT
    B b;
    A* p = &b;
#else
    A a;
    A* p = &a;
#endif
    for( ;p->i < IT; p->inc()) {; }
    return 0;
}
I compile it with
G=$((1000**3))
g++ -O1 -DIT=$((1*G)) -DVIRT virt.cc -o virt
g++ -O1 -DIT=$((1*G)) virt.cc -o nonvirt
And the results I got were that nonvirt was about 0.6 ns slower than virt per function call at -O1, and about 0.3 ns slower per call at -O2.
How is this possible? I thought virtual functions were supposed to be slower.
First, just because you invoke a method through a pointer doesn't mean the compiler can't figure out the target type and make the call non-virtual. Plus, your program does nothing else, so everything will be well-predicted and in cache. Finally, a difference of 0.3 ns is one cycle, which is hardly worth noting. If you really want to dig into it, you could inspect the assembly code for each case on whatever your platform is.
On my system (Clang, OS X, old Macbook Air), the virtual case is a little slower, but it's hardly measurable with -O1 (e.g. 3.7 vs 3.6 seconds for non-virtual). And with -O2 there's no difference I can distinguish.
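If you want to be sure the call stays virtual when measuring, one common trick (a sketch of mine, not from the answer above; the file names and make() factory are assumptions) is to hide the dynamic type behind a factory in a separate translation unit, so the compiler cannot devirtualize at the call site. Without link-time optimization, the call below has to go through the vtable:

// virt.h: shared declarations
struct A {
    int i = 0;
    virtual ~A() = default;
    virtual void inc();
};
struct B : A {
    void inc() override;
};
A* make();

// factory.cc: the only place that knows the dynamic type
#include "virt.h"
void A::inc() { }
void B::inc() { i++; }
A* make() { return new B; }

// main.cc: the compiler only sees "some A*", so p->inc() stays a true virtual call
#include "virt.h"
int main() {
    A* p = make();
    for (; p->i < 1000000000; p->inc()) { }
    delete p;
    return 0;
}

Compile the two .cc files separately (and without -flto), then time main as before.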
EDIT: Has been corrected
Your main is wrong: the for loop is defined twice in one case and only once in the other. This probably does not affect the timings, since the second loop exits immediately, but it should still be fixed.
Correct it like this:
int main(){
#ifdef VIRT
    B b;
    A* p = &b;
    /* removed this for loop */
#else
    A a;
    A* p = &a;
#endif
    for( ;p->i < IT; p->inc()) {; }
    return 0;
}
I'm considering a type erasure setup that uses typeid to resolve the type like so...
#include <iostream>
#include <typeinfo>

struct BaseThing
{
    virtual ~BaseThing() = 0;   // pure virtual, to keep the base abstract
};
BaseThing::~BaseThing() {}      // the destructor still needs a definition

template<typename T>
struct Thing : public BaseThing
{
    T x;
};

struct A{};
struct B{};

int main()
{
    BaseThing* pThing = new Thing<B>();
    const std::type_info& x = typeid(*pThing);
    if( x == typeid(Thing<B>))
    {
        std::cout << "pThing is a Thing<B>!\n";
        Thing<B>* pB = static_cast<Thing<B>*>(pThing);
    }
    else if( x == typeid(Thing<A>))
    {
        std::cout << "pThing is a Thing<A>!\n";
        Thing<A>* pA = static_cast<Thing<A>*>(pThing);
    }
}
I've never seen anyone else do this. The alternative would be for BaseThing to have a pure virtual GetID() which would be used to deduce the type instead of using typeid. In this situation, with only 1 level of inheritance, what's the cost of typeid vs the cost of a virtual function call? I know typeid uses the vtable somehow, but how exactly does it work?
This would be desirable instead of GetID() because it takes quite a bit of hackery to try to make sure the IDs are unique and deterministic.
The alternative would be for BaseThing to have a pure virtual GetID() which would be used to deduce the type instead of using typeid. In this situation, with only 1 level of inheritance, what's the cost of typeid vs the cost of a virtual function call? I know typeid uses the vtable somehow, but how exactly does it work?
On Linux and Mac, or anything else using the Itanium C++ ABI, typeid(x) compiles into two load instructions — it simply loads the vptr (that is, the address of some vtable) from the first 8 bytes of object x, and then loads the -1th pointer from that vtable. That pointer is &typeid(x). This is one function call less expensive than calling a virtual method.
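To make "two loads" concrete, here is a deliberately non-portable sketch of my own that mimics that lowering by hand. It relies on Itanium-ABI layout details and is for illustration only, not for real code:

#include <cstdio>
#include <typeinfo>

struct Base { virtual ~Base() = default; };
struct Child : Base {};

// Under the Itanium C++ ABI, the vptr is the first pointer-sized field of a
// polymorphic object, and vtable slot -1 holds the std::type_info pointer.
const std::type_info* typeid_by_hand(void* obj) {
    void** vtable = *static_cast<void***>(obj);              // load 1: the vptr
    return static_cast<const std::type_info*>(vtable[-1]);   // load 2: &typeid(*obj)
}

int main() {
    Base* p = new Child;
    std::printf("%s\n%s\n", typeid(*p).name(), typeid_by_hand(p)->name());
    delete p;
}

On an Itanium-ABI platform both lines should print the same mangled name; elsewhere, all bets are off.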
On Windows, it involves on the order of four load instructions and a couple of (negligible) ALU ops, because the Microsoft C++ ABI is a bit more enterprisey. (source) This might end up being on par with a virtual method call, honestly. But that's still dirt cheap compared to a dynamic_cast.
A dynamic_cast involves a function call into the C++ runtime, which has a lot of loads and conditional branches and such.
So yes, exploiting typeid will be much, much faster than dynamic_cast. Will it be correct for your use-case? That's questionable. (See the other answers about Liskov substitutability and such.) But will it be fast? Yes.
Here, I took the toy benchmark code from Vaughn's highly-rated answer and made it into an actual benchmark, avoiding the obvious loop-hoisting optimization that borked all his timings. Result, for libc++abi on my Macbook:
$ g++ test.cc -lbenchmark -std=c++14; ./a.out
Run on (4 X 2400 MHz CPU s)
2017-06-27 20:44:12
Benchmark Time CPU Iterations
---------------------------------------------------------
bench_dynamic_cast 70407 ns 70355 ns 9712
bench_typeid 31205 ns 31185 ns 21877
bench_id_method 30453 ns 29956 ns 25039
$ g++ test.cc -lbenchmark -std=c++14 -O3; ./a.out
Run on (4 X 2400 MHz CPU s)
2017-06-27 20:44:27
Benchmark Time CPU Iterations
---------------------------------------------------------
bench_dynamic_cast 57613 ns 57591 ns 11441
bench_typeid 12930 ns 12844 ns 56370
bench_id_method 20942 ns 20585 ns 33965
(Lower ns is better. You can ignore the latter two columns: "CPU" just shows that it's spending all its time running and no time waiting, and "Iterations" is just the number of runs it took to get a good margin of error.)
You can see that typeid thrashes dynamic_cast even at -O0, but when you turn on optimizations, it does even better — because the compiler can optimize any code that you write. All that ugly code hidden inside libc++abi's __dynamic_cast function can't be optimized by the compiler any more than it already has been, so turning on -O3 didn't help much.
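The test.cc itself isn't reproduced above, so here is a minimal sketch of what such a Google Benchmark harness could look like; the benchmark bodies, the single-object setup, and the use of DoNotOptimize are my assumptions, not the original code:

#include <benchmark/benchmark.h>
#include <typeinfo>

struct BaseThing { virtual ~BaseThing() = default; };
template <typename T> struct Thing : BaseThing { T x{}; };
struct A {}; struct B {};

static void bench_dynamic_cast(benchmark::State& state) {
    BaseThing* p = new Thing<B>();
    for (auto _ : state) {
        Thing<B>* q = dynamic_cast<Thing<B>*>(p);   // the operation under test
        benchmark::DoNotOptimize(q);                // keep the result observable
    }
    delete p;
}
BENCHMARK(bench_dynamic_cast);

static void bench_typeid(benchmark::State& state) {
    BaseThing* p = new Thing<B>();
    for (auto _ : state) {
        bool isB = (typeid(*p) == typeid(Thing<B>));
        benchmark::DoNotOptimize(isB);
    }
    delete p;
}
BENCHMARK(bench_typeid);

BENCHMARK_MAIN();

Built the same way as shown above (with -lbenchmark), each bench_* function appears as one row of output; the absolute numbers would of course differ from the table, since the original presumably does more work per iteration.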
Typically, you don't just want to know the type, but also do something with the object as that type. In that case, dynamic_cast is more useful:
int main()
{
    BaseThing* pThing = new Thing<B>();
    if(Thing<B>* pThingB = dynamic_cast<Thing<B>*>(pThing)) {
        // Do something with pThingB
    }
    else if(Thing<A>* pThingA = dynamic_cast<Thing<A>*>(pThing)) {
        // Do something with pThingA
    }
}
I think this is why you rarely see typeid used in practice.
Update:
Since this question concerns performance, I ran some benchmarks on g++ 4.5.1, with this code:
#include <typeinfo>

struct Base {
    virtual ~Base() { }
    virtual int id() const = 0;
};

template <class T> struct Id;
template<> struct Id<int> { static const int value = 1; };
template<> struct Id<float> { static const int value = 2; };
template<> struct Id<char> { static const int value = 3; };
template<> struct Id<unsigned long> { static const int value = 4; };

template <class T>
struct Derived : Base {
    virtual int id() const { return Id<T>::value; }
};

static const int count = 100000000;

static int test1(Base *bp)
{
    int total = 0;
    for (int iter=0; iter!=count; ++iter) {
        if (Derived<int>* dp = dynamic_cast<Derived<int>*>(bp)) {
            total += 5;
        }
        else if (Derived<float> *dp = dynamic_cast<Derived<float>*>(bp)) {
            total += 7;
        }
        else if (Derived<char> *dp = dynamic_cast<Derived<char>*>(bp)) {
            total += 2;
        }
        else if (Derived<unsigned long> *dp = dynamic_cast<Derived<unsigned long>*>(bp)) {
            total += 9;
        }
    }
    return total;
}

static int test2(Base *bp)
{
    int total = 0;
    for (int iter=0; iter!=count; ++iter) {
        const std::type_info& type = typeid(*bp);
        if (type==typeid(Derived<int>)) {
            total += 5;
        }
        else if (type==typeid(Derived<float>)) {
            total += 7;
        }
        else if (type==typeid(Derived<char>)) {
            total += 2;
        }
        else if (type==typeid(Derived<unsigned long>)) {
            total += 9;
        }
    }
    return total;
}

static int test3(Base *bp)
{
    int total = 0;
    for (int iter=0; iter!=count; ++iter) {
        int id = bp->id();
        switch (id) {
            case 1: total += 5; break;
            case 2: total += 7; break;
            case 3: total += 2; break;
            case 4: total += 9; break;
        }
    }
    return total;
}
Without optimization, I got these runtimes:
test1: 2.277s
test2: 0.629s
test3: 0.469s
With optimization -O2, I got these runtimes:
test1: 0.118s
test2: 0.220s
test3: 0.290s
So it appears that dynamic_cast is the fastest method when using optimization with this compiler.
In almost all cases you don't want the exact type; you want to make sure that it's of the given type or any type derived from it. If an object of a type derived from it cannot be substituted for an object of the type in question, then you are violating the Liskov Substitution Principle, which is one of the most fundamental rules of proper OO design.
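For contrast, a sketch of the substitution-friendly shape of the question's hierarchy: the work moves into a virtual function on BaseThing, so callers never need to recover T (process() and its body are illustrative assumptions, not the question's code):

#include <iostream>
#include <typeinfo>

struct BaseThing {
    virtual ~BaseThing() = default;
    virtual void process() = 0;          // behavior, not identity
};

template <typename T>
struct Thing : BaseThing {
    T x{};
    void process() override {
        // Whatever "do something with pThingB/pThingA" was, it lives here,
        // with full knowledge of T and no casts at the call site.
        std::cout << "processing a Thing<" << typeid(T).name() << ">\n";
    }
};

struct A {}; struct B {};

int main() {
    BaseThing* pThing = new Thing<B>();
    pThing->process();                   // works unchanged for any future Thing<T>
    delete pThing;
}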
Simple example in D:
import std.stdio, std.conv, core.memory;

class Foo{
    int x;
    this(int _x){x=_x;}
}

void main(string args[]) {
    GC.disable();
    int n = to!int(args[1]);
    Foo[] m = new Foo[n];
    for(int i=0;i<n;i++){
        m[i] = new Foo(i);
    }
}
C++ code:
#include <cstdlib>
using namespace std;

class Foo{
public:
    int x;
    Foo(int _x);
};

Foo::Foo(int _x){
    x = _x;
}

int main(int argc, char** argv) {
    int n = atoi(argv[1]);
    Foo** gx = new Foo*[n];
    for(int i=0;i<n;i++){
        gx[i] = new Foo(i);
    }
    return 0;
}
No compilation flags.
Compiling and running:
>dmd td.d
>time ./td 10000000
>real 0m2.544s
Analogous example in C++ (gcc), running:
>time ./tc 10000000
>real 0m0.523s
Why? Such a simple example, and such a big difference: 2.54s and 0.52s.
You're mainly measuring three differences:
The difference between the code generated by gcc and dmd
The extra time D takes to allocate using the GC.
The extra time D takes to allocate a class.
Now, you might think that point 2 is invalid because you used GC.disable();, but this only makes it so that the GC won't collect as it normally does. It does not make the GC disappear entirely and automatically redirect all memory allocations to C's malloc. It still must do most of what it normally does to ensure that the GC knows about the memory allocated, and all that takes time. Normally, this is a relatively insignificant part of program execution (even ignoring the benefits GCs give). However, your benchmark makes it the entirety of the program which exaggerates this effect.
Therefore, I suggest you consider two changes to your approach:
Either switch to using gdc to compare against gcc or switch to dmc to compare to dmd
Make the programs more equivalent. Either have both D and C++ allocate structs on the heap or, at the very least, make it so that D is allocating without touching the GC. If you're optimizing a program for maximum speed, you'd be using structs and C's malloc anyway, regardless of language.
I'd even recommend a 3rd change: since you're interested in maximum performance, you ought to try to come up with a better program entirely. Why not switch to structs and have them located contiguously in memory? This would make allocation (which is, essentially, the entire program) as fast as possible.
Running your code above with dmd and dmc on my machine results in the following times:
DMC 8.42n (no flags) : ~880ms
DMD 2.062 (no flags) : ~1300ms
Modifying the code to the following:
C++ code:
#include <cstdlib>

struct Foo {
    int x;
};

int main(int argc, char** argv) {
    int n = atoi(argv[1]);
    Foo* gx = (Foo*) malloc(n * sizeof(Foo));
    for(int i = 0; i < n; i++) {
        gx[i].x = i;
    }
    free(gx);
    return 0;
}
D code:
import std.conv;

struct Foo{
    int x;
}

void main(string args[]) {
    int n = to!int(args[1]);
    Foo[] m = new Foo[](n);
    foreach(i, ref e; m) {
        e.x = i;
    }
}
Use of my code using DMD & DMC results in the following times:
DMC 8.42n (no flags) : ~95ms +- 20ms
DMD 2.062 (no flags) : ~95ms +- 20ms
Essentially identical (I'd have to start using some statistics to give you a better idea of which one is truly faster, but at this scale it's irrelevant). Notice that this is much, much faster than the naive approach, and D is equally capable of using this strategy. In this case the run-time difference is negligible, yet we retain the benefits of using a GC, and there are definitely far fewer things that could go wrong when writing the D code. (Notice how your program failed to delete all of its allocations?)
Furthermore, if you absolutely wanted to, D allows you to use C's standard library via import std.c.stdlib; this would let you truly bypass the GC and achieve maximum performance by using C's malloc, if necessary. In this case it's not necessary, so I erred on the side of safer, more readable code.
try this one:
import std.stdio, std.conv, core.memory;

class Foo{
    int x = void;
    this(in int _x){x=_x;}
}

void main(string args[]) {
    GC.disable();
    int n = to!int(args[1]);
    Foo[] m = new Foo[n];
    foreach(i; 0..n){
        m[i] = new Foo(i);
    }
}