I wrote the following benchmark to estimate the overhead of virtual functions:
struct A {
    int i = 0;
    virtual void inc() __attribute__((noinline));
};

#ifdef VIRT
struct B : public A {
    void inc() override __attribute__((noinline));
};
void A::inc() { }
void B::inc() { i++; }
#else
void A::inc() { i++; }
#endif

int main() {
#ifdef VIRT
    B b;
    A* p = &b;
#else
    A a;
    A* p = &a;
#endif
    for (; p->i < IT; p->inc()) { }
    return 0;
}
I compile it with
G=$((1000**3))
g++ -O1 -DIT=$((1*G)) -DVIRT virt.cc -o virt
g++ -O1 -DIT=$((1*G)) virt.cc -o nonvirt
The results I got were that nonvirt was about 0.6 ns per function call slower than virt at -O1, and about 0.3 ns per call slower at -O2.
How is this possible? I thought virtual functions were supposed to be slower.
First, just because you invoke a method through a pointer doesn't mean the compiler can't figure out the target type and make the call non-virtual. Plus, your program does nothing else, so everything will be well-predicted and in cache. Finally, a difference of 0.3 ns is one cycle, which is hardly worth noting. If you really want to dig into it, you could inspect the assembly code for each case on whatever your platform is.
On my system (Clang, OS X, old Macbook Air), the virtual case is a little slower, but it's hardly measurable with -O1 (e.g. 3.7 vs 3.6 seconds for non-virtual). And with -O2 there's no difference I can distinguish.
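To make this kind of comparison less prone to devirtualization, one option is to hide the dynamic type behind a noinline factory and time the loop directly. This is a sketch of my own, not code from the question; it assumes a GCC/Clang-style noinline attribute, and LTO or whole-program optimization could still defeat it:

```cpp
#include <chrono>
#include <cstdio>

struct Base {
    long n = 0;
    virtual ~Base() = default;
    virtual void inc() { ++n; }
};

struct Derived : Base {
    void inc() override { ++n; }
};

// noinline keeps the dynamic type opaque to the caller, so the
// call in the loop below stays a genuine virtual dispatch.
__attribute__((noinline)) Base* make() { return new Derived; }

long run(long iters) {
    Base* p = make();
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) p->inc();   // virtual call under test
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("%.2f ns/call\n", ns / iters);
    long total = p->n;
    delete p;
    return total;
}
```

Calling run(1000000000) would mirror the question's iteration count; the returned counter doubles as a check that the calls actually happened.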
EDIT: this has since been corrected.
Your main was wrong: the for loop was written twice in one case and once in the other. This shouldn't impact performance, though, since the second loop would exit immediately, right?
Correct it like this:
int main() {
#ifdef VIRT
    B b;
    A* p = &b;
    /* removed this for loop */
#else
    A a;
    A* p = &a;
#endif
    for (; p->i < IT; p->inc()) { }
    return 0;
}
Related
I have to decide whether to use templates or virtual inheritance.
In my situation, the trade-offs make it really hard to choose.
Finally, it boiled down to: "How much does a virtual call really cost (CPU-wise)?"
I found very few resources that dare to put an actual number on the vtable cost, e.g. https://stackoverflow.com/a/158644, which points to page 26 of http://www.open-std.org/jtc1/sc22/wg21/docs/TR18015.pdf.
Here is an excerpt from it:
However, this overhead (of virtual) is on the order of 20% and 12% – far less than
the variability between compilers.
Before relying on that figure, I decided to test it myself.
My test code is a little long (~40 lines); you can also see it in action via the links below.
The number is the ratio of time taken by virtual calls divided by that of normal calls.
Unexpectedly, the results contradict what open-std states.
http://coliru.stacked-crooked.com/a/d4d161464e83933f : 1.58
http://rextester.com/GEZMC77067 (with custom -O2): 1.89
http://ideone.com/nmblnK : 2.79
My own desktop computer (Visual C++, -O2) : around 1.5
Here it is:
#include <iostream>
#include <chrono>
#include <vector>
#include <cstdlib>
using namespace std;

class B2 {
public:
    int randomNumber = ((double)rand() / RAND_MAX) * 10;
    virtual ~B2() = default;
    virtual int f(int n) { return -n + randomNumber; }
    int g(int n) { return -n + randomNumber; }
};

class C : public B2 {
public:
    int f(int n) override { return n - randomNumber; }
};

int main() {
    std::vector<B2*> bs;
    const int numTest = 1000000;
    for (int n = 0; n < numTest; n++) {
        if (((double)rand() / RAND_MAX) > 0.5) {
            bs.push_back(new B2());
        } else {
            bs.push_back(new C());
        }
    }
    auto t1 = std::chrono::system_clock::now();
    int s = 0;
    for (int n = 0; n < numTest; n++) {
        s += bs[n]->f(n);
    }
    auto t2 = std::chrono::system_clock::now();
    for (int n = 0; n < numTest; n++) {
        s += bs[n]->g(n);
    }
    auto t3 = std::chrono::system_clock::now();
    auto t21 = t2 - t1;
    auto t32 = t3 - t2;
    std::cout << t21.count() << " " << t32.count()
              << " ratio=" << (((float)t21.count()) / t32.count()) << std::endl;
    std::cout << s << std::endl;
    for (int n = 0; n < numTest; n++) {
        delete bs[n];
    }
}
Question
Is it to be expected that virtual calls are at least 50% slower than normal calls?
Did I test it the wrong way?
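One confound worth checking (my own hypothesis, not something the question establishes): g() is a plain member function the compiler can inline into the timing loop, while f() cannot be inlined through the base pointer, so the measured ratio partly reflects inlining rather than dispatch alone. A sketch that blocks inlining on both sides, assuming a GCC/Clang-style noinline attribute and a fixed "random" value for reproducibility:

```cpp
struct B2 {
    int randomNumber = 3;  // fixed value instead of rand(), for reproducibility
    virtual ~B2() = default;
    virtual int f(int n) { return -n + randomNumber; }
    // inlining blocked, so the non-virtual loop pays a real call too
    __attribute__((noinline)) int g(int n) { return -n + randomNumber; }
};

struct C : B2 {
    int f(int n) override { return n - randomNumber; }
};

// with both sides paying a real call, the remaining gap is closer
// to the cost of the indirect dispatch itself
int sum_virtual(B2* b, int iters) {
    int s = 0;
    for (int n = 0; n < iters; ++n) s += b->f(n);
    return s;
}

int sum_direct(B2* b, int iters) {
    int s = 0;
    for (int n = 0; n < iters; ++n) s += b->g(n);
    return s;
}
```

Timing these two loops over the same mixed vector of B2*/C* objects would isolate dispatch cost more fairly.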
I have also read:
AI Applications in C++: How costly are virtual functions? What are the possible optimizations?
Virtual functions and performance - C++
Preamble
I'm using avr-g++ for programming AVR microcontrollers and therefore I always need to get very efficient code.
GCC can usually optimize a function if its arguments are compile-time constants. For example, I have a function pin_write(uint8_t pin, bool val) which determines the AVR registers for the pin (using my special map from integer pin to a port/pin pair) and writes the corresponding values to those registers. This function isn't small, because of its generality. But if I call it with a compile-time constant pin and val, GCC can do all the calculations at compile time and reduce the call to a couple of AVR instructions, e.g.
sbi PORTB,1
sbi DDRB,1
Amble
Let's write a code like this:
class A {
    int x;
public:
    A(int x_): x(x_) {}
    void foo() { pin_write(x, 1); }
};

A a(8);

int main() {
    a.foo();
}
We have only one object of class A and it's initialized with a constant (8). So, it's possible to make all calculations at compile-time:
foo() -> pin_write(x,1) -> pin_write(8,1) -> a couple of asm instructions
But GCC doesn't do so.
Surprisingly, if I remove the global A a(8) and write just
A(8).foo()
I get exactly what I want:
00000022 <main>:
22: c0 9a sbi 0x18, 0 ; 24
24: b8 9a sbi 0x17, 0 ; 23
Question
So, is there a way to force GCC make all possible calculation at compile-time for single global objects with constant initializers?
Because of this trouble I have to manually expand such cases and replace my original code with this:
const int x = 8;
class A {
public:
    A() {}
    void foo() { pin_write(x, 1); }
};
UPD. It's quite wonderful: A(8).foo() inside main is optimized down to 2 asm instructions, and so is A a(8); a.foo()! But if I declare A a(8) as global, the compiler produces the big general code. I tried adding static; it didn't help. Why?
In my experience, gcc is very reluctant to optimize if the object / function has external linkage. Since we don't have your code to compile, I made a slightly modified version of your code:
#include <cstdio>
class A {
    int x;
public:
    A(int x_): x(x_) {}
    int f() { return x*x; }
};

A a(8);

int main() {
    printf("%d", a.f());
}
I have found 2 ways to get the generated assembly to correspond to this:
int main() {
    printf("%d", 64);
}
In words: to eliminate everything at compile time so that only the necessary minimum remains.
One way to achieve this, both with clang and gcc, is:
#include <cstdio>

class A {
    int x;
public:
    constexpr A(int x_): x(x_) {}
    constexpr int f() const { return x*x; }
};

constexpr A a(8);

int main() {
    printf("%d", a.f());
}
gcc 4.7.2 already eliminates everything at -O1, clang 3.5 trunk needs -O2.
Another way to achieve this is:
#include <cstdio>

class A {
    int x;
public:
    A(int x_): x(x_) {}
    int f() const { return x*x; }
};

static const A a(8);

int main() {
    printf("%d", a.f());
}
It only works with clang at -O3. Apparently the constant folding in gcc is not that aggressive. (As clang shows, it can be done but gcc 4.7.2 did not implement it.)
You can force the compiler to fully optimize the function with all known constants by changing the pin_write function into a template. I don't know if the particular behavior is guaranteed by the standard though.
template< int a, int b >
void pin_write() { some_instructions; }
This will probably require fixing all lines where pin_write is used.
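For illustration, a minimal sketch of what the templated call site could look like. The pin_mask helper and the port/bit arithmetic here are placeholders I made up, not the asker's real register mapping:

```cpp
// Pin is now a template parameter, so the pin-to-register math below
// is guaranteed to be folded at compile time.
template<int Pin>
constexpr unsigned pin_mask() {
    static_assert(Pin >= 0 && Pin < 32, "pin out of range");
    return 1u << (Pin % 8);                     // placeholder for the real port/bit map
}

template<int Pin, bool Val>
void pin_write() {
    constexpr unsigned mask = pin_mask<Pin>();  // computed at compile time
    (void)mask;                                 // real code would write the register here
}
```

Call sites change from pin_write(8, 1) to pin_write<8, 1>(), which is the "fixing all lines" cost mentioned above.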
Additionally, you can declare the function as inline. The compiler isn't guaranteed to inline the function (the inline keyword is just a hint), but if it does, it has a greater chance of optimizing compile-time constants away (assuming the compiler can tell it is a compile-time constant, which may not always be the case).
Your a has external linkage, so the compiler can't be sure that there isn't other code somewhere modifying it.
If you were to declare a const then you make clear it shouldn't change, and also stop it having external linkage; both of those should help the compiler to be less pessimistic.
(I'd probably declare x const too - it may not help here, but if nothing else it makes it clear to the compiler and the next reader of the code that you never change it.)
I'm considering a type erasure setup that uses typeid to resolve the type like so...
#include <iostream>
#include <typeinfo>

struct BaseThing
{
    virtual ~BaseThing() = 0;  // pure, but still needs a definition
};
BaseThing::~BaseThing() {}

template<typename T>
struct Thing : public BaseThing
{
    T x;
};
struct A{};
struct B{};

int main()
{
    BaseThing* pThing = new Thing<B>();
    const std::type_info& x = typeid(*pThing);
    if (x == typeid(Thing<B>))
    {
        std::cout << "pThing is a Thing<B>!\n";
        Thing<B>* pB = static_cast<Thing<B>*>(pThing);
    }
    else if (x == typeid(Thing<A>))
    {
        std::cout << "pThing is a Thing<A>!\n";
        Thing<A>* pA = static_cast<Thing<A>*>(pThing);
    }
}
I've never seen anyone else do this. The alternative would be for BaseThing to have a pure virtual GetID() which would be used to deduce the type instead of using typeid. In this situation, with only 1 level of inheritance, what's the cost of typeid vs the cost of a virtual function call? I know typeid uses the vtable somehow, but how exactly does it work?
This would be desirable instead of GetID() because it takes quite a bit of hackery to try to make sure the IDs are unique and deterministic.
The alternative would be for BaseThing to have a pure virtual GetID() which would be used to deduce the type instead of using typeid. In this situation, with only 1 level of inheritance, what's the cost of typeid vs the cost of a virtual function call? I know typeid uses the vtable somehow, but how exactly does it work?
On Linux and Mac, or anything else using the Itanium C++ ABI, typeid(x) compiles into two load instructions — it simply loads the vptr (that is, the address of some vtable) from the first 8 bytes of object x, and then loads the -1th pointer from that vtable. That pointer is &typeid(x). This is one function call less expensive than calling a virtual method.
On Windows, it involves on the order of four load instructions and a couple of (negligible) ALU ops, because the Microsoft C++ ABI is a bit more enterprisey. (source) This might end up being on par with a virtual method call, honestly. But that's still dirt cheap compared to a dynamic_cast.
A dynamic_cast involves a function call into the C++ runtime, which has a lot of loads and conditional branches and such.
So yes, exploiting typeid will be much much faster than dynamic_cast. Will it be correct for your use-case?— that's questionable. (See the other answers about Liskov substitutability and such.) But will it be fast?— yes.
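A tiny sketch of the pattern under discussion (the type names here are mine): typeid applied to a polymorphic lvalue compares exact dynamic types, with the cost profile described above:

```cpp
#include <typeinfo>

struct Base { virtual ~Base() = default; };
struct Derived : Base {};
struct MoreDerived : Derived {};

// typeid(b) reads the vptr and compares type_info identities:
// an exact-type match only, unlike dynamic_cast
bool is_exactly_derived(const Base& b) {
    return typeid(b) == typeid(Derived);
}
```

Note that the operand must be a polymorphic glvalue (a dereferenced pointer or a reference), otherwise typeid is resolved statically and never touches the vtable.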
Here, I took the toy benchmark code from Vaughn's highly-rated answer and made it into an actual benchmark, avoiding the obvious loop-hoisting optimization that borked all his timings. Result, for libc++abi on my Macbook:
$ g++ test.cc -lbenchmark -std=c++14; ./a.out
Run on (4 X 2400 MHz CPU s)
2017-06-27 20:44:12
Benchmark Time CPU Iterations
---------------------------------------------------------
bench_dynamic_cast 70407 ns 70355 ns 9712
bench_typeid 31205 ns 31185 ns 21877
bench_id_method 30453 ns 29956 ns 25039
$ g++ test.cc -lbenchmark -std=c++14 -O3; ./a.out
Run on (4 X 2400 MHz CPU s)
2017-06-27 20:44:27
Benchmark Time CPU Iterations
---------------------------------------------------------
bench_dynamic_cast 57613 ns 57591 ns 11441
bench_typeid 12930 ns 12844 ns 56370
bench_id_method 20942 ns 20585 ns 33965
(Lower ns is better. You can ignore the latter two columns: "CPU" just shows that it's spending all its time running and no time waiting, and "Iterations" is just the number of runs it took to get a good margin of error.)
You can see that typeid thrashes dynamic_cast even at -O0, but when you turn on optimizations, it does even better — because the compiler can optimize any code that you write. All that ugly code hidden inside libc++abi's __dynamic_cast function can't be optimized by the compiler any more than it already has been, so turning on -O3 didn't help much.
Typically, you don't just want to know the type, but also do something with the object as that type. In that case, dynamic_cast is more useful:
int main()
{
    BaseThing* pThing = new Thing<B>();
    if (Thing<B>* pThingB = dynamic_cast<Thing<B>*>(pThing))
    {
        // Do something with pThingB
    }
    else if (Thing<A>* pThingA = dynamic_cast<Thing<A>*>(pThing))
    {
        // Do something with pThingA
    }
}
I think this is why you rarely see typeid used in practice.
Update:
Since this question concerns performance. I ran some benchmarks on g++ 4.5.1. With this code:
struct Base {
    virtual ~Base() { }
    virtual int id() const = 0;
};

template <class T> struct Id;
template<> struct Id<int>           { static const int value = 1; };
template<> struct Id<float>         { static const int value = 2; };
template<> struct Id<char>          { static const int value = 3; };
template<> struct Id<unsigned long> { static const int value = 4; };

template <class T>
struct Derived : Base {
    virtual int id() const { return Id<T>::value; }
};

static const int count = 100000000;

static int test1(Base *bp)
{
    int total = 0;
    for (int iter = 0; iter != count; ++iter) {
        if (Derived<int>* dp = dynamic_cast<Derived<int>*>(bp)) {
            total += 5;
        }
        else if (Derived<float> *dp = dynamic_cast<Derived<float>*>(bp)) {
            total += 7;
        }
        else if (Derived<char> *dp = dynamic_cast<Derived<char>*>(bp)) {
            total += 2;
        }
        else if (
            Derived<unsigned long> *dp = dynamic_cast<Derived<unsigned long>*>(bp)
        ) {
            total += 9;
        }
    }
    return total;
}

static int test2(Base *bp)
{
    int total = 0;
    for (int iter = 0; iter != count; ++iter) {
        const std::type_info& type = typeid(*bp);
        if (type == typeid(Derived<int>)) {
            total += 5;
        }
        else if (type == typeid(Derived<float>)) {
            total += 7;
        }
        else if (type == typeid(Derived<char>)) {
            total += 2;
        }
        else if (type == typeid(Derived<unsigned long>)) {
            total += 9;
        }
    }
    return total;
}

static int test3(Base *bp)
{
    int total = 0;
    for (int iter = 0; iter != count; ++iter) {
        int id = bp->id();
        switch (id) {
            case 1: total += 5; break;
            case 2: total += 7; break;
            case 3: total += 2; break;
            case 4: total += 9; break;
        }
    }
    return total;
}
Without optimization, I got these runtimes:
test1: 2.277s
test2: 0.629s
test3: 0.469s
With optimization -O2, I got these runtimes:
test1: 0.118s
test2: 0.220s
test3: 0.290s
So it appears that dynamic_cast is the fastest method when using optimization with this compiler.
In almost all cases you don't want the exact type, but you want to make sure that it's of the given type or any type derived from it. If an object of a type derived from it cannot be substituted for an object of the type in question, then you are violating the Liskov Substitution Principle which is one of the most fundamental rules of proper OO design.
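To make the substitutability point concrete, a small sketch (types invented for illustration): an exact typeid check rejects further-derived objects, while dynamic_cast accepts them, which is usually what proper OO design wants:

```cpp
#include <typeinfo>

struct Shape { virtual ~Shape() = default; };
struct Circle : Shape {};
struct FilledCircle : Circle {};   // should be usable wherever a Circle is

// exact-type test: fails for subtypes of Circle
bool is_circle_typeid(const Shape& s)  { return typeid(s) == typeid(Circle); }

// is-a test: succeeds for Circle and anything derived from it
bool is_circle_dyncast(const Shape& s) { return dynamic_cast<const Circle*>(&s) != nullptr; }
```

If code dispatches on the exact type, adding FilledCircle later silently falls through every branch, which is the Liskov violation in practice.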
I would like to load an int value defined as a class member into the eax cpu register, something like this:
class x
{
    int a;
    x() { a = 5; }
    void do_stuff()
    {
        __asm
        {
            mov eax, a; // move the variable's content into the register
        };
    }
};
how can I do that with any variable in any class? thanks...
If you are using gcc or icc, this does exactly what you want:
class x
{
    int a;
    x() { a = 5; }
    void do_stuff()
    {
        asm("mov %0, %%eax; // move the variable's content into eax \n"
            :
            : "r" (a)
            : "%eax"
        );
    }
};
An explanation of register constraints and clobber lists is at http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html if you are still interested. On MS VC I would not recommend using inline assembly, because the MS compiler is not good at mixing inline and generated assembly and produces a lot of overhead.
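As a portability-minded illustration of the constraint syntax (a sketch of mine, not from the answer above): an empty asm template with an output constraint and a matching input constraint forces the value through a register without naming any architecture-specific instruction, so it compiles on any target gcc or clang supports:

```cpp
// "=r" asks for an output register; the "0" constraint ties the input
// to that same register, so the compiler materializes `a` in a register
// even though the asm template itself is empty.
int via_register(int a) {
    int out;
    asm("" : "=r"(out) : "0"(a));
    return out;
}
```

This is also a handy trick for defeating constant folding in micro-benchmarks, since the compiler must assume the asm may have changed the value.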
The best way to load a class variable into a register is by letting the compiler do it for you.
Since the compiler always has to assume the worst when it comes to reading things from memory, it might be forced to read the value of a class member multiple times, even though you know that the value has not changed.
For example, in the code below, in the function do_stuff, the compiler cannot assume that the value of mX is unchanged over the call to foo(). Hence, it is not allowed to keep it in a processor register across that call.
int glob;
int glob2;
void foo();

class Test
{
public:
    Test(int x) : mX(x) { }
    void do_stuff();
private:
    int mX;
};

void Test::do_stuff()
{
    glob = mX;
    foo();
    glob2 = mX;
}
On the other hand, in the following case, the source code is written so that tmp can't change over the function call, so the compiler is safe to place it in a register.
void Test::do_stuff()
{
    int tmp = mX;
    glob = tmp;
    foo();
    glob2 = tmp;
}
When it comes to inline assembler, I would strongly discourage you from using it. With C++ the situation is even worse than with C, as the underlying representation of a class object is not as straightforward as that of a C struct (which must be laid out in memory exactly as declared). This means a slight change to the class or a migration to a newer compiler could break your code.
You need to get the address of the instance of class "x" you're working with (mov eax, this), and the offset of "a" relative to the start of "x" (the offsetof macro).
It's a lot easier if you just add "a" or "&a" as a parameter of the do_stuff() function.
I wanted to know how fast a single-inheritance virtual function call is compared to an equivalent boost::function call. Are they almost the same in performance, or is boost::function slower?
I'm aware that performance may vary from case to case, but, as a general rule, which is faster, and to a how large degree is that so?
Thanks,
Guilherme
-- edit
KennyTM's test was sufficiently convincing for me. boost::function doesn't seem to be that much slower than a vcall for my own purposes. Thanks.
As a very special case, consider calling an empty function 10⁹ times.
Code A:
struct X {
    virtual ~X() {}
    virtual void do_x() {}
};

struct Y : public X {}; // for the paranoid.

int main() {
    Y* x = new Y;
    for (int i = 100000000; i >= 0; --i)
        x->do_x();
    delete x;
    return 0;
}
Code B: (with boost 1.41):
#include <boost/function.hpp>

struct X {
    void do_x() {}
};

int main() {
    X* x = new X;
    boost::function<void (X*)> f;
    f = &X::do_x;
    for (int i = 100000000; i >= 0; --i)
        f(x);
    delete x;
    return 0;
}
Compile with g++ -O3, then time with time,
Code A takes 0.30 seconds.
Code B takes 0.54 seconds.
Inspecting the assembly code, it seems that the slowness may be due to exception handling and to checking the possibility that f is NULL. But given that the price of one boost::function call is only 2.4 nanoseconds (on my 2 GHz machine), the actual code in your do_x() could easily overshadow this. I would say it's not a reason to avoid boost::function.
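For what it's worth, the same experiment translates directly to today's std::function (this sketch is mine, not part of the original answer); the cost considerations are similar, since std::function also checks for emptiness before dispatching through its stored target:

```cpp
#include <functional>

struct X {
    int calls = 0;
    void do_x() { ++calls; }
};

// same shape as Code B, with std::function instead of boost::function
int run(int iters) {
    X x;
    std::function<void(X*)> f = &X::do_x;   // member pointer wrapped in a callable
    for (int i = 0; i < iters; ++i)
        f(&x);                              // throws std::bad_function_call if f were empty
    return x.calls;
}
```

The counter returned by run doubles as a check that every call went through the wrapper.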