LTO, Devirtualization, and Virtual Tables - c++

Comparing virtual functions in C++ and virtual tables in C, do compilers in general (and for sufficiently large projects) do as good a job at devirtualization?
Naively, it seems like virtual functions in C++ have slightly more semantics, thus may be easier to devirtualize.
Update: Mooing Duck mentioned inlining devirtualized functions. A quick check shows missed optimizations with virtual tables:
#include <stdio.h>

struct vtab {
    int (*f)();
};

struct obj {
    struct vtab *vtab;
    int data;
};

int f()
{
    return 5;
}

int main()
{
    struct vtab vtab = {f};
    struct obj obj = {&vtab, 10};
    printf("%d\n", obj.vtab->f());
}
My GCC will not inline f, although it is called directly, i.e., devirtualized. The equivalent in C++,
#include <cstdio>

class A
{
public:
    virtual int f() = 0;
};

class B : public A
{
public:
    int f() override { return 5; }
};

int main()
{
    B b;
    printf("%d\n", b.f());
}
even inlines f. So there's a first difference between C and C++, although I don't think that the added semantics in the C++ version are relevant in this case.
Update 2: In order to devirtualize in C, the compiler has to prove that the function pointer in the virtual table has a certain value. In order to devirtualize in C++, the compiler has to prove that the object is an instance of a particular class. The proof would seem to be harder in the first case. However, virtual tables are typically modified in only very few places, and most importantly: just because it looks harder doesn't mean that compilers aren't as good at it (for otherwise you might argue that XORing is generally faster than adding two integers).

The difference is that in C++, the compiler can guarantee that the virtual table address never changes. In C, it's just another pointer, and you could wreak any kind of havoc with it.
However, virtual tables are typically modified in only very few places
The compiler doesn't know that in C. In C++, it can assume that the vtable pointer never changes.

I tried to summarize in http://hubicka.blogspot.ca/2014/01/devirtualization-in-c-part-2-low-level.html why generic optimizations have a hard time devirtualizing. Your testcase gets inlined for me with GCC 4.8.1, but in a slightly less trivial testcase where you pass a pointer to your "object" out of main, it will not.
The reason is that to prove that the virtual table pointer in obj and the virtual table itself did not change, the alias analysis module has to track all possible places that can point to them. In non-trivial code where you pass things outside of the current compilation unit, this is often a lost game.
C++ gives you more information about when the type of an object may change and when it is known. GCC makes use of it, and it will make a lot more use of it in the next release. (I will write on that soon, too.)

Yes, if the compiler can deduce the exact dynamic type of an object, it can "devirtualize" (or even inline!) the call. A compiler can only do this if it can guarantee that, no matter what, this is the function that will be called.
The major concern is basically threading. In the C++ example, the guarantees hold even in a threaded environment. In C, that can't be guaranteed, because the object could be grabbed by another thread/process and overwritten (deliberately or otherwise), so the call is never "devirtualized" or made directly. In C the lookup will always be there.
#include <iostream>

struct A {
    virtual void func() { std::cout << "A"; }
};

struct B : A {
    virtual void func() { std::cout << "B"; }
};

int main() {
    B b;
    b.func(); // this will inline in optimized builds
}

It depends on what you are comparing compiler inlining to. Compared to link time or profile guided or just in time optimizations, compilers have less information to use. With less information, the compile time optimizations will be more conservative (and do less inlining overall).
A compiler will still generally be pretty decent at inlining virtual functions as it is equivalent to inlining function pointer calls (say, when you pass a free function to an STL algorithm function like sort or for_each).

Related

Can C++ compilers optimize away a class?

Let's say I have a class that's something like this:
class View
{
public:
View(DataContainer &c)
: _c(c)
{
}
inline Elem getElemForCoords(double x, double y)
{
int idx = /* some computation here... */;
return _c.data[idx];
}
private:
DataContainer& _c;
};
If I have a function using this class, is the compiler allowed to optimize it away entirely and just inline the data access?
Is the same still true if View::_c happens to be a std::shared_ptr?
If I have a function using this class, is the compiler allowed to
optimize it away entirely and just inline the data access?
Is the same still true if View::_c happens to be a std::shared_ptr?
Absolutely, yes, and yes; as long as it doesn't violate the as-if rule (as already pointed out by Pentadecagon). Whether this optimization really happens is a much more interesting question; it is allowed by the standard. For this code:
#include <memory>
#include <vector>

template <class DataContainer>
class View {
public:
    View(DataContainer& c) : c(c) { }

    int getElemForCoords(double x, double y) {
        int idx = x*y; // some dumb computation
        return c->at(idx);
    }

private:
    DataContainer& c;
};

template <class DataContainer>
View<DataContainer> make_view(DataContainer& c) {
    return View<DataContainer>(c);
}

int main(int argc, char* argv[]) {
    auto ptr2vec = std::make_shared<std::vector<int>>(2);
    auto view = make_view(ptr2vec);
    return view.getElemForCoords(1, argc);
}
I have verified, by inspecting the assembly code (g++ -std=c++11 -O3 -S -fwhole-program optaway.cpp), that the View class is optimized away as if it were not there; it adds zero overhead.
Some unsolicited advice.
Inspect the assembly code of your programs; you will learn a lot and start worrying about the right things. shared_ptr is a heavy-weight object (compared to, for example, unique_ptr), partly because of all that multi-threading machinery under the hood. If you look at the assembly code, you will worry much more about the overhead of the shared pointer and less about element access. ;)
The inline in your code is just noise, that function is implicitly inline anyway. Please don't trash your code with the inline keyword; the optimizer is free to treat it as whitespace anyway. Use link time optimization instead (-flto with gcc). GCC and Clang are surprisingly smart compilers and generate good code.
Profile your code instead of guessing and doing premature optimization. Perf is a great tool.
Want speed? Measure. (by Howard Hinnant)
In general, compilers don't optimize away classes. Usually, they optimize functions.
The compiler may decide to paste the body of a simple inlined function at each call site, rather than emitting it as a standalone function with its own address. This optimization depends on the compiler's optimization level.
The compiler and linker may decide to drop functions that are not used, whether they be class methods or free standing.
Think of the class as a stencil for describing an object. The stencil isn't any good without an instance. An exception is a public static function within the class (static methods don't require object instances). The class is usually kept in the compiler's dictionary.

Can subclass inline a pure virtual method that is not inline in the base?

As I understand it, the compiler can inline a virtual function call when it knows at compile time what the type of the object will be at runtime (C++ faq).
What happens, however, when one is implementing a pure virtual method from a base class? Do the same rules apply? Will the following function call be inlined?
#include <iostream>
using std::cout;

class base
{
public:
    virtual void print() = 0;

    virtual void callPrint()
    {
        print(); // will this be inline?
    }
};

class child : public base
{
public:
    void print() { cout << "hello\n"; }
};

int main()
{
    child c;
    c.callPrint();
    return 0;
}
EDIT:
I think my original example code was actually a poor representation of what I wanted to ask. I've updated the code, but the question remains the same.
The compiler is never required to inline a function call. In this case, it is permitted to inline the function call, because it knows the concrete type of c (since it is not indirected through a pointer or reference, the compiler can see where it was allocated as a child). As such, the compiler knows which implementation of print() is used, and can choose not to perform vtable indirection, and further choose to inline the implementation of the function.
However, the compiler is also free to not inline it; it might insert a direct call to child::print(), or indirect through the vtable, if it decides to do so.
These optimizations in general boil down to the 'as-if' rule: the compiler must behave as if it were doing a full vtable indirection. The observable result must be the same, but the compiler can choose a different way of achieving it. This includes inlining, etc.
The answer is of course "it depends", but in principle there's no obstruction to optimization. In fact, you're not even doing anything polymorphic here, so this is really straightforward.
The question would be more interesting if you had code like this:
child c;
base & b = c;
b.print();
The point is that the compiler knows at this point what the ultimate target of the dynamic dispatch will be (namely child::print()), so this is eligible for optimization. (There are two separate opportunities for optimization, of course: one from avoiding the dynamic dispatch, and one from having the function body of the target visible in the TU.)
There are only a couple of rules you should be aware of:
1) The compiler is never forced to inline - even using the directive or defining a method in the header.
2) Polymorphism MUST ALWAYS WORK. This means that the compiler will prefer calling the function via the vftable rather than inlining it when the possibility of dynamic calls exists.

using virtual functions versus static_cast from base to derived

I am trying to understand which implementation below is "faster". Assume that one compiles this code with and without the -DVIRTUAL flag.
I assume that compiling without -DVIRTUAL will be faster because:
a] There is no vtable used
b] The compiler might be able to optimize the assembly instructions because it "knows" exactly which call will be made given the various options (there are only a finite number of options).
My question is PURELY related to speed, not pretty code.
a] Am I correct in my analysis above?
b] Will the branch predictor / compiler combination be intelligent enough to optimize for a given branch of the switch statement? See that the "type" is a const int.
c] Are there any other factors that I am missing?
Thanks!
#include <iostream>

class Base
{
public:
    Base(int t) : type(t) {}
    ~Base() {}

    const int type;

#ifdef VIRTUAL
    virtual void fn1()=0;
#else
    void fn2();
#endif
};

class Derived1 : public Base
{
public:
    Derived1() : Base(1) { }
    ~Derived1() {}

    void fn1() { std::cout << "in Derived1()" << std::endl; }
};

class Derived2 : public Base
{
public:
    Derived2() : Base(2) { }
    ~Derived2() { }

    void fn1() { std::cout << "in Derived2()" << std::endl; }
};

#ifndef VIRTUAL
void Base::fn2()
{
    switch(type)
    {
    case 1:
        (static_cast<Derived1* const>(this))->fn1();
        break;
    case 2:
        (static_cast<Derived2* const>(this))->fn1();
        break;
    default:
        break;
    };
}
#endif

int main()
{
    Base *test = new Derived1();
#ifdef VIRTUAL
    test->fn1();
#else
    test->fn2();
#endif
    return 0;
}
I think you misunderstand the VTable. The VTable is simply a jump table (in most implementations, though AFAIK the spec does not guarantee this!). In fact, you could go as far as saying it's a giant switch statement. As such, I'd wager the speed would be exactly the same with both your methods.
If anything, I'd imagine the VTable method would be slightly faster, as the compiler can make better decisions to optimise for cache alignment and so forth...
Have you measured the performance to see if there's even any difference at all?
I suppose not, because then you wouldn't be asking here. It's the only reasonable response though.
Assuming that you are not prematurely micro-optimizing pointlessly, and you have profiled your code and found this to be a problem that needs solving, the best way to figure out the answer to your question is to compile both in release with full optimizations and examine the generated machine code.
It's impossible to answer without specifying compiler and compiler options.
I see no particular reason why your non-virtual code should necessarily be any faster to make the call than the virtual code. In fact, the switch might well be slower than a vtable, since a call using a vtable will load an address and jump to it, whereas the switch will load an integer and do a little bit of thinking. Either one of them could be faster. For obvious reasons, a virtual call is not specified by the standard to be "slower than any other thing you invent to replace it".
I think it's reasonably unlikely that a randomly-chosen compiler will actually inline the call in the virtual case, but it's certainly allowed to (under the as-if rule), since the dynamic type of *test could be determined by data-flow analysis or similar. I think it's reasonably likely that with optimization enabled a randomly-chosen compiler will inline everything in the non-virtual case. But then, you've given a small example with very short functions all in one TU, so inlining is especially easy.
It depends on the platform and the compiler. A switch statement can be implemented as a test and branch or a jump table (i.e., an indirect branch). A virtual function is usually implemented as an indirect branch. If your compiler turns the switch statement into a jump table, the two approaches differ by one additional dereference. If that is the case and this particular usage happens infrequently enough (or thrashes the cache enough) then you might see a difference due to an extra cache miss.
On the other hand, if the switch statement is simply a test and branch, you might see a much bigger performance difference on some in-order CPUs that flush the instruction cache on an indirect branch (or require a high latency between setting the destination of an indirect branch and jumping to it).
If you are really concerned with the overhead of virtual function dispatch, say, for an inner loop over a heterogenous collection of objects, you might want to reconsider where you perform the dynamic dispatch. It doesn't have to be per object; it could also be per known groupings of objects with the same type.
It is not necessarily true that avoiding vtables will be faster - to be sure, you should measure yourself.
Note that:
The static_cast version may introduce a branch (likely not to, if it gets optimized to a jump table),
The vtable version on all implementations I know will result in a jump table.
See a pattern here?
Generally, you'd prefer a constant-time lookup over branching code, so the virtual function method seems to be better.

How is inheritance implemented at the memory level?

Suppose I have
class A { public: void print(){cout<<"A"; }};
class B: public A { public: void print(){cout<<"B"; }};
class C: public A { };
How is inheritance implemented at the memory level?
Does C copy print()'s code into itself, or does it have a pointer to it that points somewhere into A's part of the code?
How does the same thing happen when we override the previous definition, for example in B (at the memory level)?
Compilers are allowed to implement this however they choose. But they generally follow CFront's old implementation.
For classes/objects without inheritance
Consider:
#include <iostream>

class A {
public:
    void foo()
    {
        std::cout << "foo\n";
    }

    static int bar()
    {
        return 42;
    }
};

A a;
a.foo();
A::bar();
The compiler changes those last three lines into something similar to:
struct A a = <compiler-generated constructor>;
A_foo(a);  // the "a" parameter is the "this" pointer; there are no objects as far as
           // assembly code is concerned; member functions (i.e., methods) are
           // simply functions that take a hidden "this" pointer
A_bar();   // since bar() is static, there is no need to pass the "this" pointer
Once upon a time I would have guessed that this was handled with pointers-to-functions in each A object created. However, that approach would mean that every A object would contain identical information (pointer to the same function) which would waste a lot of space. It's easy enough for the compiler to take care of these details.
For classes/objects with non-virtual inheritance
Of course, that wasn't really what you asked. But we can extend this to inheritance, and it's what you'd expect:
class B : public A {
public:
    void blarg()
    {
        // who knows, something goes here
    }

    int bar()
    {
        return 5;
    }
};
B b;
b.blarg();
b.foo();
b.bar();
The compiler turns the last four lines into something like:
struct B b = <compiler-generated constructor>;
B_blarg(b);
A_foo(b.A_portion_of_object);
B_bar(b);
Notes on virtual methods
Things get a little trickier when you talk about virtual methods. In that case, each class gets a class-specific array of pointers-to-functions, one such pointer for each virtual function. This array is called the vtable ("virtual table"), and each object created has a pointer to the relevant vtable. Calls to virtual functions are resolved by looking up the correct function to call in the vtable.
Check out the C++ ABI for any questions regarding the in-memory layout of things. It's labelled "Itanium C++ ABI", but it's become the standard ABI for C++ implemented by most compilers.
I don't think the standard makes any guarantees. Compilers can choose to make multiple copies of functions, combine copies that happen to access the same memory offsets on totally different types, etc. Inlining is just one of the more obvious cases of this.
But most compilers will not generate a copy of the code for A::print to use when called through a C instance. There may be a pointer to A in the compiler's internal symbol table for C, but at runtime you're most likely going to see that:
A a; C c; a.print(); c.print();
has turned into something much along the lines of:
A a;
C c;
ECX = &a; /* set up 'this' pointer */
call A::print;
ECX = up_cast<A*>(&c); /* set up 'this' pointer */
call A::print;
with both call instructions jumping to the exact same address in code memory.
Of course, since you've asked the compiler to inline A::print, the code will most likely be copied to every call site (but since it replaces the call A::print, it's not actually adding much to the program size).
There will not be any information stored in an object to describe a member function.
aobject.print();
bobject.print();
cobject.print();
The compiler will just convert the above statements to direct calls to the function print; essentially nothing is stored in the object.
The pseudo-assembly instruction will look something like this:
00B5A2C3 call print(006de180)
Since print is a member function, there is one additional parameter: the this pointer. It will be passed just like every other argument to the function.
In your example here, there's no copying of anything. Generally an object doesn't know what class it's in at runtime -- what happens is, when the program is compiled, the compiler says "hey, this variable is of type C, let's see if there's a C::print(). No, ok, how about A::print()? Yes? Ok, call that!"
Virtual methods work differently, in that pointers to the right functions are stored in a "vtable"* referenced in the object. That still doesn't matter if you're working directly with a C, because it still follows the steps above. But for pointers, it might say something like "Oh, C::print()? The address is the first entry in the vtable," and the compiler inserts instructions to grab that address at runtime and call to it.
* Technically, this is not required to be true. I'm pretty sure you won't find any mention in the standard of "vtables"; it's by definition implementation-specific. It just happens to be the method the first C++ compilers used, and happens to work better all-around than other methods, so it's the one nearly every C++ compiler in existence uses.

Speeding up virtual function calls in gcc

Profiling my C++ code with gprof, I discovered that a significant portion of my time is spent calling one virtual method over and over. The method itself is short and could probably be inlined if it wasn't virtual.
What are some ways I could speed this up short of rewriting it all to not be virtual?
Are you sure the time is all call-related? Could it be the function itself where the cost is? If this is the case simply inlining things might make the function vanish from your profiler but you won't see much speed-up.
Assuming it really is the overhead of making so many virtual calls there's a limit to what you can do without making things non-virtual.
If the call has early-outs for things like time/flags then I'll often use a two-level approach. The checking is inlined with a non-virtual call, with the class-specific behavior only called if necessary.
E.g.
class Foo
{
public:
    inline void update( void )
    {
        if (can_early_out)
            return;

        updateImpl();
    }

protected:
    virtual void updateImpl( void ) = 0;

    // hypothetical flag checked by the cheap, non-virtual fast path
    bool can_early_out = false;
};
If the virtual calling really is the bottleneck give CRTP a try.
Is the time being spent in the actual function call, or in the function itself?
A virtual function call is noticeably slower than a non-virtual call, because the virtual call requires an extra dereference. (Google for 'vtable' if you want to read all the hairy details.) Update: It turns out the Wikipedia article isn't bad on this.
"Noticeably" here, though, means a couple of instructions. If it's consuming a significant part of the total computation, including time spent in the called function, that sounds like a marvelous place to consider unvirtualizing and inlining.
But in something close to 20 years of C++, I don't think I've ever seen that really happen. I'd love to see the code.
Please be aware that "virtual" and "inline" are not opposites -- a method can be both. The compiler will happily inline a virtual function if it can determine the type of the object at compile time:
#include <cstdlib>
#include <iostream>
using std::cout;
using std::endl;

struct B {
    virtual int f() { return 42; }
};

struct D : public B {
    virtual int f() { return 43; }
};

int main(int argc, char **argv) {
    B b;
    cout << b.f() << endl;   // This call will be inlined

    D d;
    cout << d.f() << endl;   // This call will be inlined

    B& rb = rand() ? b : d;
    cout << rb.f() << endl;  // Must use virtual dispatch (i.e. NOT inlined)
    return 0;
}
[UPDATE: Made certain rb's true dynamic object type cannot be known at compile time -- thanks to MSalters]
If the type of the object can be determined at compile time but the function is not inlineable (e.g. it is large or is defined outside of the class definition), it will be called non-virtually.
It's sometimes instructive to consider how you'd write the code in good old 'C' if you didn't have C++'s syntactic sugar available. Sometimes the answer isn't using an indirect call. See this answer for an example.
You might be able to get a little better performance from the virtual call by changing the calling convention. The old Borland compiler had a __fastcall convention which passed arguments in CPU registers instead of on the stack.
If you're stuck with the virtual call and those few operations really count, then check your compiler documentation for supported calling conventions.
Here is one possible way to do it using RTTI.