CUDA __device__ function as class member: Inlining and performance? - c++

I plan to partition my computation into a fine-grained framework of functions/classes which encapsulate a certain part.
Something like this, but with even more classes and typically longer parameter lists:
class Point{
    Coordinates thisPoint;
    Value getPointValue();
    Point getPoint(Offset offset);
    int getNumNeighbors();
    Point getNeighbor(int i);
    // many more
};
class Operator{
    void doOperation(Point p){
        // calls some of the functions in Point
    }
};
Clearly, this would be good practice in any object-oriented language. But it's intended to run on a CUDA GPU. What I don't know: when I qualify all these fine-grained functions as __device__ and call them in a kernel, how will they be compiled? Will I incur significant overhead for the member function calls, or will they be inlined or otherwise efficiently optimized? Normally, these functions are extremely short but called many, many times.

The GPU compiler will aggressively inline functions for performance reasons. In that case, there should be no particular impact on performance.
If a function cannot be inlined, then the usual performance overhead occurs, involving the creation of a stack frame and a call to the function, just as you would observe for a non-inlined function call on a CPU.
If you have concerns about a specific example, you can create a short test program and look at the generated assembly language (SASS) using cuobjdump -sass myexe to determine whether or not the function was inlined.
There are no general restrictions on inlining of __device__ functions that are class members/methods.
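For illustration, here is a minimal sketch of such a test (the names Point, getDoubled, and kernel are hypothetical, not from the question):
struct Point {
    float value;
    __device__ float getDoubled() const { return 2.0f * value; }
};

__global__ void kernel(const Point* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i].getDoubled();  // expected to be inlined at -O3
}
Compile with nvcc -O3 test.cu -o test, then run cuobjdump -sass test: if no CAL instruction appears at the call site, the member function was inlined.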

Related

C++ Low latency Design: Function Dispatch v/s CRTP for Factory implementation

As part of a system design, we need to implement a factory pattern. In combination with the Factory pattern, we are also using CRTP, to provide a base set of functionality which can then be customized by the Derived classes.
Sample code below:
class FactoryInterface{
public:
    virtual void doX() = 0;
};
//force all derived classes to implement custom_X_impl
template< typename Derived, typename Base = FactoryInterface>
class CRTP : public Base
{
public:
    void doX(){
        // do common processing..... then
        static_cast<Derived*>(this)->custom_X_impl();
    }
};
class Derived: public CRTP<Derived>
{
public:
    void custom_X_impl(){
        //do custom stuff
    }
};
Although this design is convoluted, it does provide a few benefits. All the calls after the initial virtual function call can be inlined. The derived class's custom_X_impl call is also made efficiently. (See the sketch below.)
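For instance, a minimal usage sketch (not part of the original code):
Derived d;
FactoryInterface& obj = d;
obj.doX();  // one virtual dispatch; the custom_X_impl() call inside doX()
            // is resolved statically and can be inlined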
I wrote a comparison program to compare the behavior of a similar implementation (tight loop, repeated calls) using function pointers and virtual functions. This design came out on top for gcc 4.8 with -O2 and -O3.
However, a C++ guru told me yesterday that any virtual function call in a large executing program can take a variable amount of time, considering cache misses, and that I could achieve potentially better performance using C-style function table look-ups and gcc hot-listing of functions. Yet I still see 2x the cost in my sample program mentioned above.
My questions are as below:
1. Is the guru's assertion true? For either answer, are there any links I can refer to?
2. Is there any low-latency implementation I can refer to, in which a base class invokes a custom function in a derived class using function pointers?
3. Any suggestions on improving the design?
Any other feedback is always welcome.
Your guru refers to the hot attribute of the gcc compiler. The effect of this attribute is:
The function is optimized more aggressively and on many targets it is
placed into a special subsection of the text section so all hot
functions appear close together, improving locality.
So yes, in a very large code base, the hotlisted function may remain in cache, ready to be executed without delay, because it avoids cache misses.
You can perfectly use this attribute for member functions:
#include <iostream>
struct X {
    void test() __attribute__ ((hot)) { std::cout << "hello, world!\n"; }
};
But...
When you use virtual functions, the compiler generally generates a vtable that is shared between all objects of the class. This table is a table of pointers to functions. And indeed -- your guru is right -- nothing guarantees that this table remains in cached memory.
But, if you manually create a "C-style" table of function pointers, the problem is EXACTLY THE SAME. While the function may remain in cache, nothing ensures that your function table remains in cache as well.
The main difference between the two approaches is that:
in the case of virtual functions, the compiler knows that the virtual function is a hot spot, and could decide to make sure to keep the vtable in cache as well (I don't know if gcc can do this or if there are plans to do so).
in the case of the manual function pointer table, your compiler will not easily deduce that the table belongs to a hot spot. So this attempt of manual optimization might very well backfire.
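To make the comparison concrete, here is a small sketch (Widget, Handler, and the draw functions are hypothetical names); both styles need one table load plus one indirect call, and neither table is guaranteed to stay in cache:
struct Widget;
typedef void (*Handler)(Widget&);

void drawSquare(Widget&) { /* ... */ }
void drawCircle(Widget&) { /* ... */ }

// "C-style" manual table of function pointers, indexed by a kind tag
Handler handlers[] = { drawSquare, drawCircle };

struct Widget {
    int kind;               // index into handlers[]
    virtual void draw() {}  // compiler-generated vtable dispatch
};

void manualDispatch(Widget& w)  { handlers[w.kind](w); }  // table load + indirect call
void virtualDispatch(Widget& w) { w.draw(); }             // vtable load + indirect call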
My opinion: never try to optimize yourself what a compiler can do much better.
Conclusion
Trust your benchmarks. And trust your OS: if your function or your data is frequently accessed, there is a high chance that a modern OS will take this into account in its virtual memory management, whatever the compiler generates.

optimising function pointers and virtual functions in C++

I know a lot of questions on these topics have been asked before, but I have a specific case where speed is (moderately) important and the speed increase when using function pointers rather than virtual functions is about 25%. I wondered (for mostly academic reasons) why?
To give more detail, I am writing a simulation which consists of a series of Cells. The Cells are connected by Links which dictate the interactions between the Cells. The Link class has a virtual function called update(), which causes the two Cells it is linking to interact. It is virtual so that I can create different types of Links to give different types of interactions. For example, at the moment I am simulating inviscid flow, but I may want a link which has viscosity or applies a boundary condition.
The second way I can achieve the same effect is to pass a function pointer to the Link class and make the target of the function pointer a friend. I can now have a non-virtual update() which uses the function pointer. Derived classes can use pointers to different functions, giving polymorphic behaviour.
When I build the two versions and profile with Very Sleepy, I find that the function pointer version is significantly faster than the virtual function version and that it appears that the Link class has been entirely optimised away - I just see calls from my main function to the functions pointed to.
I just wondered what made it easier for my compiler (MSVC++ 2012 Express) to optimise the function pointer case better than the virtual function case?
Some code for the function pointer case is below, if it helps; I'm sure it is obvious how the equivalent would be done with virtual functions.
class Cell;
class InviscidLink;
class ViscousLink;

void InviscidLinkUpdate( void * linkVoid )
{
    InviscidLink * link = (InviscidLink*)linkVoid;
    //do some stuff with the link
    //e.g.
    //link->param1=
}
void ViscousLinkUpdate( void * linkVoid )
{
    ViscousLink * link = (ViscousLink*)linkVoid;
    //do some stuff with the link
    //e.g.
    //link->param1=
}
class Link
{
public:
    Link(Cell *cell1, Cell *cell2, float area, float timeStep, void (*updateFunc)( void * ))
        : m_update(updateFunc), m_cell1(cell1), m_cell2(cell2), m_area(area), m_timeStep(timeStep)
    {}
    ~Link(){}
    void update() { m_update( this ); }
protected:
    void (*const m_update)( void * );
    Cell *m_cell1;
    Cell *m_cell2;
    float m_area;
    float m_timeStep;
    //some other parameters I want to modify in update()
    float param1;
    float param2;
};
class InviscidLink : public Link
{
    friend void InviscidLinkUpdate( void * linkVoid );
public:
    InviscidLink(Cell *cell1, Cell *cell2, float area, float timeStep)
        : Link(cell1, cell2, area, timeStep, InviscidLinkUpdate)
    {}
};
class ViscousLink : public Link
{
    friend void ViscousLinkUpdate( void * linkVoid );
public:
    ViscousLink(Cell *cell1, Cell *cell2, float area, float timeStep)
        : Link(cell1, cell2, area, timeStep, ViscousLinkUpdate)
    {}
};
edit
I have now put the full source on GitHub - https://github.com/philrosenberg/ung
Compare commit 5ca899d39aa85fa3a86091c1202b2c4cd7147287 (the function pointer version) with commit aff249dbeb5dfdbdaefc8483becef53053c4646f (the virtual function version). Unfortunately, I initially based the test project on a wxWidgets project in case I wanted to play with some graphical display, so if you don't have wxWidgets you will need to hack it into a command-line project to compile it.
I used Very Sleepy to benchmark it
further edit:
milianw's comment about profile guided optimization turned out to be the solution, but as a comment I currently cannot mark it as the answer. Using the pro version of Visual Studio with the profile guided optimization gave similar runtimes as using inline functions. I guess this is the Virtual Call Speculation described at http://msdn.microsoft.com/en-us/library/e7k32f4k.aspx. I still find it a bit odd that this code could be more easily optimized using function pointers instead of virtual functions, but I guess that is why everyone advises to TEST, rather than assume certain code is faster than another.
Two things I can think of that differ between function pointers and virtual functions:
Your objects will be smaller, since they won't carry a pointer to a vtable, and hence more cache friendly.
There's one less indirection with function pointers (with virtual functions: object indirection, vtable indirection, virtual function indirection; with function pointers: object indirection, function pointer indirection -- the update() wrapper itself is resolved at compile time, since it's not virtual).
As requested, here my comment again as an answer:
Try to use profile-guided optimizations here. Then, the profiler can potentially apply devirtualization to speed up the code. Also, don't forget to mark your implementations as final, which can further help. See e.g. http://channel9.msdn.com/Shows/C9-GoingNative/C9GoingNative-12-C-at-BUILD-2012-Inside-Profile-Guided-Optimization or the excellent GCC article series over at http://hubicka.blogspot.de/search/label/devirtualization.
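As a small sketch of the final hint (assumed names, not the poster's code):
struct Base {
    virtual void doX() = 0;
    virtual ~Base() = default;
};

struct Impl final : Base {
    void doX() override { /* custom work */ }
};

void run(Impl& i) {
    i.doX();  // Impl is final, so the compiler can prove no further override
              // exists and may devirtualize (and then inline) this call
}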
The actual cost of a virtual function call is normally insignificant. However, virtual functions may, as you observed, impact the code speed considerably. The main reason is that a virtual function call is normally a real function call, with a frame added to the stack, because virtual calls are resolved at runtime.
If a function is not virtual, it is much easier for the C++ compiler to inline it. The call is resolved at compile time, so the compiler may replace the call with the body of the called function. This allows for much more aggressive optimisations, like performing some computations only once instead of on every loop iteration, etc.
Based on the information provided here, my best guess is that you're operating on a large number of objects, and that the one extra indirection induced by the virtual table is increasing cache misses to the point where the memory traffic to refetch the data becomes measurable.
As another alternative have you considered using templates and either CRTP or a policy-based approach for the Link class? Depending on your needs it may be possible to entirely remove the dynamic dispatching.
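A rough CRTP sketch of what that could look like for Link (illustrative names, not drop-in code):
class Cell;

template <typename Impl>
class LinkBase {
public:
    LinkBase(Cell* c1, Cell* c2) : m_cell1(c1), m_cell2(c2) {}
    void update() { static_cast<Impl*>(this)->updateImpl(); }  // static dispatch
protected:
    Cell* m_cell1;
    Cell* m_cell2;
};

class InviscidLink : public LinkBase<InviscidLink> {
public:
    using LinkBase<InviscidLink>::LinkBase;
    void updateImpl() { /* inviscid interaction between the two cells */ }
};
The trade-off is that differently typed links can no longer be stored in one homogeneous container without some form of type erasure.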

Optimizing away function calls

Is it a conceivable that a C++ compiler would optimize out a function call to a class member function that only sets class variables? Example:
class A
{
private:
    int foo;
public:
    void bar(int foo_in)
    {
        foo = foo_in;
    }
};
So if I did this
A test;
test.bar(5);
could a compiler optimize this to directly access the member and set it like so?
Yes, it is called inlining.
Moreover, C++ is designed specifically to support such optimizations, or to make them easier for the compiler to perform, even in quite complex inheritance and template scenarios.
Some would say this is a quite distinctive feat of C++ as a high-level language compared to others. Its "high-level" features (by which I mostly mean generic programming, i.e. templates) were designed with such optimizations in mind. It is also one of the reasons C++ is considered efficient in terms of performance.
This is also why I would expect any reputable compiler to do a decent job of working out inlines.
From what I've read, this is also the reason why it is hard to get all the fancy features of other high-level languages, such as the reflection mechanisms known from e.g. Java or Python: because C++ is designed to allow pretty much everything possible to be inlined, it is hard to introspect optimized code.
Edit:
Because you said you are writing OpenGL code, where the performance of setters, getters, and such optimizations does matter, I decided to elaborate a bit and show a more interesting example where you can rely on the inlining mechanism.
You can write interfaces that avoid the virtual mechanism by using templates. E.g.:
//This is a stripped-down interface for the matrices of physical objects
//that have a Hamiltonian and to which you can apply an external field
//and a temperature
template< class Object >
class Iface {
protected:
    Object& t;
public:
    Iface(Object& obj) : t(obj) {}
    Vector get_eigen_vals()    { return t.get_eigen_vals(); }
    Matrix get_eigen_vectors() { return t.get_eigen_vectors(); }
    void set_H(VectorD vect)   { t.set_H(vect); }
    void set_temp(double temp) { t.set_temp(temp); }
};
If you declare an interface like this, you can wrap an object in this interface object and pass the interface instance to your functions/algorithms, and still have everything inlined, because it works on a reference to Object. A good compiler optimizes the whole Iface wrapper out.
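A self-contained usage sketch (SpinSystem and the stub types are assumed names, standing in for whatever Vector, Matrix, and VectorD really are; declare the stubs before the Iface template above):
struct Vector {}; struct Matrix {}; struct VectorD {};

struct SpinSystem {  // an assumed concrete Object with the expected methods
    Vector get_eigen_vals()    { return Vector(); }
    Matrix get_eigen_vectors() { return Matrix(); }
    void set_H(VectorD)        {}
    void set_temp(double)      {}
};

int main() {
    SpinSystem sys;
    Iface<SpinSystem> iface(sys);  // the wrapped type is known statically
    iface.set_temp(300.0);         // fully inlinable call to SpinSystem::set_temp
}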
To answer the question a little bit more generally than just inlining:
There is something in the standard known as the as-if rule. It says that the compiler is allowed to make any change to your program as long as it doesn't affect the observable behaviour. There are even exceptions that allow it to change things that technically do change the observable behaviour.
It can elide function calls and even complete classes. It can do basically whatever it wants as long as it doesn't break anything.
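To make that concrete with the class from the question (a sketch of what the optimizer is allowed to do, not guaranteed output):
int main() {
    A test;
    test.bar(5);  // bar() can be inlined to a single store; since nothing
                  // observable reads `test` afterwards, the store and the
                  // object may be removed entirely
    return 0;
}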
Yes, the compiler can optimize this call away.
This is, actually, a very simple case of inlining.
The compiler is allowed to do much more than that (it can unroll loops, optimize out local variables, replace calculations with constants, etc.).

C/C++: Is there a performance penalty for using the adapter pattern?

I am using a high performance/parallel graph library written in C in a C++ project. It provides a struct stinger (the graph data structure) and operations like
int stinger_insert_edge_pair (struct stinger *G,
int64_t type, int64_t from, int64_t to,
double weight, int64_t timestamp) { .... }
Most of the time, however, I do not want to specify timestamps or weights or types. Default parameters would be nice. Also, an OOP-like interface would be nice: G->insertEdge(u, v) instead of insert_edge_pair(G, u, v, ...).
So I was thinking of creating an adapter class looking like
class Graph {
protected:
    stinger* stingerG;
public:
    /** default parameters **/
    static constexpr double defaultEdgeWeight = 1.0;  // static, so it can be used as a default argument
    /** methods **/
    Graph(stinger* stingerG);
    virtual void insertEdge(node u, node v, double weight = defaultEdgeWeight);
};
The method insertEdge(...) simply calls stinger_insert_edge_pair(this->stingerG, ...) with the appropriate parameters.
However, performance is a crucial aspect here. What is the performance penalty associated with using such an adapter class? Should I expect degraded performance compared to using the "naked" library?
If your insertEdge just forwards the call to stinger_insert_edge_pair, there would (most probably) be no difference in the code generated between the plain call to stinger_insert_edge_pair and g->insertEdge (provided you remove the virtual specifier).
Comparing the assembly generated for the plain call and for the adapter call would give you a fair idea of the overhead your adapter brings in.
Does insertEdge have to be virtual? Are you planning to have subclasses of Graph? But again, the cost of a virtual function call is almost negligible compared to the real cost of the function execution itself.
If you use trivial inline methods, the compiler should inline them at the point of the call, so there won't be any performance penalty. Note, however, that you shouldn't use virtual functions for this.
A virtual function usually can't be inlined, so you get the usual overhead of a function call (pushing parameters on the stack, possible disruption to the pipeline and cache, etc.). In practice, a routine function call is very fast--on the order of clock cycles. The only way to know for sure whether this is appropriate is to test on your own application.
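A minimal sketch of such a non-virtual forwarding adapter (int64_t stands in for the node typedef, and type 0 / timestamp 0 are assumed defaults, purely for illustration):
#include <cstdint>

struct stinger;  // opaque handle from the C library
extern "C" int stinger_insert_edge_pair(struct stinger* G, int64_t type,
                                        int64_t from, int64_t to,
                                        double weight, int64_t timestamp);

class Graph {
    stinger* stingerG;
public:
    static constexpr double defaultEdgeWeight = 1.0;
    explicit Graph(stinger* g) : stingerG(g) {}
    // non-virtual and defined inline: the compiler can collapse this
    // wrapper into a direct call to stinger_insert_edge_pair
    void insertEdge(int64_t u, int64_t v, double weight = defaultEdgeWeight) {
        stinger_insert_edge_pair(stingerG, /*type*/ 0, u, v, weight, /*timestamp*/ 0);
    }
};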

Virtual Function Implementation

I keep hearing this statement: switch..case is evil for code maintenance, but it provides better performance (since the compiler can inline things, etc.). Virtual functions are very good for code maintenance, but they incur a performance penalty of two pointer indirections.
Say I have a base class with two subclasses (X and Y) and one virtual function, so there will be two virtual tables. The object has a pointer, based on which it will choose a virtual table. So for the compiler, it is more like:
switch( object's function ptr )
{
case 0x....:
    X->call();
    break;
case 0x....:
    Y->call();
    break;
};
So why should a virtual function cost more if it could be implemented this way, where the compiler can do the same inlining and other optimizations? Or can you explain why it was decided not to implement virtual function execution this way?
Thanks,
Gokul.
The compiler can't do that because of the separate compilation model.
At the time the virtual function call is being compiled, there is no way for the compiler to know for sure how many different subclasses there are.
Consider this code:
// base.h
class base
{
public:
    virtual void doit();
};
and this:
// usebase.cpp
#include "base.h"
void foo(base &b)
{
    b.doit();
}
When the compiler is generating the virtual call in foo, it has no knowledge of which subclasses of base will exist at runtime.
Your question rests on misunderstandings about the way switches and virtual functions work. Rather than fill up this box with a long treatise on code generation, I'll give a few bullet points:
Switch statements aren't necessarily faster than virtual function calls, or inlined. You can learn more about the way that switch statements are turned into assembly here and here.
The thing that is slow about virtual function calls isn't the pointer lookups, it's the indirect branch. For complicated reasons having to do with the internal electronics of the CPU, for most modern processors it is faster to perform a "direct branch", where the destination address is encoded in the instruction, than an "indirect branch", where the address is computed at runtime. Virtual function calls and large switch statements are usually implemented as indirect branches.
In your example above, the switch is completely redundant. Once an object's member function pointer has been computed, the CPU can branch straight to it. Even if the linker were aware of every possible member function that existed in the executable, it would still be unnecessary to add that table lookup.
Here's some results from concrete tests. These particular results are from VC++ 9.0/x64:
Test Description: Time to test a global using a 10-way if/else if statement
CPU Time: 7.70 nanoseconds plus or minus 0.385
Test Description: Time to test a global using a 10-way switch statement
CPU Time: 2.00 nanoseconds plus or minus 0.0999
Test Description: Time to test a global using a 10-way sparse switch statement
CPU Time: 3.41 nanoseconds plus or minus 0.171
Test Description: Time to test a global using a 10-way virtual function class
CPU Time: 2.20 nanoseconds plus or minus 0.110
With sparse cases, the switch statement is substantially slower. With dense cases, the switch statement might be faster, but the switch and the virtual function dispatch overlap a bit, so while the switch is probably faster, the margin is so small we can't even be sure it is faster, not to mention being enough faster to care much about. If the cases in the switch statement are sparse at all, there's no real question that the virtual function call will be faster.
There is no branching in virtual dispatch. The vptr in your class points to a vtable, with a second pointer for the specific function at a constant offset.
Actually, if you have many virtual functions, switch-like branching will be slower than the two pointer indirections. The performance of the current implementation doesn't depend on how many virtual functions you have.
Your statement about branching when calling a virtual function is wrong. There is no such thing in the generated code. Taking a look at the assembly code will give you a better idea.
In a nutshell, one general, simplified implementation of C++ virtual functions is: each class has a virtual table (vtbl), and each instance of the class has a virtual table pointer (vptr). The virtual table is basically a list of function pointers.
When you are calling a virtual function, say it is like:
class Base {
public:
    virtual void someVirtualFunction() {}
};
class Derived : public Base {
public:
    void someVirtualFunction() {}
};
Base* pB = new Derived();
pB->someVirtualFunction();
The 'someVirtualFunction()' will have a corresponding index in the vtbl. And the call
pB->someVirtualFunction();
will be converted to something like:
pB->vptr[k](); //k is the index of the 'someVirtualFunction'.
In this way, the function is called indirectly, which is what gives you polymorphism.
I suggest you read 'Inside the C++ Object Model' by Stanley Lippman.
Also, the statement that a virtual function call is slower than switch-case is not accurate. It depends. As you can see above, a virtual function call is just one extra dereference compared to a regular function call. And with switch-case branching you have extra comparison logic (which introduces the chance of CPU cache misses) that also consumes CPU cycles. I would say in most cases, if not all, a virtual function call should be faster than switch-case.
Saying definitively that switch/case is more or less performant than virtual calls is an over-generalization. The truth is that this will depend on many things, and will vary based on:
what compiler you are using
what optimizations are enabled
the overall characteristics of your program and how they impact those optimizations
If you are optimizing the code in your head as you write it, there is a good chance that you are making the wrong choice. Write the code in a human-readable and/or user-friendly way first, then run the entire executable through profiling tools. If this area of the code shows up as a hotspot, then try it both ways and see which is quantifiably better for your particular case.
This kind of optimization would be possible only with a repatching linker running as part of the C++ runtime.
The C++ runtime is complex enough that even a new DLL load (with COM) can add new function pointers to a vtable (think about pure virtual functions). So neither the compiler nor the linker can perform this optimization.
A switch/case can be faster than an indirect call, since CPU prefetch is deterministic and pipelining is possible, but it will not work out in C++ because of this runtime extension of an object's vtable.