How not to repeat myself without macros when writing similar CUDA kernels? - c++

I have several CUDA kernels which basically do the same thing with some variations. What I would like to do is reduce the amount of code needed. My first thought was to use macros, so my resulting kernels would look like this (simplified):
__global__ void kernelA( ... )
{
INIT(); // macro to initialize variables
// do specific stuff for kernelA
b = a + c;
END(); // macro to write back the result
}
__global__ void kernelB( ... )
{
INIT(); // macro to initialize variables
// do specific stuff for kernelB
b = a - c;
END(); // macro to write back the result
}
...
Since macros are nasty, ugly and evil I am looking for a better and cleaner way. Any suggestions?
(A switch statement would not do the job: In reality, the parts which are the same and the parts which are kernel specific are pretty interweaved. Several switch statements would be needed which would make the code pretty unreadable. Furthermore, function calls would not initialize the needed variables. )
(This question might be answerable for general C++ as well, just replace all 'CUDA kernel' with 'function' and remove '__global__' )

Updated: I was told in the comments that classes and inheritance don't mix well with CUDA. Therefore only the first part of this answer applies to CUDA, while the rest answers the more general C++ part of your question.
For CUDA, you will have to use pure functions, "C-style":
struct KernelVars {
int a;
int b;
int c;
};
__device__ void init(KernelVars& vars) {
INIT(); //whatever the actual code is
}
__device__ void end(KernelVars& vars) {
END(); //whatever the actual code is
}
__global__ void KernelA(...) {
KernelVars vars;
init(vars);
vars.b = vars.a + vars.c;
end(vars);
}
This is the answer for general C++, where you would use OOP techniques like constructors and destructors (they are perfectly suited for those init/end pairs), or the template method pattern which can be used with other languages as well:
Using ctor/dtor and templates, "C++ Style":
class KernelBase {
protected:
int a, b, c;
public:
KernelBase() {
INIT(); //replace by the contents of that macro
}
~KernelBase() {
END(); //replace by the contents of that macro
}
virtual void run() = 0;
};
struct KernelAdd : KernelBase {
void run() { b = a + c; }
};
struct KernelSub : KernelBase {
void run() { b = a - c; }
};
template<class K>
void kernel(...)
{
K k;
k.run();
}
void kernelA( ... ) { kernel<KernelAdd>(); }
Using template method pattern, general "OOP style"
class KernelBase {
virtual void do_run() = 0;
protected:
int a, b, c;
public:
void run() { //the template method
INIT();
do_run();
END();
}
};
struct KernelAdd : KernelBase {
void do_run() { b = a + c; }
};
struct KernelSub : KernelBase {
void do_run() { b = a - c; }
};
void kernelA(...)
{
KernelAdd k;
k.run();
}

You can use device functions as "INIT()" and "END()" alternative.
__device__ int init()
{
return threadIdx.x + blockIdx.x * blockDim.x;
}
Another alternative is to use function templates:
#define ADD 1
#define SUB 2
template <int __op__> __global__ void calculate(float* a, float* b, float* c)
{
// init code ...
switch (__op__)
{
case ADD:
c[id] = a[id] + b[id];
break;
case SUB:
c[id] = a[id] - b[id];
break;
}
// end code ...
}
and invoke them using:
calculate<ADD><<<...>>>(a, b, c);
The CUDA compiler does the work: it builds a separate version of the function for each template argument and removes the dead code paths as a performance optimization.
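A variation on the same idea is to pass the operation as a type instead of an integer, which removes the switch entirely. The following is only a host-side C++ sketch of that technique: AddOp, SubOp and calculate_host are hypothetical names, and the loop stands in for the per-thread index. In a real kernel the apply() functions would presumably be marked __device__ and the function template __global__.

```cpp
#include <cstddef>

// Hypothetical operation policies (not from the original post).
// Each one carries the single kernel-specific line as a static function.
struct AddOp { static float apply(float x, float y) { return x + y; } };
struct SubOp { static float apply(float x, float y) { return x - y; } };

// Host-side stand-in for the kernel: the shared init/end code is written once,
// and the operation is selected at compile time through the template argument.
template <typename Op>
void calculate_host(const float* a, const float* b, float* c, std::size_t n) {
    // init code would go here
    for (std::size_t id = 0; id < n; ++id) {
        c[id] = Op::apply(a[id], b[id]);  // the only part that varies
    }
    // end code would go here
}
```

It would be invoked as calculate_host<AddOp>(a, b, c, n). Adding a new variant then means adding one small struct rather than another case label.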

Related

Access member variables based on ENUM (no runtime decision)

My goal is to speed up my code by removing the runtime decision making of if-statements. Here's a simple example:
enum E
{
E_A,
E_B
};
class Example
{
public:
void DoSomething(E var)
{
if (var == E_A) {
// Do stuff with a
} else if (var == E_B) {
// Do stuff with b
}
}
private:
set<int> a;
set<int> b;
};
I've split the above DoSomething() function into 2.
void DoSomething(E::E_A var) {
// Do stuff with a
}
void DoSomething(E::E_B var) {
// Do stuff with b
}
The problem is I have to rewrite the same code twice... We could use an inbetween function:
void DoSomething(E_A var) {
InBetween(a);
}
void DoSomething(E_B var) {
InBetween(b);
}
void InBetween(set<int> s)
{
// Do something with s
}
set<int> a;
set<int> b;
However, I was wondering if there's a way to achieve what I want with just a single DoSomething()? I want the code to have minimal runtime, so please avoid solutions like storing a and b inside a map and looking them up by key.
Or please suggest some areas to look into. Thanks in advance!
You can make DoSomething a template, with var as a template parameter. Then you can apply constexpr if (since C++17), which performs the dispatch at compile time. E.g.
template <E var>
void DoSomething()
{
if constexpr (var == E_A) {
// Do stuff with a
} else if constexpr (var == E_B) {
// Do stuff with b
}
}
Then call it as
a_Example.DoSomething<E::E_A>();
a_Example.DoSomething<E::E_B>();
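To avoid writing the shared body twice, the compile-time branch can be confined to a small selector that returns the right member. This is just a sketch assuming a C++17 compiler; Select() and the int parameter of DoSomething() are made up for illustration and are not part of the question.

```cpp
#include <set>

enum E { E_A, E_B };

class Example {
public:
    // Hypothetical helper: picks the member at compile time.
    // Only the taken branch is instantiated for a given template argument.
    template <E var>
    std::set<int>& Select() {
        if constexpr (var == E_A) {
            return a;
        } else {
            return b;
        }
    }

    // The shared code is written once and works on whichever set was selected.
    template <E var>
    void DoSomething(int value) {
        Select<var>().insert(value);
    }

private:
    std::set<int> a;
    std::set<int> b;
};
```

A call such as ex.DoSomething<E_A>(1) then touches only a, with no runtime check on the enum value.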

C++ calling a vector of classes' method in another class

class C {
public: void c_set(int x){ a = x; }
private: int a;
}
;
class U {
public: void load();
c_loader(int i, int x){ c[i].c_set(x); };
private: vector<C> c(20);
}
;
void U::load() {
int x;
cin >> x >> i;
c_loader(i, x)
}
I'm really confused by this one. I need to call a member function from another one, but my problem is that the inner member is a vector of that class. My code is supposed to work but the result is a segfault. Presume that the function cget has a definition.
The question is a bit unclear but try this to prevent segfault.
class C {
public: void cget(int a);
private: int a;
};
class U {
public: void load(unsigned x, int a);
vector<C> c; // Note: c is made public in order to add elements from main
};
void U::load(unsigned x, int a) {
if (x < c.size()) // Check the size of c _before_ access
{
c[x].cget(a);
}
}
int main()
{
U u;
C c;
u.c.push_back(c);
u.load(0, 3); // Will end up calling cget
u.load(1, 3); // Will just return without calling cget
}
EDIT:
Just want to mention that the code in the question has changed a lot since my answer. That explains why my code looks quite different ;-)
In any case, the answer is still: Check the size of c before accessing it.
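For what it's worth, here is a sketch of how the size check could look with the names from the current version of the question. It assumes the vector is meant to hold 20 elements, sizes it in the constructor (a member cannot be declared as vector<C> c(20); inside the class body), and adds two hypothetical accessors, c_get() and value_at(), purely so the result can be inspected.

```cpp
#include <cstddef>
#include <vector>

class C {
public:
    C() : a(0) {}
    void c_set(int x) { a = x; }
    int c_get() const { return a; }  // hypothetical accessor for inspection
private:
    int a;
};

class U {
public:
    U() : c(20) {}  // size the member vector in the constructor initializer list

    // Returns false instead of writing out of bounds when i is not a valid index.
    bool c_loader(std::size_t i, int x) {
        if (i >= c.size()) {
            return false;
        }
        c[i].c_set(x);
        return true;
    }

    // Hypothetical helper; at() throws std::out_of_range on a bad index.
    int value_at(std::size_t i) const { return c.at(i).c_get(); }

private:
    std::vector<C> c;
};
```

With this, u.c_loader(3, 42) succeeds, while u.c_loader(20, 1) returns false rather than touching memory past the end of the vector.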

Pointer to function from a member's function

I have a somewhat complex situation. It could probably be solved easily with inheritance, but I am curious, and for some other reasons I would like to solve it this way.
I have a class which represents an algorithm, and over time a different solution has been implemented. Currently, I have the original class, and the new one is a member of it.
I would like to keep both algorithms and have a switch so I can use either one depending on the situation.
#include "B.h"
class A {
public:
typedef void ( A::*FooFunction )( float, float );
A::FooFunction m_fooFunction;
B m_b;
A( WhichUseEnum algorithm ) : m_b( B() )
{
switch( algorithm ) {
case ALG_A:
m_fooFunction = &A::FooFunctionOfA;
break;
case ALG_B:
m_fooFunction = ??? // something like: m_b.FooFunctionOfA
break;
}
}
void FooFunction( float a , float b)
{
( this->*m_fooFunction )( a, b );
}
void FooFunctionOfA( float, float ); // implementation at the .cpp
};
class B {
public:
void FooFunctionOfB( float, float );
}
As you can see, I want to store a pointer to a member function of m_b and call it the way FooFunction does. With the class's own function ( FooFunctionOfA() ) I was already successful, but the other one is much harder. I tried several ideas, but I could not find a version that the compiler accepted. :)
I found a similar question where the solution looked like this: &m_b.*m_b.FooFunctionOfB and at this point I gave up.
If anybody has an idea, please do not hesitate to share it with me.
I am using C++ but not C++0x AND I am forced to avoid stl and boost.
You need to use std::tr1::function. This class is built for the purpose you need. It can accept any function, member function, etc.
class A {
public:
std::tr1::function<void(float, float)> m_fooFunction;
B m_b;
A( WhichUseEnum algorithm ) : m_b( B() )
{
switch( algorithm ) {
case ALG_A:
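m_fooFunction = std::tr1::bind(&A::FooFunctionOfA, this, std::tr1::placeholders::_1, std::tr1::placeholders::_2);
break;
case ALG_B:
m_fooFunction = std::tr1::bind(&B::FooFunctionOfB, &m_b, std::tr1::placeholders::_1, std::tr1::placeholders::_2);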
break;
}
}
void FooFunction( float a , float b)
{
m_fooFunction( a, b );
}
void FooFunctionOfA( float, float ); // implementation at the .cpp
};
class B {
public:
void FooFunctionOfB( float, float );
}
Here I have used std::tr1::bind to define the two functions. As you can see, the calling syntax is much easier too, just like a regular function call. std::tr1::bind can bind a lot more than just member functions and member function pointers. Gah, it's been a while since I had to use bind instead of lambdas.
The general rule of C++ is that if you're using function pointers or member function pointers and you're not interfacing to some old code then you're almost certainly doing it wrong. This is no exception. If you're pre-C++0x then you may need to bind them too but that's about it.
If you're using a compiler so old, it doesn't even have TR1, you can use Boost to substitute these facilities- they were Standardised from Boost so the Boost equivalent is very close in functionality and interface.
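For readers who do have a C++11 compiler (which the asker said they do not), the same idea can be written with std::function and std::bind from the standard library. This is only a sketch under that assumption. The function bodies and the public last members are invented so the effect of each call is observable.

```cpp
#include <functional>

enum WhichUseEnum { ALG_A, ALG_B };

class B {
public:
    B() : last(0.0f) {}
    void FooFunctionOfB(float a, float b) { last = a * b; }  // placeholder body
    float last;  // hypothetical: records the result for inspection
};

class A {
public:
    explicit A(WhichUseEnum algorithm) : last(0.0f) {
        using namespace std::placeholders;
        switch (algorithm) {
        case ALG_A:
            // Bind to this object's own member function.
            m_fooFunction = std::bind(&A::FooFunctionOfA, this, _1, _2);
            break;
        case ALG_B:
            // Bind to the member function of the contained B.
            m_fooFunction = std::bind(&B::FooFunctionOfB, &m_b, _1, _2);
            break;
        }
    }

    // The bound callable stores raw pointers into this object,
    // so copying would leave the copy calling into the original.
    A(const A&) = delete;
    A& operator=(const A&) = delete;

    void FooFunction(float a, float b) { m_fooFunction(a, b); }
    void FooFunctionOfA(float a, float b) { last = a + b; }  // placeholder body

    float last;  // hypothetical: records the result for inspection
    B m_b;

private:
    std::function<void(float, float)> m_fooFunction;
};
```

Constructing A with ALG_B and calling FooFunction(2, 3) ends up in m_b.FooFunctionOfB, while the caller sees the same FooFunction interface either way.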
You can provide a wrapper function that does the job.
#include "B.h"
class A {
public:
typedef void ( A::*FooFunction )( float, float );
A::FooFunction m_fooFunction;
B m_b;
A( WhichUseEnum algorithm ) : m_b( B() )
{
switch( algorithm ) {
case ALG_A:
m_fooFunction = &A::FooFunctionOfA;
break;
case ALG_B:
m_fooFunction = &A::FooFunctionOfB;
break;
}
}
void FooFunction( float a , float b)
{
( this->*m_fooFunction )( a, b );
}
void FooFunctionOfA( float, float ); // implementation at the .cpp
// A wrapper function that redirects the call to B::fooFunctionOfB().
void FooFunctionOfB( float a, float b)
{
this->m_b.FooFunctionOfB(a, b);
}
};
class B {
public:
void FooFunctionOfB( float, float );
}
What you should have done:
(1) Extract Interface on the original algorithm. Now the original algorithm is an implementation of this virtual interface.
class FooFunctionInterface
{
public:
virtual ~FooFunctionInterface() {} // needed so deleting through a base pointer is well-defined
virtual void Foo(float a, float b) = 0;
};
class OriginalFoo : public FooFunctionInterface
{
public:
void Foo(float a, float b) override
{
/* ... original implementation ... */
}
};
(2) Introduce the new algorithm as an alternate implementation of the interface.
class NewFoo : public FooFunctionInterface
{
public:
void Foo(float a, float b) override
{
/* ... new implementation ... */
}
};
(3) Introduce a factory function for selecting which implementation to use.
class NullFoo : public FooFunctionInterface
{
public:
void Foo(float a, float b) override {}
};
std::unique_ptr<FooFunctionInterface> FooFactory(WhichUseEnum which)
{
std::unique_ptr<FooFunctionInterface> algorithm(new NullFoo());
switch(which)
{
case ALG_A: algorithm.reset(new OriginalFoo()); break;
case ALG_B: algorithm.reset(new NewFoo()); break;
}
return algorithm;
}
Then your class A becomes an instance of the pimpl idiom, forwarding calls to the appropriate implementation.
class A
{
public:
A(WhichUseEnum which)
: pimpl_(FooFactory(which))
{
}
void Foo(float a, float b)
{
pimpl_->Foo(a, b);
}
private:
std::unique_ptr<FooFunctionInterface> pimpl_;
};
This is a hugely cleaner approach to the mess you've made. You know it's cleaner when you consider what will happen when you need to add a third implementation, then a fourth.
In my example, you extend the factory function and move on with your life. No other code changes.

C++ fast ways to bind member functions

I have a class transition with a member function rate. I am asking for a method that would allow me to insert custom-designed rates into instances of transition after those instances have been created, and that would be fast at run time!
I would like to optimize the code for speed. rate does simple computations but is called very frequently and many times by the program. So I guess I should avoid virtual functions... Question: what are the other best methods to achieve this in C++ (templates,boost,...)? Comments about the speed of a particular method would be appreciated. Thanks!
class transition {
public:
transition() : vec() {}
double rate(T_vec::iterator a) { return ...; }
private:
T_vec vec;
};
/* custom rate #1 */
double my_rate_1( T_vec::iterator) {
/* do something */
return ans;
}
/* custom rate #2 */
double my_rate_2( T_vec::iterator) {
/* do something */
return ans;
}
const int N=10;
int main (void) {
vector<transition*> ts(N);
for(int i=0;i!=N;++i) ts[i] = new transition;
/* How to efficiently implement the pseudo code that follows? */
ts[0]->rate = my_rate_1;
ts[1]->rate = my_rate_2;
/* ... */
}
There are at least four ways to implement this.
Option 1 is virtual methods. You can't bind the method after you create the instance, but after the creation you can treat all the derived classes as transition.
class transition {
...
virtual double rate(T_vec::iterator a) = 0;
};
class my_transition_1 : public transition {
...
double rate(T_vec::iterator a) { ... }
};
class my_transition_2 : public transition {
...
double rate(T_vec::iterator a) { ... }
};
Option 2 is callbacks. You can change the method at runtime, after you created the object. It's the most dynamic. It has slightly higher overhead in this case, because there is an extra copy construction of the iterator, and it is harder for the compiler to optimize away the indirect call.
class transition {
public:
....
typedef double (*RateFunction)(T_vec::iterator a);
void set_rate(RateFunction r) { _fun = r; }
double rate(T_vec::iterator a) { return (*_fun)(a); }
private:
RateFunction _fun;
};
double my_rate_1(T_vec::iterator a) {
...
}
...
transition t;
t.set_rate(my_rate_1);
Option 3 is functor templates. You have to specify everything at construction time, but this avoids the indirect call and has the best performance.
template <typename Rate>
class transition {
public:
double rate(T_vec::iterator a) {
return Rate()(a);
}
};
class my_rate_1 {
public:
double operator()(T_vec::iterator a) {
....
}
};
class my_rate_2 {
public:
double operator()(T_vec::iterator a) {
....
}
};
transition<my_rate_1> t1;
transition<my_rate_2> t2;
Option 4 is not extensible, but you avoid the indirect function call and have the opportunity to set the rate after creating the object.
class transition {
public:
enum RateCode {
RATE_1,
RATE_2,
...
};
double rate(T_vec::iterator i) {
switch (_rate_code) {
case RATE_1: {
...
return result;
}
case RATE_2: {
...
return result;
}
default:
assert(false);
}
}
void setRate(RateCode r) { _rate_code = r; }
private:
RateCode _rate_code;
};
If you want to bind to arbitrary functions, check the FastDelegate article. There is also an article of a more portable implementation of the delegate idea.
If you can arrange your code such that the specific instance is known at compile time, this will be faster, assuming the compiler does its job well. The reason why it is faster is that a true delegate implies a call through a function pointer, which hinders inlining and can disrupt the speculative execution and pipelining in today's CPUs.
You might also want to read up on C++11. In C++11, lambda functions (inline written functions that can be passed around) are an important extension, and I would expect compilers to work hard to optimize them.
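As a rough illustration of how lambdas combine with the compile-time approach, the rate can be stored as a template parameter so each transition carries its own callable and no function pointer is involved. This sketch assumes C++11 and assumes T_vec is a std::vector<double>, which the question does not actually state. make_transition and demo_rates are hypothetical helpers.

```cpp
#include <vector>

typedef std::vector<double> T_vec;  // assumption: the question never defines T_vec

// The rate callable is part of the type, so the call can be inlined.
template <typename Rate>
class transition {
public:
    explicit transition(Rate r) : rate_(r) {}
    double rate(T_vec::const_iterator a) const { return rate_(a); }
private:
    Rate rate_;
};

// Hypothetical helper that deduces the lambda's unnamed type.
template <typename Rate>
transition<Rate> make_transition(Rate r) {
    return transition<Rate>(r);
}

// Small demonstration: two transitions with different custom rates.
inline double demo_rates() {
    T_vec v;
    v.push_back(1.5);
    v.push_back(4.0);
    auto doubled = make_transition([](T_vec::const_iterator it) { return 2.0 * *it; });
    auto shifted = make_transition([](T_vec::const_iterator it) { return *it + 1.0; });
    return doubled.rate(v.begin()) + shifted.rate(v.begin() + 1);
}
```

The trade-off is the one described above: each lambda gives a distinct transition type, so such objects cannot share one container without some form of type erasure, which brings the indirect call back.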

raw function pointer from a bound method

I need to bind a method into a function-callback, except this snippet is not legal as discussed in demote-boostfunction-to-a-plain-function-pointer.
What's the simplest way to get this behavior?
struct C {
void m(int x) {
(void) x;
_asm int 3;
}};
typedef void (*cb_t)(int);
int main() {
C c;
boost::function<void (int x)> cb = boost::bind(&C::m, &c, _1);
cb_t raw_cb = *cb.target<cb_t>(); //null dereference
raw_cb(1);
return 0;
}
You can make your own class to do the same thing as the boost bind function. All the class has to do is accept the function type and a pointer to the object that contains the function. For example, this is a void return and void param delegate:
template<typename owner>
class VoidDelegate : public IDelegate
{
public:
VoidDelegate(void (owner::*aFunc)(void), owner* aOwner)
{
mFunction = aFunc;
mOwner = aOwner;
}
~VoidDelegate(void)
{}
void Invoke(void)
{
if(mFunction != 0)
{
(mOwner->*mFunction)();
}
}
private:
void (owner::*mFunction)(void);
owner* mOwner;
};
Usage:
class C
{
public:
void CallMe(void)
{
std::cout << "called";
}
};
int main(int aArgc, char** aArgv)
{
C c;
VoidDelegate<C> delegate(&C::CallMe, &c);
delegate.Invoke();
}
Now, since VoidDelegate<C> is a type, having a collection of these might not be practical, because what if the list was to contain functions of class B too? It couldn't.
This is where polymorphism comes into play. You can create an interface IDelegate, which has a function Invoke:
class IDelegate
{
public:
virtual ~IDelegate(void) { }
virtual void Invoke(void) = 0;
};
If VoidDelegate<T> implements IDelegate you could have a collection of IDelegates and therefore have callbacks to methods in different class types.
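Put together, a mixed collection could look roughly like this. Counter, Flag and invoke_all are hypothetical names used only for the illustration, and the delegates are held by raw pointer with their lifetime left to the caller.

```cpp
#include <cstddef>
#include <vector>

class IDelegate {
public:
    virtual ~IDelegate() {}
    virtual void Invoke() = 0;
};

template <typename Owner>
class VoidDelegate : public IDelegate {
public:
    VoidDelegate(void (Owner::*aFunc)(), Owner* aOwner)
        : mFunction(aFunc), mOwner(aOwner) {}

    void Invoke() {
        if (mFunction != 0 && mOwner != 0) {
            (mOwner->*mFunction)();
        }
    }

private:
    void (Owner::*mFunction)();
    Owner* mOwner;
};

// Two unrelated hypothetical owner types.
struct Counter {
    Counter() : hits(0) {}
    void Bump() { ++hits; }
    int hits;
};

struct Flag {
    Flag() : set(false) {}
    void Raise() { set = true; }
    bool set;
};

// Calls every delegate through the common interface, whatever its owner type.
inline void invoke_all(const std::vector<IDelegate*>& delegates) {
    for (std::size_t i = 0; i < delegates.size(); ++i) {
        delegates[i]->Invoke();
    }
}
```

A std::vector<IDelegate*> can then hold a VoidDelegate<Counter> next to a VoidDelegate<Flag>, and invoke_all() reaches both.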
Either you can shove that bound parameter into a global variable and create a static function that can pick up the value and call the function on it, or you're going to have to generate per-instance functions on the fly - this will involve some kind of on the fly code-gen to generate a stub function on the heap that has a static local variable set to the value you want, and then calls the function on it.
The first way is simple and easy to understand, but not at all thread-safe or reentrant. The second version is messy and difficult, but thread-safe and reentrant if done right.
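The first way might be sketched like this. g_target, trampoline and bind_to are hypothetical names, C::m records its argument only so the call is observable, and the single global slot is precisely what makes this neither thread-safe nor reentrant.

```cpp
typedef void (*cb_t)(int);

struct C {
    C() : last(0) {}
    void m(int x) { last = x; }  // placeholder body: remembers the argument
    int last;
};

namespace {
// The "bound" object lives in a global, so only one binding exists at a time.
C* g_target = 0;

// Plain function with the raw callback signature; forwards to the stored object.
void trampoline(int x) {
    if (g_target != 0) {
        g_target->m(x);
    }
}
}  // namespace

// Stores the target and hands back an ordinary function pointer.
inline cb_t bind_to(C& c) {
    g_target = &c;
    return &trampoline;
}
```

Note that rebinding to a second object silently redirects every callback pointer handed out earlier, since they all point at the same trampoline.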
Edit: I just found out that ATL uses the code generation technique to do exactly this - they generate thunks on the fly that set up the this pointer and other data and then jump to the call back function. Here's a CodeProject article that explains how that works and might give you an idea of how to do it yourself. Particularly look at the last sample (Program 77).
Note that since the article was written, DEP has come into existence, and you'll need to use VirtualAlloc with PAGE_EXECUTE_READWRITE to get a chunk of memory where you can allocate your thunks and execute them.
#include <iostream>
typedef void(*callback_t)(int);
template< typename Class, void (Class::*Method_Pointer)(void) >
void wrapper( int class_pointer )
{
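// Caution: round-tripping a pointer through int only works where sizeof(int) >= sizeof(void*), e.g. 32-bit targets.
Class * const self = (Class*)(void*)class_pointer;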
(self->*Method_Pointer)();
}
class A
{
public:
int m_i;
void callback( )
{ std::cout << "callback: " << m_i << std::endl; }
};
int main()
{
A a = { 10 };
callback_t cb = &wrapper<A,&A::callback>;
cb( (int)(void*)&a);
}
I have it working right now by turning C into a singleton, factoring C::m into C::m_Impl, and declaring a static C::m(int) which forwards to the singleton instance. Talk about a hack.