I have a class representing one or several containers of objects. The class offers a function to run a callback for each of the elements. A simple implementation could look like:
#include <cstddef>
#include <functional>

struct MyData {
    Foo* foo;           // contiguous storage of count elements
    std::size_t count;
    void doForAllFoo(std::function<void(Foo)> fct) {
        for (std::size_t i = 0; i < count; ++i) {
            fct(foo[i]);
        }
    }
};
Driving code:
MyData d = MyData(...);
TypeX param1 = create_some_param();
TypeY param2 = create_some_more_param();
d.doForAllFoo([&](Foo f) { my_function(f, param1, param2); });
I think this is a good solution for flexible callbacks on a container.
Now I'd like to parallelize this with CUDA. I'm not quite sure what is allowed with lambdas in CUDA, and I'm also not sure how compilation works for __device__ and __host__ code.
I can (and will probably have to) change MyData, but I'd like to have no trace of the CUDA background in the driving code, except that I have to allocate memory in a CUDA-accessible way, of course.
I think a minimal example would be very helpful.
Before you start writing a C-style CUDA kernel function, you could check the Thrust library. It is part of the CUDA toolkit and provides high-level abstractions for simple GPU algorithm development.
Here is a code example showing the use of function objects and lambda expressions with Thrust:
https://github.com/thrust/thrust/blob/master/examples/lambda.cu
Even with Thrust, you still need to use __device__ and __host__ to ask the compiler to generate device code and host code for you. Since there is no place to put them in a standard C++ lambda expression, you will probably need to write somewhat longer code.
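For completeness, here is a minimal sketch of my own (not from the Thrust example) of what a CUDA-enabled doForAllFoo could look like. It assumes nvcc with the --extended-lambda flag, an NVIDIA extension that does give __device__ a place to go on a lambda:

#include <cstdio>
#include <cuda_runtime.h>

struct Foo { int value; };

// Kernel templated on the callable, so device lambdas and functors both work.
template <typename F>
__global__ void forAllKernel(Foo* foo, size_t n, F fct) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fct(foo[i]);
}

struct MyData {
    Foo* foo = nullptr;   // must be CUDA-accessible (e.g. cudaMallocManaged)
    size_t count = 0;

    template <typename F>
    void doForAllFoo(F fct) {
        unsigned blocks = static_cast<unsigned>((count + 255) / 256);
        forAllKernel<<<blocks, 256>>>(foo, count, fct);
        cudaDeviceSynchronize();
    }
};

int main() {
    MyData d;
    d.count = 4;
    cudaMallocManaged(&d.foo, d.count * sizeof(Foo));
    for (size_t i = 0; i < d.count; ++i) d.foo[i].value = static_cast<int>(i);

    int param = 10;  // captured by value into the device lambda
    d.doForAllFoo([=] __device__ (Foo f) { printf("%d\n", f.value + param); });

    cudaFree(d.foo);
}

The one CUDA trace left in the driving code is the __device__ annotation on the lambda; captures must be by value, since host stack addresses are not valid on the device.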
I have a working CPU-based implementation of a simple deep learning framework where the main components are nodes of a computation graph which can perform computations on tensors.
Now I need to extend my implementation to the GPU. I would like to reuse the existing class structure and only extend its functionality to the GPU; however, I'm not sure if that's even possible.
Most of the classes have methods that work on and return tensors such as:
tensor_ptr get_output();
where tensor_ptr is simply a std::shared_ptr to my tensor class. Now what I would like to do is to add a GPU version of each such method. The idea I had in mind was to define a struct in a separate file tensor_gpu.cuh as follows:
struct cu_shape {
int n_dims;
int x,y,z;
int len;
};
struct cu_tensor {
__device__ float * array;
cu_shape shape;
};
and then the previous function would be mirrored by:
cu_tensor cu_get_output();
The problem seems to be that the .cuh file gets treated as a regular header, is compiled by the default C++ compiler, and gives the error:
error: attribute "device" does not apply here
on the line with the definition of __device__ float * array.
I am aware that you cannot mix CUDA and pure C++ code, so I planned to hide all the CUDA runtime API calls in .cu files, declaring them in .h files. The problem is that I wanted to store the device pointers within my class and then pass them to the CUDA-calling functions.
This way I could still use all of my existing object structure and only modify the initialization and computation parts.
If a regular C++ class cannot touch anything marked __device__, then how can you even integrate CUDA code into C++ code?
Can you only use CUDA runtime calls and keywords literally just in .cu files?
Or is there some smart way to hide from the C++ compiler the fact that it is dealing with CUDA pointers?
Any insight is deeply appreciated!
EDIT: There was a misunderstanding on my part: you don't need the __device__ qualifier on the member at all, and you'll still be able to use it as a pointer to device memory. If you have something valuable to add to good practices on CUDA integration or want to clarify something else, don't hesitate!
Identifiers containing '__' are reserved for implementation purposes. That is why the NVIDIA implementation can use __device__, while a "regular" C++ implementation has its own reserved symbols.
In hindsight NVIDIA could have designed a better solution, but that is not going to help you here.
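To illustrate the corrected pattern from the question's EDIT (a sketch of my own using the question's struct names; cu_make_tensor is a hypothetical helper): the member is a plain float*, and only the .cu implementation file touches the CUDA API.

// tensor_gpu.cuh -- plain C++, safe to include from any translation unit
struct cu_shape {
    int n_dims;
    int x, y, z;
    int len;
};

struct cu_tensor {
    float* array;   // points to device memory, but is itself a plain pointer
    cu_shape shape;
};

cu_tensor cu_make_tensor(cu_shape s);  // implemented in a .cu file

// tensor_gpu.cu -- only this file is compiled by nvcc
// #include "tensor_gpu.cuh"
// cu_tensor cu_make_tensor(cu_shape s) {
//     cu_tensor t{nullptr, s};
//     cudaMalloc(&t.array, s.len * sizeof(float));
//     return t;
// }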
I'm wondering if there is an elegant way to write a Matlab/Mex interface to templated C++ code.
Here is the situation: say I have written a very nice set of C++ template classes like so:
template<typename FooType>
class Foo {
private:
    // Some data
public:
    // Constructors, methods
};

template<typename BarType>
class Bar {
private:
    // Some data
public:
    // Constructors, methods
};

template<typename FooType, typename BarType>
class MyClass {
private:
    Foo<FooType> *D;
    Bar<BarType> *M;
public:
    // Constructors, methods
};
If I have, say, n different FooTypes and m different BarTypes, I have n*m different parameter choices for MyClass (FooType and BarType are the template parameter names, incidentally). Now, if I were writing a C++ program to use these classes, it would look very simple:
int main()
{
Foo<fooType1> *F = new Foo<fooType1>(params);
Bar<barType3> *B = new Bar<barType3>(params);
MyClass<fooType1,barType3> *M = new MyClass<fooType1,barType3>(F,B);
M->doThing();
return 0;
}
I compile main.cpp and run it and rejoice. This works because I have selected template parameters at compile time, as per C++ template design. This works very well for me so far, and I like the design.
Now suppose I want to write the same type of program as main.cpp, but using Matlab as the scripting language and a Mex interface to my classes.hpp. The main reason for wanting to do this is that my code is an add-on to an existing Matlab package. Is there an elegant way to write my mex file so that I have access to every possible pair of template parameters? What I have started to do is write my interface file with a lot of switch statements so that I can select FooType and BarType - essentially the Mex file compiles every possible (n*m) class instance and leaves them sitting there for Matlab to use. This seems OK for now (I have n=3, m=2), but it also seems sloppy and difficult to maintain.
I have thought about making the "user" re-compile the mex file every time they want to choose a different FooType and BarType, but this also seems a bit irritating (to the average Matlab user anyway). Thanks for your input!
I'm not familiar with Mex; I'm just a C++ user. What you describe is not possible without direct MATLAB support for run-time library generation, which I don't think is even a thing that exists. You will definitely need to pre-instantiate everything at compile time. This is not simple.
The problem, as you have said, is that the templates are all generated at compile time, and they aren't used until runtime, when all the type information is gone. The mangled names shouldn't be an issue, assuming MATLAB knows the mangling scheme.
So you'll have to solve this at the C++ level. There is no type-safe way in C++ to go from the name of a type as a run-time string to the type itself, so the only real option is to use macros to eliminate some of the bloat of instantiating every combination. This is unfortunate, because if it were possible, a very elegant solution would exist with a std::tuple of all your valid parameter types and a little recursive template magic.
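For what it's worth, here is a condensed sketch (mine, not the answerer's) of the switch-based pre-instantiation the question describes, with hypothetical tag enums standing in for whatever MATLAB passes as type selectors:

#include <stdexcept>

// Hypothetical stand-ins for the question's parameter types.
struct fooType1 {}; struct fooType2 {};
struct barType1 {}; struct barType2 {};

template<typename FooType, typename BarType>
class MyClass {
public:
    void doThing() { /* ... */ }
};

// Runtime tags, e.g. decoded from a MATLAB string argument in mexFunction.
enum class FooTag { f1, f2 };
enum class BarTag { b1, b2 };

// The nested branches force every (FooType, BarType) pair to be
// instantiated at compile time; the tags merely select one at run time.
void dispatch(FooTag f, BarTag b) {
    switch (f) {
    case FooTag::f1:
        if (b == BarTag::b1) MyClass<fooType1, barType1>{}.doThing();
        else                 MyClass<fooType1, barType2>{}.doThing();
        return;
    case FooTag::f2:
        if (b == BarTag::b1) MyClass<fooType2, barType1>{}.doThing();
        else                 MyClass<fooType2, barType2>{}.doThing();
        return;
    }
    throw std::invalid_argument("unknown type pair");
}

Every branch costs a full instantiation in the compiled mex binary, which is exactly the n*m bloat the macros would merely make shorter to write.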
Let's say there is a C++ functor:
class Dummy
{
public:
int operator() (const int a, const int b)
{
return a+b;
}
};
This functor doesn't use anything that can't execute on a GPU, but it can't be called from a CUDA kernel because there is no __device__ declaration in front of operator(). I would like to create a factory class that converts such functors into device-compatible functors that can be called within a CUDA kernel. For example:
Dummy d;
auto cuda_d = CudaFunctorFactory.get(d);
Can this be accomplished in any way? Feel free to add some constraints as long as it can be accomplished...
The one word answer is no, this isn't possible.
There is no getting around the fact that in the CUDA compilation model, any method code contained in a class or structure which will execute on the GPU must be statically declared and defined at compile time. Somewhere in that code, there has to be a __device__ function available during compilation, otherwise the compilation fails. That is a completely non-negotiable cornerstone of CUDA as it exists today.
A factory design pattern can't sidestep that requirement. Further, I don't think it is possible to implement a factory for GPU instances in host code, because there still isn't any way of directly accessing __device__ function pointers from the host, and no way of directly instantiating a GPU class from the host because the constructor must execute on the GPU. At the moment, the only program units which the host can run on the GPU are __global__ functions (i.e. kernels), and these cannot be contained within classes. In CUDA, GPU classes passed by argument must be concretely defined; virtual methods aren't supported (and there is no RTTI). That eliminates all the paths I can think of to implement a factory in CUDA C++ for the GPU.
In summary, I don't see any way to make magic that can convert host code to device code at runtime.
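What is possible, as a compile-time alternative (my sketch, not part of the answer): if you can edit or wrap the functor's source, annotating operator() with __host__ __device__ lets one definition serve both sides.

struct Dummy {
    __host__ __device__
    int operator()(int a, int b) const { return a + b; }
};

// A kernel templated on the functor type: the compiler statically
// instantiates the __device__ path, which satisfies exactly the
// compile-time requirement the answer describes.
template <typename F>
__global__ void apply(F f, int a, int b, int* out) {
    *out = f(a, b);
}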
I have a complex algorithm. This uses many variables, calculates helper arrays at initialization and also calculates arrays along the way. Since the algorithm is complex, I break it down into several functions.
Now, I actually do not see how this would idiomatically be a class; I mean, I am just used to having algorithms as functions. The usage would simply be:
Calculation calc(/* several parameters */);
calc.calculate();
// get the heterogenous results via getters
On the other hand, putting this into a class has the following advantages:
I do not have to pass all the variables to the other functions/methods
arrays initialized at the beginning of the algorithm are accessible throughout the class in each function
my code is shorter and (imo) clearer
A hybrid way would be to put the algorithm class into a source file and access it via a function that uses it. The user of the algorithm would not see the class.
Does anyone have valuable thoughts that might help me out?
Thank you very much in advance!
I have a complex algorithm. This uses many variables, calculates helper arrays at initialization and also calculates arrays along the way.[...]
Now, I actually do not see how this might be a class from an idiomatic way
It is not, but many people do the same thing you do (so did I a few times).
Instead of creating a class for your algorithm, consider transforming your inputs and outputs into classes/structures.
That is, instead of:
Calculation calc(a, b, c, d, e, f, g);
calc.calculate();
// use getters on calc from here on
you could write:
CalcInputs inputs(a, b, c, d, e, f, g);
CalcResult output = calculate(inputs); // calculate is now free function
// use getters on output from here on
This doesn't create any problems and performs the same (actually better) grouping of data.
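A slightly fuller, self-contained sketch of that shape (all names are mine, mirroring the snippet above):

#include <vector>

struct CalcInputs {
    double a, b;                 // the former constructor parameters
    std::vector<double> table;   // helper data the algorithm needs
};

struct CalcResult {
    double value;
    std::vector<double> series;  // heterogeneous results, no getters needed
};

// calculate stays a free function; internal helpers can share state
// simply by passing these structs around.
CalcResult calculate(const CalcInputs& in) {
    CalcResult out{};
    out.value = in.a + in.b;     // placeholder for the real computation
    return out;
}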
I'd say it is very idiomatic to represent an algorithm (or perhaps better, a computation) as a class. One of the definitions of a class in OOP is "data and the functions that operate on that data." A complex algorithm with its inputs, outputs and intermediary data matches this definition perfectly.
I've done this myself several times, and it simplifies (human) code flow analysis significantly, making the whole thing easier to reason about, to debug and to test.
If the abstraction for the client code is an algorithm, you probably want to keep a pure functional interface, and not introduce additional types there. It's quite common, on the other hand, for such a function to be implemented in a source file which defines a common data structure or class for its internal use, so you might have:
double calculation( /* input parameters */ )
{
    SupportClass calc( /* input parameters */ );
    calc.part1();
    calc.part2();
    // etc...
    return calc.results();
}
Depending on how your code is organized, SupportClass will be in an unnamed namespace in the source file (probably the most common case), or in a "private" header, included only by the sources involved in the algorithm.
It really depends on what kind of algorithm you want to encapsulate. Generally I agree with John Carmack: "Sometimes, the elegant implementation is just a function. Not a method. Not a class. Not a framework. Just a function."
It really boils down to: does the algorithm need access to private parts of the class that are not supposed to be public? If the answer is yes, you should go with a member function (unless you are willing to refactor your class interface, depending on the specific case); if not, then a free function is good enough.
Take the standard library, for example. Most of its algorithms are provided as free functions because they only need access to the public interface of the class (via iterators for the standard containers, for example).
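A small illustration (mine, not from the answer) of that dividing line:

#include <algorithm>
#include <vector>

void demo() {
    std::vector<int> v{3, 1, 2};
    std::sort(v.begin(), v.end()); // free function: public iterators suffice
    v.resize(5);                   // member function: needs private internals
}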
Do you need to call the exact same functions in the exact same order each time? Then you shouldn't be requiring calling code to do this. Splitting your algorithm into multiple functions is fine, but I'd still have one call the next and then the next and so on, with a struct of results/parameters being passed along the way. A class doesn't feel right for a one-off invocation of some procedure.
The only way I'd do this with a class is if the class encapsulates all the input data itself, and you then call myClass.nameOfMyAlgorithm() on it, among other potential operations. Then you have data+manipulators. But just manipulators? Yeah, I'm not so sure.
In modern C++ the distinction has been eroded quite a bit. Even from the operator overloading of the pre-ANSI language, you could create a class whose instances are syntactically like functions:
struct Multiplier
{
int factor_;
Multiplier(int f) : factor_(f) { }
int operator()(int v) const
{
return v * factor_;
}
};
Multiplier doubler(2);
std::cout << doubler(3) << std::endl; // prints 6
Such a class/struct is called a functor, and can capture "contextual" values in its constructor. This allows you to effectively pass the parameters to a function in two stages: some in the constructor call, some later each time you call it for real. This is called partial function application.
To relate this to your example, your calculate member function could be turned into operator(), and then the Calculation instance would be a function! (or near enough.)
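Concretely, that transformation would look something like this (a sketch using the question's Calculation name; the parameters are hypothetical):

class Calculation {
    double a_, b_;   // parameters captured once, at construction
public:
    Calculation(double a, double b) : a_(a), b_(b) {}
    double operator()() const { return a_ + b_; }  // placeholder algorithm
};

// Calculation calc(1.0, 2.0);
// double result = calc();   // the instance is now callable like a function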
To unify these ideas, you can try thinking of a plain function as a functor of which there is only one instance (and hence no need for a constructor - although this is no guarantee that the function only depends on its formal parameters: it might depend on global variables...)
Rather than asking "Should I put this algorithm in a function or a class?", instead ask yourself "Would it be useful to be able to pass the parameters to this algorithm in two or more stages?" In your example, all the parameters go into the constructor, and none into the later call to calculate, so it makes little sense to ask users of your class to make two calls.
In C++11 the distinction breaks down further (and things get a lot more convenient), in recognition of the fluidity of these ideas:
auto doubler = [] (int val) { return val * 2; };
std::cout << doubler(3) << std::endl; // prints 6
Here, doubler is a lambda, which is essentially a nifty way to declare an instance of a compiler-generated class that implements the () operator.
Reproducing the original example more exactly, we would want a function-like thing called multiplier that accepts a factor, and returns another function-like thing that accepts a value v and returns v * factor.
auto multiplier = [] (int factor)
{
return [=] (int v) { return v * factor; };
};
auto doubler = multiplier(2);
std::cout << doubler(3) << std::endl; // prints 6
Note the pattern: ultimately we're multiplying two numbers, but we specify the numbers in two steps. The functor we get back from calling multiplier acts like a "package" containing the first number.
Although lambdas are relatively new, they are likely to become a very common part of C++ style (as they have in every other language they've been added to).
But sadly at this point we've reached the "cutting edge", as the above example works in GCC but not in MSVC 12 (I haven't tried it in MSVC 13). It does pass the IntelliSense checking of MSVC 12, though (they use two completely different compilers)! And you can fix it by wrapping the inner lambda with std::function<int(int)>( ... ).
Even so, you can use these ideas in old-school C++ when writing functors by hand.
Looking further ahead, resumable functions may make it into some future version of the language (Microsoft is pushing hard for them as they are practically identical to async/await in C#) and that is yet another blurring of the distinction between functions and classes (a resumable function acts like a constructor for a state machine class).
Is it possible to serialize and deserialize a std::function, a function object, or a closure in general in C++? How? Does C++11 facilitate this? Is there any library support available for such a task (e.g., in Boost)?
For example, suppose a C++ program has a std::function which is needed to be communicated (say via a TCP/IP socket) to another C++ program residing on another machine. What do you suggest in such a scenario?
Edit:
To clarify, the functions which are to be moved are supposed to be pure and side-effect-free. So I do not have security or state-mismatch problems.
A solution to the problem is to build a small embedded domain specific language and serialize its abstract syntax tree.
I was hoping that I could find some language/library support for moving a machine-independent representation of functions instead.
Yes for function pointers and closures. Not for std::function.
A function pointer is the simplest — it is just a pointer like any other so you can just read it as bytes:
#include <cstring>
#include <string>

template <typename Res, typename... Args>
std::string serialize(Res (*fn_ptr)(Args...)) {
    return std::string(reinterpret_cast<const char*>(&fn_ptr), sizeof(fn_ptr));
}

template <typename Res, typename... Args>
Res (*deserialize(const std::string& str))(Args...) {
    Res (*fn_ptr)(Args...);
    std::memcpy(&fn_ptr, str.data(), sizeof(fn_ptr)); // avoids alignment/aliasing UB
    return fn_ptr;
}
But I was surprised to find that even without recompilation the address of a function will change on every invocation of the program. Not very useful if you want to transmit the address. This is due to ASLR, which you can turn off on Linux by starting your_program with setarch $(uname -m) -LR your_program.
Now you can send the function pointer to a different machine running the same program, and call it! (This does not involve transmitting executable code. But unless you are generating executable code at run-time, I don't think you are looking for that.)
A lambda function is quite different.
std::function<int(int)> addN(int N) {
auto f = [=](int x){ return x + N; };
return f;
}
The value of f will be the captured int N. Its representation in memory is the same as an int! The compiler generates an unnamed class for the lambda, of which f is an instance. This class has operator() overloaded with our code.
The class being unnamed presents a problem for serialization. It also presents a problem for returning lambda functions from functions. The latter problem is solved by std::function.
std::function, as far as I understand, is implemented by creating a templated wrapper class which effectively holds a reference to the unnamed class behind the lambda function through the template type parameter. (This is _Function_handler in libstdc++'s <functional>.) std::function takes a function pointer to a static method (_M_invoke) of this wrapper class and stores that plus the closure value.
Unfortunately, everything is buried in private members and the size of the closure value is not stored. (It does not need to, because the lambda function knows its size.)
So std::function does not lend itself to serialization, but works well as a blueprint. I followed what it does, simplified it a lot (I only wanted to serialize lambdas, not the myriad other callable things), saved the size of the closure value in a size_t, and added methods for (de)serialization. It works!
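A rough reconstruction of that approach (mine, with all the same caveats: identical binaries on both ends, ASLR disabled, and trivially copyable captures only) might look like:

#include <cstring>
#include <string>

class SerializableFn {
    using Invoker = int (*)(const char*, int);
    Invoker invoke_;        // static trampoline; its address is what gets sent
    std::string closure_;   // the lambda's captured state, as raw bytes

public:
    template <typename Lambda>
    SerializableFn(const Lambda& l)
        : invoke_(+[](const char* c, int arg) {
              // Treat the stored bytes as the original closure again
              // (assumes a trivially copyable capture, e.g. plain ints).
              return (*reinterpret_cast<const Lambda*>(c))(arg);
          }),
          closure_(reinterpret_cast<const char*>(&l), sizeof(Lambda)) {}

    int operator()(int arg) const { return invoke_(closure_.data(), arg); }

    // Wire format: trampoline address, closure size, closure bytes.
    std::string serialize() const {
        std::string out(reinterpret_cast<const char*>(&invoke_), sizeof(invoke_));
        std::size_t n = closure_.size();
        out.append(reinterpret_cast<const char*>(&n), sizeof(n));
        return out + closure_;
    }

    // Valid only in an identical binary with ASLR disabled.
    static SerializableFn deserialize(const std::string& s) {
        SerializableFn fn;
        std::memcpy(&fn.invoke_, s.data(), sizeof(fn.invoke_));
        std::size_t n = 0;
        std::memcpy(&n, s.data() + sizeof(fn.invoke_), sizeof(n));
        fn.closure_ = s.substr(sizeof(fn.invoke_) + sizeof(n), n);
        return fn;
    }

private:
    SerializableFn() : invoke_(nullptr) {}
};

Serializing the trampoline's address is what ties this to identical binaries, just like the raw function-pointer trick above.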
No.
C++ has no built-in support for serialization and was never conceived with the idea of transmitting code from one process to another, let alone from one machine to another. Languages that can do so generally feature both an IR (an intermediate representation of the code that is machine independent) and reflection.
So you are left with writing yourself a protocol for transmitting the actions you want, and the DSL approach is certainly workable... depending on the variety of tasks you wish to perform and the need for performance.
Another solution would be to go with an existing language. For example, the Redis NoSQL database embeds a Lua engine and can execute Lua scripts; you could do the same and transmit Lua scripts over the network.
No, but there are some restricted solutions.
The most you can hope for is to register functions in some sort of global map (e.g. with key strings) that is common to the sending code and the receiving code (either in different computers or before and after serialization).
You can then serialize the string associated with the function and get it on the other side.
As a concrete example the library HPX implements something like this, in something called HPX_ACTION.
This requires a lot of protocol and it is fragile with respect to changes in code.
But after all this is no different from something that tries to serialize a class with private data. In some sense the code of the function is its private part (the arguments and return interface is the public part).
What leaves you a sliver of hope is that, depending on how you organize the code, these "objects" can be global or common, and if all goes right they are available during serialization and deserialization through some kind of predefined runtime indirection.
This is a crude example:
serializer code:
// common:
#include <map>
#include <string>

// common:
class C{
    double d;
public:
    C(double d) : d(d) {}
    double operator()(double x) const { return d*x; }
};
C c1{1.};
C c2{2.};
std::map<std::string, C*> const m{{"c1", &c1}, {"c2", &c2}};
// :common
int main(int argc, char** argv){
    C* f = (argc == 2) ? &c1 : &c2;
    (*f)(5.); // yields 5. or 10. depending on the runtime args
    serialize(f); // somehow write "c1" or "c2" to a file
}
deserializer code:
// common:
#include <map>
#include <string>

// common:
class C{
    double d;
public:
    C(double d) : d(d) {}
    double operator()(double x) const { return d*x; }
};
C c1{1.};
C c2{2.};
std::map<std::string, C*> const m{{"c1", &c1}, {"c2", &c2}};
// :common
int main(){
    C* f = deserialize(); // somehow read "c1" or "c2" and look the pointer up in the translation map
    (*f)(3.); // yields 3. or 6. depending on the choice made in the **other** run
}
(code not tested).
Note that this forces a lot of common and consistent code, but depending on the environment you might be able to guarantee this.
The slightest change in the code can produce a hard to detect logical bug.
Also, I played here with global objects (which can be used by free functions), but the same can be done with scoped objects; what becomes trickier then is how to establish the map locally (#include the common code inside a local scope?).