How to integrate CUDA into an existing class structure? - c++

I have a working CPU-based implementation of a simple deep learning framework where the main components are nodes of a computation graph which can perform computations on tensors.
Now I need to extend my implementation to the GPU. I would like to keep the existing class structure and only extend its functionality to the GPU; however, I'm not sure if that's even possible.
Most of the classes have methods that work on and return tensors such as:
tensor_ptr get_output();
where tensor_ptr is simply a std::shared_ptr to my tensor class. What I would like to do is add a GPU version of each such method. The idea I had in mind was to define a struct in a separate file tensor_gpu.cuh as follows:
struct cu_shape {
    int n_dims;
    int x, y, z;
    int len;
};

struct cu_tensor {
    __device__ float * array;
    cu_shape shape;
};
and then the previous function would be mirrored by:
cu_tensor cu_get_output();
The problem seems to be that the .cuh file gets treated as a regular header file, is compiled by the default C++ compiler, and gives the error:
error: attribute "device" does not apply here
on the line with the definition of __device__ float * array.
I am aware that you cannot mix CUDA and pure C++ code, so I planned to hide all the CUDA runtime API calls in .cu files and only declare the wrapper functions in .h files. The problem is that I wanted to store the device pointers within my class and then pass those to the CUDA-calling functions.
This way I could still use all of my existing object structure and only modify the initialization and computation parts.
If a regular C++ class cannot touch anything carrying the __device__ qualifier, how can you even integrate CUDA code into C++ code?
Can you use CUDA runtime calls and keywords literally only in .cu files?
Or is there some smart way to hide from the C++ compiler the fact that it is dealing with CUDA pointers?
Any insight is deeply appreciated!
EDIT: There was a misunderstanding on my part. You don't need the __device__ qualifier on the pointer member; a plain float* can hold a device address and still be stored in ordinary host-compiled code. If you have something valuable to add about good practices on CUDA integration or want to clarify something else, don't hesitate!
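To illustrate the approach described in the EDIT, here is a minimal sketch under assumed names (tensor_gpu.h, cu_alloc and cu_free are illustrative, not from the original code): the header is plain C++ and only stores raw pointers, while every CUDA runtime call lives in a .cu file compiled by nvcc.
// tensor_gpu.h -- plain C++, safe to include from any host translation unit
struct cu_shape {
    int n_dims;
    int x, y, z;
    int len;
};

struct cu_tensor {
    float* array;      // holds a device address, but no __device__ needed here
    cu_shape shape;
};

cu_tensor cu_alloc(const cu_shape& shape);   // implemented in a .cu file
void cu_free(cu_tensor& t);

// tensor_gpu.cu -- compiled by nvcc, the only place CUDA runtime calls appear
#include "tensor_gpu.h"
#include <cuda_runtime.h>

cu_tensor cu_alloc(const cu_shape& shape) {
    cu_tensor t;
    t.shape = shape;
    cudaMalloc(&t.array, shape.len * sizeof(float));
    return t;
}

void cu_free(cu_tensor& t) {
    cudaFree(t.array);
    t.array = nullptr;
}
The host-side class can then hold a cu_tensor member and call cu_alloc / cu_free without ever seeing a CUDA keyword.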

Identifiers containing a double underscore ('__') are reserved for the implementation. That is why the NVIDIA toolchain can use __device__, while a "regular" C++ implementation has its own reserved symbols and does not recognize it.
In hindsight, NVIDIA could have designed a better solution, but that is not going to help you here.


Wrapper over Graphics APIs

I'm a huge fan of having a game engine that has the ability to adapt, not just in what it can do, but also in how it can handle new code. Recently, for my graphics subsystem, I wrote a class to be overridden that works like this:
class LowLevelGraphicsInterface {
public:
    virtual bool setRenderTarget(const RenderTarget* renderTarget) = 0;
    virtual bool setStreamSource(const VertexBuffer* vertexBuffer) = 0;
    virtual bool setShader(const Shader* shader) = 0;
    virtual bool draw(void) = 0;
    //etc.
};
My idea was to create a list of functions that are universal among most graphics APIs. Then for DirectX11 I would just create a new child class:
class LGI_DX11 : public LowLevelGraphicsInterface {
public:
    virtual bool setRenderTarget(const RenderTarget* renderTarget);
    virtual bool setStreamSource(const VertexBuffer* vertexBuffer);
    virtual bool setShader(const Shader* shader);
    virtual bool draw(void);
    //etc.
};
Each of these functions would then interface with DX11 directly. I do realize that there is a layer of indirection here. Are people turned off by this fact?
Is this a widely used method? Is there something else I could/should be doing? There is the option of using the preprocessor but that seems messy to me. Someone also mentioned templates to me. What do you guys think?
If the virtual function calls become a problem, there is a compile-time method that removes them using a small amount of preprocessor and a compiler optimization. One possible implementation is something like this:
Declare your base renderer with pure virtual functions:
class RendererBase {
public:
    virtual bool Draw() = 0;
};
Declare a specific implementation:
#include <d3d11.h>

class RendererDX11 : public RendererBase {
public:
    bool Draw();
private:
    // D3D11 specific data
};
Create a header RendererTypes.h to forward declare your renderer based on the type you want to use with some preprocessor:
#ifdef DX11_RENDERER
class RendererDX11;
typedef RendererDX11 Renderer;
#else
class RendererOGL;
typedef RendererOGL Renderer;
#endif
Also create a header Renderer.h to include appropriate headers for your renderer:
#ifdef DX11_RENDERER
#include "RendererDX11.h"
#else
#include "RendererOGL.h"
#endif
Now, everywhere you use your renderer, refer to it as the Renderer type; include RendererTypes.h in your header files and Renderer.h in your cpp files, as in the sketch below.
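For illustration, a minimal usage sketch (the Game class and its members are assumptions, not part of the answer): headers only ever see the forward declaration, while .cpp files see the concrete renderer selected by the build configuration.
// Game.h
#include "RendererTypes.h"      // forward declaration + typedef only
class Game {
public:
    void Render();
private:
    Renderer* mRenderer;        // a pointer to the forward-declared type is enough here
};

// Game.cpp
#include "Renderer.h"           // pulls in RendererDX11.h or RendererOGL.h
void Game::Render() {
    mRenderer->Draw();          // the concrete type was selected at compile time
}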
Each of your renderer implementations should be in different projects. Then create different build configurations to compile with whichever renderer implementation you want to use. You don't want to include DirectX code for a Linux configuration for example.
In debug builds, virtual function calls might still be made, but in release builds they are optimized away because you are never making calls through the base class interface. It is only being used to enforce a common signature for your renderer classes at compile time.
While you do need a little bit of preprocessor for this method, it is minimal and doesn't interfere with the readability of your code since it is isolated and limited to some typedefs and includes. The one downside is that you cannot switch renderer implementations at runtime using this method as each implementation will be built to a separate executable. However, there really isn't much need for switching configurations at runtime anyway.
I use the approach with an abstract base class to the render device in my application. It works fine and lets me choose the renderer dynamically at runtime (I use it to fall back from DirectX10 to DirectX9 if the former is not supported, e.g. on Windows XP).
I would like to point out that it is not the virtual function call that costs performance, but the conversion of the argument types involved. To be really generic, the public interface to the renderer uses its own set of parameter types, such as a custom IShader and a custom Matrix3D type. No type declared in the DirectX API is visible to the rest of the application, since e.g. OpenGL has different matrix types and shader interfaces. The downside of this is that I have to convert all matrix and vector/point types from my custom types to the ones the shaders use inside the concrete render device implementation. This is far more expensive than the cost of a virtual function call.
If you do the distinction using the preprocessor, you also need to map the different interface types like this. Many are the same between DirectX10 and DirectX11, but not between DirectX and OpenGL.
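A tiny sketch of the conversion cost described above (Matrix3D and BackendMatrix are both illustrative stand-ins, not types from the answer):
struct Matrix3D      { float m[16]; };   // engine-facing, API-agnostic type
struct BackendMatrix { float m[16]; };   // stand-in for whatever the API expects

BackendMatrix toBackend(const Matrix3D& src) {
    BackendMatrix dst;
    for (int i = 0; i < 16; ++i)
        dst.m[i] = src.m[i];             // paid on every call that crosses the interface
    return dst;                          // this copy, not the virtual dispatch, is the real cost
}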
Edit: See the answer to "c++ Having multiple graphics options" for an example implementation.
So, I realize that this is an old question, but I can't resist chiming in. Wanting to write code like this is just a side effect of trying to cope with object-oriented indoctrination.
The first question is whether or not you really need to swap out rendering back-ends, or just think it's cool. If an appropriate back-end can be determined at build time for a given platform, then problem solved: use a plain, non-virtual interface with an implementation selected at build time.
If you find that you really do need to swap it out, still use a non-virtual interface, just load the implementations as shared libraries. With this kind of swapping, you will likely want both engine rendering code and some performance intensive game-specific rendering code factored out and swappable. That way, you can use the common, high-level engine rendering interface for things done mostly by the engine, while still having access to back-end specific code to avoid the conversion costs mentioned by PMF.
Now, it should be said that while swapping with shared libraries introduces indirection, (1) you can easily get that indirection down to less than or roughly equal to the cost of virtual calls, and (2) this high-level indirection is never a performance concern in any substantial game/engine. The main benefit is keeping dead code unloaded (and out of the way) and simplifying APIs and overall project design, increasing readability and comprehension.
Beginners aren't typically aware of this, because there is so much blind OO pushing these days, but this style of "OO first, ask questions never" is not without cost. This kind of design has a taxing code comprehension cost and leads to code (much lower-level than this example) that is inherently slow. Object orientation has its place, certainly, but (in games and other performance intensive applications) the best way to design that I have found is to write applications as minimally OO as possible, only conceding when a problem forces your hand. You will develop an intuition for where to draw the line as you gain more experience.

CUDA device functors factory

Let's say there is a C++ functor:
class Dummy
{
public:
    int operator() (const int a, const int b)
    {
        return a + b;
    }
};
This functor doesn't use any function that can't execute on the GPU, but it can't be called from a CUDA kernel because there is no __device__ declaration in front of operator(). I would like to create a factory class that converts such functors into device-compatible functors that can be called within a CUDA kernel. For example:
Dummy d;
auto cuda_d = CudaFunctorFactory.get(d);
Can this be accomplished in any way? Feel free to add some constraints as long as it can be accomplished...
The one word answer is no, this isn't possible.
There is no getting around the fact that in the CUDA compilation model, any method code contained in a class or structure which will execute on the GPU must be statically declared and defined at compile time. Somewhere in that code, there has to be a __device__ function available during compilation, otherwise the compilation fails. That is a completely non-negotiable cornerstone of CUDA as it exists today.
A factory design pattern can't sidestep that requirement. Further, I don't think it is possible to implement a factory for GPU instances in host code, because there still isn't any way of directly accessing __device__ function pointers from the host, and no way of directly instantiating a GPU class from the host, because the constructor must execute on the GPU. At the moment, the only program units which the host can run on the GPU are __global__ functions (i.e. kernels), and these cannot be contained within classes. In CUDA, GPU classes passed by argument must be concretely defined; virtual methods aren't supported (and there is no RTTI). That eliminates all the paths I can think of to implement a factory in CUDA C++ for the GPU.
In summary, I don't see any way to make magic that can convert host code to device code at runtime.
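What does work, consistent with the answer above, is declaring the functor as device-callable up front and passing it to a templated kernel by value. A minimal sketch, with DeviceDummy and apply_kernel as illustrative names:
#include <cstdio>
#include <cuda_runtime.h>

struct DeviceDummy {
    __host__ __device__ int operator()(int a, int b) const { return a + b; }
};

template <typename F>
__global__ void apply_kernel(F f, int a, int b, int* out) {
    *out = f(a, b);    // the functor was passed by value and compiled for the device
}

int main() {
    int* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(int));
    apply_kernel<<<1, 1>>>(DeviceDummy{}, 2, 3, d_out);
    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("%d\n", h_out);    // prints 5
    cudaFree(d_out);
    return 0;
}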

Declaring CPP objects inside cuda kernel

I am new to CUDA and I need to know its limits before porting my C++ project to CUDA.
Suppose I have a C++ class called MyClass. Knowing that CUDA uses C99, is it possible to declare an object of type MyClass inside a kernel? Would the code snippet below be appropriate?
__global__ void SolveBlaBlaBLa(int x, ...)
{
    MyClass obj1;
    ...
}
Thanks in Advance,
- Ruru
Just providing an answer to get this off the unanswered list. I think @JaredHoberock will not mind.
In general, CUDA supports a large subset of C++ functionality, including support for objects in device code.
Any code that executes on the device, however, must be properly decorated. For ordinary individual functions (not kernels), the decorator that the compiler recognizes to create a device-callable version of the code is __device__. This applies to any object method which may be used on the device, including constructors and destructors.
You may also wish to familiarize yourself with other restrictions on C++ classes used in the device code, as documented in the programming guide.
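For illustration, a minimal sketch (only the names MyClass and SolveBlaBlaBLa come from the question; the bodies and parameters are assumptions) showing how decorated methods let an object be constructed and used inside a kernel:
class MyClass {
public:
    __host__ __device__ MyClass() : value_(0) {}
    __host__ __device__ void increment() { ++value_; }
    __host__ __device__ int value() const { return value_; }
private:
    int value_;
};

__global__ void SolveBlaBlaBLa(int x, int* out)
{
    MyClass obj1;                          // constructor runs on the device
    for (int i = 0; i < x; ++i)
        obj1.increment();
    *out = obj1.value();
}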

Using external library classes in CUDA project

I am trying to enhance a small C++ project with CUDA.
My project is using a custom library's classes and functions for example Matrix3d, Vector3d, Plane2d etc. They are mostly geometric objects.
When I try to use my code on the device (either in __host__ __device__ functions or in a kernel), all the library functions/objects are treated as host code and I get multiple warnings and errors, for example: error: identifier "Plane3d::~Plane3d" is undefined in device code
Is there a way to use my library on device as well? How is it done?
I don't have experience on CUDA and C++ (I have only used CUDA with simple C code without classes) so I don't get the strategy very well.
Is there a method to avoid changing the library source code? It is possible to change the library's code but it would be better if I could avoid it.
Thanks a lot.
There is no particular problem with using C++ classes in CUDA. The object model is only slightly different to standard C++.
Any structure or class data members are automatically defined in whichever memory space (host or device) the class or structure is instantiated in. What is not automatic is the code generation for member functions and operators within classes and structures. The programmer must explicitly define and compile those for whichever memory space the object will be instantiated in. This latter requirement means you must have both __device__ and __host__ definitions of each function you call within the object, including the constructor and destructor, the latter being the error you show in your question.
You don't need to change the source code - what you need is to write an adapter.
CUDA kernels work with low-level structures, e.g. double*, double**, double*** or float*, float**, float***, as well as with the built-in CUDA types.
CUDA cannot work directly on memory allocated outside CUDA anyway (only on memory allocated on the graphics card, not regular RAM), so you will have to copy your data into graphics memory.
If you provide methods that give access to the buffers used by your types, you can copy them to the graphics card (using the CUDA memory copy functions), in one go if your types use contiguous memory or in chunks if not, and then process them with kernels as e.g. double*** using simple indexing.
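A minimal sketch of such an adapter (Matrix3d is the library type from the question; the data() accessor and the 3x3 contiguous layout are assumptions): copy the object's buffer to the device, run a kernel on the raw floats, and copy the result back.
#include <cuda_runtime.h>

__global__ void scale_kernel(float* data, int len, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len)
        data[i] *= factor;
}

void scale_on_gpu(Matrix3d& m, float factor)
{
    const int len = 9;                    // assumed 3x3 contiguous storage
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, len * sizeof(float));
    cudaMemcpy(d_buf, m.data(), len * sizeof(float), cudaMemcpyHostToDevice);
    scale_kernel<<<1, len>>>(d_buf, len, factor);
    cudaMemcpy(m.data(), d_buf, len * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}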

Implications of using std::vector in a dll exported function

I have two dll-exported classes A and B. A's declaration contains a function which uses a std::vector in its signature like:
class EXPORT A {
    // ...
    std::vector<B> myFunction(std::vector<B> const &input);
};
(EXPORT is the usual macro that expands to __declspec(dllexport)/__declspec(dllimport) accordingly.)
Reading about the issues related to using STL classes in a DLL interface, I gather in summary:
Using std::vector in a DLL interface would require all the clients of that DLL to be compiled with the same version of the same compiler, because STL containers are not binary compatible. Even worse, depending on how clients use that DLL in conjunction with other DLLs, the 'unstable' DLL API can break these client applications when system updates are installed (e.g. Microsoft KB packages) (really?).
Despite the above, if required, std::vector can be used in a DLL API by exporting std::vector<B> like:
template class EXPORT std::allocator<B>;
template class EXPORT std::vector<B>;
though this is usually mentioned in the context where one wants to use std::vector as a member of A (http://support.microsoft.com/kb/168958).
The following Microsoft Support Article discusses how to access std::vector objects created in a DLL through a pointer or reference from within the executable (http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q172396). The above solution to use template class EXPORT ... seems to be applicable too. However, the drawback summarized under the first bullet point seems to remain.
To completely get rid of the problem, one would need to wrap std::vector and change the signature of myFunction, use PIMPL, etc.
My questions are:
Is the above summary correct, or do I miss here something essential?
Why does compilation of my class 'A' not generate warning C4251 (class 'std::vector<_Ty>' needs to have dll-interface to be used by clients of...)? I have no compiler warnings turned off and I don't get any warning on using std::vector in myFunction in exported class A (with VS2005).
What needs to be done to correctly export myFunction in A? Is it viable to just export std::vector<B> and B's allocator?
What are the implications of returning std::vector by value? Assume a client executable that has been compiled with a different compiler (version). Does trouble persist when returning by value, where the vector is copied? I guess yes. Similarly for passing std::vector as a constant reference: could accessing a std::vector<B> (which might have been constructed by an executable compiled with a different compiler version) lead to trouble within myFunction? I guess yes again.
Is the last bullet point listed above really the only clean solution?
Many thanks in advance for your feedback.
Unfortunately, your list is very much spot-on. The root cause is that DLL-to-DLL or DLL-to-EXE interoperability is defined at the level of the operating system, while the interface between functions is defined at the level of the compiler. In a way, your task is similar (although somewhat easier) to that of client-server interaction, when the client and the server lack binary compatibility.
The compiler maps what it can to the way the DLL importing and exporting is done in a particular operating system. Since language specifications give compilers a lot of liberty when it comes to binary layout of user-defined types and sometimes even built-in types (recall that the exact size of int is compiler-dependent, as long as minimal sizing requirements are met), importing and exporting from DLLs needs to be done manually to achieve binary-level compatibility.
When you use the same version of the same compiler, this last issue above does not create a problem. However, as soon as a different compiler enters the picture, all bets are off: you need to go back to the plainly-typed interfaces, and introduce wrappers to maintain nice-looking interfaces inside your code.
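A minimal sketch of such a plainly-typed boundary plus wrapper (myFunctionRaw and its capacity policy are assumptions, not from the question; note that B itself must still have a layout both sides agree on):
#include <vector>

// exported, plainly-typed boundary -- no STL types cross the DLL edge
extern "C" EXPORT int myFunctionRaw(const B* input, size_t inputCount,
                                    B* output, size_t outputCapacity);

// header-only wrapper, compiled into each client with that client's own STL
inline std::vector<B> myFunction(const std::vector<B>& input) {
    std::vector<B> result(input.size());            // capacity policy is an assumption
    const int written = myFunctionRaw(input.data(), input.size(),
                                      result.data(), result.size());
    result.resize(written > 0 ? static_cast<size_t>(written) : 0);
    return result;
}
Because the wrapper lives entirely in the header, each client builds it against its own STL, and only plain pointers and sizes ever cross the DLL boundary.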
I've been having the same problem and discovered a neat solution to it.
Instead of passing std::vector, you can pass a QVector from the Qt library.
The problems you quote are then handled inside the Qt library and you do not need to deal with it at all.
Of course, the cost is having to use the library and accept its slightly worse performance.
In terms of the amount of coding and debugging time it saves you, this solution is well worth it.