I am using CUDA 5.0 and a Compute Capability 2.1 card.
The question is quite straightforward: Can a kernel be part of a class?
For example:
class Foo
{
private:
    //...
public:
    __global__ void kernel();
};

__global__ void Foo::kernel()
{
    //implementation here
}
If not, is the solution to make a wrapper function that is a member of the class and calls the kernel internally?
And if yes, will it have access to the private attributes like a normal member function?
(I'm not just trying it to see what happens because my project has several other errors right now, and I also think it's a good reference question. It was difficult for me to find references on using CUDA with C++; examples of basic functionality can be found, but not strategies for structured code.)
Let me leave cuda dynamic parallelism out of the discussion for the moment (i.e. assume compute capability 3.0 or prior).
Remember that __global__ is used for CUDA functions that will (only) be called from the host (but execute on the device). If you instantiate this object on the device, it won't work. Furthermore, for device-accessible private data to be available to the member function, the object would have to be instantiated on the device.
So you could have a kernel invocation (i.e., mykernel<<<blocks,threads>>>(...);) embedded in a host object member function, but the kernel definition (i.e., the function definition with the __global__ decorator) would normally precede the object definition in your source code. And as stated already, such a methodology could not be used for an object instantiated on the device. It would also not have access to ordinary private data defined elsewhere in the object. (It may be possible to come up with a scheme for a host-only object that creates device data, using pointers in global memory, which would then be accessible on the device, but such a scheme seems quite convoluted to me at first glance.)
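As a sketch of that wrapper pattern (all names here are illustrative, not from the question):

__global__ void increment_kernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

class DeviceArray
{
private:
    int *d_data;   // device pointer held by a host-side object
    int n;
public:
    DeviceArray(int size) : n(size) { cudaMalloc(&d_data, n * sizeof(int)); }
    ~DeviceArray() { cudaFree(d_data); }
    void increment()   // host member function that launches the file-scope kernel
    {
        increment_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    }
};

Note that the kernel receives the private device pointer as an ordinary argument; it does not access the object's private data directly.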
Normally, device-usable member functions would be preceded by the __device__ decorator. In this case, all the code in the device member function executes from within the thread that called it.
This question gives an example (in my edited answer) of a C++ object with a member function callable from both the host and the device, with appropriate data copying between host and device objects.
Related
When using dynamic libraries, I understand that we should only pass plain-old-data structures across boundaries. So can we pass a pointer to a base class?
My idea is that the application and the library could both be aware of a common interface (with pure virtual methods, = 0).
The library could instantiate a subtype of that interface,
and the application could use it.
For instance, is the following snippet safe?
// file interface.h
#include <string>

class IPrinter
{
public:   // print() must be public for main() to call it
    virtual void print(std::string str) = 0;
};
-
// file main.cpp
#include "interface.h"

IPrinter* plugin_get_printer();   // provided by the plugin

int main(){
    //load plugin...
    IPrinter* printer = plugin_get_printer();
    printer->print( std::string{"hello"} );
}
-
// file plugin.cpp (compiled by another compiler)
#include "interface.h"

class PrinterImpl : public IPrinter { public: void print(std::string) override { /*...*/ } };

IPrinter* plugin_get_printer(){
    return new PrinterImpl{};
}
This snippet is not safe:
the two sides of your DLL boundary do not use the same compiler. This means that the name mangling (for function names) and the vtable layout (for virtual functions) might not be the same (both are implementation specific).
the heap on both sides may also be managed differently, so you run risks when deleting your object if the deletion is not done inside the DLL.
This article presents the main challenges of binary-compatible interfaces very well.
You may, however, pass a pointer to the other side of the mirror as part of a POD, as long as the other side doesn't use it by itself (e.g., your app passes a pointer to a configuration object to the DLL; later another DLL function returns that pointer to your app, and your app can then use it as expected, at least if it wasn't a pointer to a local object that no longer exists).
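One common way to make the boundary safe is to reduce it to a plain C ABI and keep allocation and deallocation on the DLL side. A minimal sketch (the function names are mine, not a standard API):

// interface.h -- shared between app and plugin
extern "C" {
    typedef struct printer_t printer_t;                      // opaque handle; only a pointer crosses the boundary
    printer_t *plugin_create_printer(void);                  // allocates inside the DLL
    void       plugin_print(printer_t *p, const char *str);  // plain C types only, no std::string
    void       plugin_destroy_printer(printer_t *p);         // frees on the same heap that allocated
}

extern "C" disables name mangling, the opaque struct keeps any vtable out of the boundary, and pairing create/destroy keeps each object on the heap that allocated it.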
The presence of virtual functions in your class means that your class is going to have a vtable, and different compilers implement vtables differently.
So, if you use classes with virtual methods across DLL calls where the compiler used on the other side is different from the compiler that you are using, the result is likely to be spectacular crashes.
In your case, the PrinterImpl created by the DLL will have a vtable constructed in a certain way, but the printer->print() call in your main() will attempt to interpret the vtable of IPrinter in a different way in order to resolve the print() method call.
Let's say there is a C++ functor:
class Dummy
{
public:
    int operator() (const int a, const int b)
    {
        return a + b;
    }
};
This functor doesn't use any function that can't execute on the GPU, but it can't be called from a CUDA kernel because there is no __device__ declaration in front of operator(). I would like to create a factory class that converts such functors into device-compatible functors that can be called within a CUDA kernel. For example:
Dummy d;
auto cuda_d = CudaFunctorFactory.get(d);
Can this be accomplished in any way? Feel free to add some constraints as long as it can be accomplished...
The one-word answer is no: this isn't possible.
There is no getting around the fact that in the CUDA compilation model, any method code contained in a class or structure which will execute on the GPU must be statically declared and defined at compile time. Somewhere in that code, there has to be a __device__ function available during compilation, otherwise the compilation fails. That is a completely non-negotiable cornerstone of CUDA as it exists today.
A factory design pattern can't sidestep that requirement. Further, I don't think it is possible to implement a factory for GPU instances in host code, because there still isn't any way of directly accessing __device__ function pointers from the host, and no way of directly instantiating a GPU class from the host, because the constructor must execute on the GPU. At the moment, the only program units which the host can run on the GPU are __global__ functions (i.e., kernels), and these cannot be contained within classes. In CUDA, GPU classes passed by argument must be concretely defined; virtual methods aren't supported (and there is no RTTI). That eliminates all the paths I can think of to implement a factory in CUDA C++ for the GPU.
In summary, I don't see any way to make magic that can convert host code to device code at runtime.
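What is possible is to do the conversion at compile time rather than at runtime, by writing (or wrapping) the functor with the appropriate decorators yourself. A sketch, not the runtime factory the question asks for:

struct CudaDummy   // same body as Dummy, but callable on both host and device
{
    __host__ __device__ int operator() (const int a, const int b) const
    {
        return a + b;
    }
};

template <typename F>
__global__ void apply(F f, const int *a, const int *b, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = f(a[i], b[i]);   // the functor executes within this thread
}

// launch: apply<<<blocks, threads>>>(CudaDummy(), d_a, d_b, d_out, n);

The functor is passed to the kernel by value, which works because it is trivially copyable; the decoration is fixed at compile time, which is exactly the constraint described above.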
I'm sorry for the bad title...
I would like to have a class with a static property value that I could use in device code. What I tried is the following:
struct MyConstValue
{
    static __constant__ int value;
};
In theory, now, I should define MyConstValue::value and initialize it, probably through cudaMemcpyToSymbol; then I could write a kernel that accesses this value through MyConstValue::value.
If I add
int __constant__ MyConstValue::value;
for the sake of defining the symbol (both with and without __constant__), nvcc outputs
error: ‘static’ may not be used when defining (as opposed to declaring) a static data member [-fpermissive]
Is there a way to implement my idea?
I'm using CUDA 5.5, I target compute capabilities > 2.0.
Thanks in advance.
There is no support for static class members in CUDA.
The reason might be that there is no defined point at which it would be initialized: would all threads do so, or just one, and if so, which thread? So static data just doesn't make sense in this context.
From the NVIDIA forum:
But what would a "static class member" idiom even mean on a GPU? It can't be the same as on the CPU, since there are so many new questions about its definition. Perhaps every thread has its own static member, even if that thread accesses multiple copies of the class? Every block has a single static member? Every kernel? Every device, since classes can live in memory beyond kernel invocations?
From B.2.2 of the CUDA programming guide:
The __constant__ qualifier, optionally used together with __device__, declares a variable that:
resides in constant memory space,
has the lifetime of an application,
is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol() for the runtime API and cuModuleGetGlobal() for the driver API).
You may take a look at this thread.
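The usual workaround is to move the __constant__ variable to namespace scope and let the class wrap access to it. A sketch (my naming, based on the question's code):

__constant__ int my_const_value;   // file-scope symbol instead of a static member

struct MyConstValue
{
    __device__ static int get() { return my_const_value; }
};

__global__ void kernel(int *out)
{
    out[threadIdx.x] = MyConstValue::get();
}

// host side, before launching:
// int v = 42;
// cudaMemcpyToSymbol(my_const_value, &v, sizeof(int));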
I am new to Cuda and I need to know its limits before running my C++ project via Cuda.
Suppose I have a C++ class called MyClass. Knowing that CUDA uses C99, is it possible to declare an object of type MyClass inside a kernel? Would the code snippet below be appropriate?
__global__ void SolveBlaBlaBLa(int x, ...)
{
    MyClass obj1;
    .
    .
    .
}
Thanks in Advance,
- Ruru
Just providing an answer to get this off the unanswered list. I think @JaredHoberock will not mind.
In general, CUDA supports a large subset of C++ functionality, including support for objects in device code.
Any code that executes on the device, however, must be properly decorated. For ordinary individual functions (not kernels), the decorator that the compiler recognizes to create a device callable version of the code is __device__. This applies to any object method which may be used on the device, including constructors and destructors.
You may also wish to familiarize yourself with other restrictions on C++ classes used in the device code, as documented in the programming guide.
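As a minimal sketch of what "properly decorated" means for the question's snippet (the members here are illustrative):

class MyClass
{
    int x;
public:
    __host__ __device__ MyClass() : x(0) {}          // constructor callable on the device
    __host__ __device__ void set(int v) { x = v; }
    __host__ __device__ int  get() const { return x; }
};

__global__ void SolveBlaBlaBLa(int x)
{
    MyClass obj1;      // lives in per-thread local storage
    obj1.set(x);
    // ...
}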
OK, here's what I want:
I have written several REALLY demanding functions (mostly operating on bitmaps, etc.) which have to be as fast as possible.
Now, let's also mention that these functions may also be grouped by type, or even by the type of variable on which they operate.
And the thing is: apart from the implementation of the algorithms themselves, what should I do, from a technical point of view, so as not to compromise speed?
And now, I'm considering the following scenarios :
Create them as simple functions and just pass the necessary parameters as arguments
Create a class (for 'grouping'/organisation purposes) and just declare them as static
Create a class per type, e.g. create a class for working on bitmaps, create a new instance of that class for every bitmap (e.g. Bitmap* myBitmap = new Bitmap(1010);), and operate on it with its member methods (e.g. myBitmap->getFirstBitSet())
Now, which of these approaches is the fastest? Is there really any difference between straight simple functions and Class-encapsulated static functions, performance-wise? Any other scenario that would be preferable, which I haven't mentioned?
Sidenote: I'm using the clang++ compiler, on Mac OS X 10.6.8 (if that makes any difference).
At the CPU level, there is only one kind of function, and it very much resembles the C kind. You could craft your own, but...
As it turns out, C++, being built with efficiency in mind, maps most functions directly to call instructions:
a namespace level function is like a regular C function
a static method is like a namespace level function (from a call point of view)
a non-static method is very similar to a static method, except an implicit this parameter is passed on top of the other parameters (one pointer)
All three have exactly the same kind of performance.
On the other hand, virtual methods have a slight overhead. A C++ technical report on performance estimated the overhead, compared to a non-virtual method, at between 10% and 15% (from memory) for empty functions. This means that for any function with meat inside (i.e., doing real work), the overhead itself is close to getting lost in the noise. The real cost comes from the inhibition of inlining, unless the virtual call can be deduced at compile time.
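To make the comparison concrete, here is a sketch of the call kinds discussed above (the names are illustrative):

namespace bitmaps { int firstBitSet(unsigned v); }   // namespace-level function: plain direct call

struct Bitmap
{
    static int firstBitSet(unsigned v);   // static method: the same direct call as above
    int firstBitSetHere() const;          // non-static method: direct call plus a hidden 'this' argument
    virtual int pixelCount() const;       // virtual method: indirect call through the vtable, inhibits inlining
};

The first three compile to the same kind of direct call instruction; only the virtual method goes through an indirection.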
There is absolutely no difference between classic old C functions and static methods of classes. The difference is only aesthetic. If you have multiple C functions that have certain relation between them, you can:
group them into a class;
place them into a namespace;
The difference will again be aesthetic. Most likely this will improve readability.
If these C functions share some static data, it would make sense (if possible) to define this data as private static data members of a class. In that case the class variant would be preferable to the namespace variant.
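For example, a sketch with made-up names:

#include <cstddef>

class BitmapUtils
{
    static std::size_t callCount;        // shared state, hidden from the outside
public:
    static int firstBitSet(unsigned v);  // can touch callCount; free functions in a namespace could not keep it private
};
std::size_t BitmapUtils::callCount = 0;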
I would discourage you from creating a dummy instance. This will be misleading to the reader of the source code.
Creating an instance for every bitmap is possible and can even be favorable. Especially if you call methods on this instance several times in a typical scenario.