I'm sorry for the bad title...
I would like to have a class with a static property value that I could use in device code. What I tried is the following:
struct MyConstValue
{
static __constant__ int value;
};
In theory, now, I should define MyConstValue::value, initialize it (probably through cudaMemcpyToSymbol), and then I could write a kernel that accesses this value through MyConstValue::value.
If I add
int __constant__ MyConstValue::value;
for the sake of defining the symbol (both with and without __constant__), nvcc outputs
error: 'static' may not be used when defining (as opposed to declaring) a static data member [-fpermissive]
Is there a way to implement my idea?
I'm using CUDA 5.5 and I target compute capabilities > 2.0.
Thanks in advance.
There is no support for static class members in CUDA.
The reason might be that there is no well-defined point at which such a member would be initialized: would all threads do so, or just one, and if so, which thread? So static data just doesn't make sense in this context.
From the NVIDIA forum:
But what would a "static class member" idiom even mean on a GPU? It
can't be the same as on the CPU, since there are so many new questions
about its definition. Perhaps every thread has its own static member,
even if that thread accesses multiple copies of the class? Every block
has a single static member? Every kernel? Every DEVICE, since classes
can live in memory beyond kernel invocations?
From Section B.2.2 of the CUDA programming guide:
The __constant__ qualifier, optionally used together with __device__,
declares a variable that:
Resides in constant memory space,
Has the lifetime of an application,
Is accessible from all the threads within the grid and from the host
through the runtime library (cudaGetSymbolAddress() /
cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()
for the runtime API and cuModuleGetGlobal() for the driver API).
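Given those constraints, a common workaround is to declare the __constant__ variable at namespace scope and expose it through a static device accessor. A minimal sketch (the symbol and accessor names are illustrative, not from the question):
__constant__ int myConstValue;   // file-scope symbol in constant memory

struct MyConstValue
{
    // Static accessor callable from device code; no static data member needed.
    __device__ static int get() { return myConstValue; }
};

// Host side, before any kernel launch:
// int v = 42;
// cudaMemcpyToSymbol(myConstValue, &v, sizeof(v));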
You may take a look at this thread.
I'm studying the class ExplicitInit in the objc runtime source code. I know that ExplicitInit is used in static objc::ExplicitInit<StripedMap<SideTable>> SideTablesMap;. But why is ExplicitInit necessary? StripedMap alone seems good enough to store SideTable.
I think the comment for class ExplicitInit in DenseMapExtras.h is the key to understanding why ExplicitInit is necessary, but I can't understand it because of my poor C++ knowledge.
The comment is shown below:
// We cannot use a C++ static initializer to initialize certain globals because
// libc calls us before our C++ initializers run. We also don't want a global
// pointer to some globals because of the extra indirection.
//
// ExplicitInit / LazyInit wrap doing it the hard way.
There are three sentences in the comment above, and I can't understand any of them. Can anybody help explain them?
And don't forget the first question: why is ExplicitInit necessary?
We cannot use a C++ static initializer to initialize certain globals because libc calls us before our C++ initializers run.
There is some deep dyld and Mach-O voodoo that I don't understand when it comes to the Objective-C runtime! This was a fun question to try to dig into.
The function _objc_init() (in objc-os.mm) performs:
Bootstrap initialization. Registers our image notifier with dyld.
Called by libSystem BEFORE library initialization time
_objc_init() calls runtime_init(), a portion of which is objc::allocatedClasses.init();. allocatedClasses is declared as static ExplicitInitDenseSet<Class> allocatedClasses; in objc-runtime-new.mm.
So it appears libSystem calls _objc_init(), which in turn calls runtime_init(), which in turn uses a C++ static variable that hasn't yet been constructed because the library itself hasn't been initialized, hence the need for ExplicitInit.
We also don't want a global pointer to some globals because of the extra indirection.
I don't know why the global-pointer design wasn't chosen. Perhaps avoiding the extra indirection helps cache locality?
ExplicitInit / LazyInit wrap doing it the hard way.
The init() methods in these classes use placement new to explicitly initialize the underlying type into the space reserved for it in _storage. This would normally happen automatically during static initialization if the type was instantiated directly. In this case space is reserved for the underlying object so it may be explicitly initialized by calling init() which constructs the object.
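A minimal sketch of that idiom (simplified; the real class lives in DenseMapExtras.h and differs in details):
#include <cstdint>
#include <new>
#include <utility>

template <typename Type>
class ExplicitInit {
    // Raw, suitably aligned byte storage: a trivial type like this needs
    // no C++ static initializer, so a global of this class is usable even
    // before static initialization runs, once init() has been called.
    alignas(Type) std::uint8_t _storage[sizeof(Type)];

public:
    template <typename... Ts>
    void init(Ts &&... args) {
        // Placement new constructs Type into the reserved storage at a
        // moment chosen explicitly by the caller (e.g. from _objc_init()).
        new (_storage) Type(std::forward<Ts>(args)...);
    }

    Type &get() {
        return *reinterpret_cast<Type *>(_storage);
    }
};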
In a class I want to have a constant array of constant C strings:
.cpp
const char* const Colors::Names[] = {
"red",
"green"
};
.h
class Colors {
public:
static const char* const Names[];
};
The array should be common to all instances of class Colors (even though I plan to have just one instance, that should not matter), hence the array is declared static.
The requirement is that if the class is not instantiated, the array should not consume any memory in the binary file.
However, with the above solution, it does:
.rodata._ZN6Colors5NamesE
0x00000000 0x8
I'm not sure about the C strings themselves, as I cannot find them in the map file, but I assume they consume memory as well.
I know that one solution would be to use constexpr and C++17, where a definition of static constexpr members outside the class is no longer needed.
However, for some reasons (i.e. higher compilation times in my build system and a slightly higher program memory footprint) I don't want to change the C++ standard version.
Another idea is to drop static (as I plan to have one instance anyway). However, the first issue with this solution is that I would have to specify the array size, which I would rather not do; otherwise I get:
error: flexible array member 'Colors::Names' in an otherwise empty 'class Colors'
The second issue is that the array is then placed in a RAM section (inside the class object), and only the C strings themselves are placed in FLASH memory.
Does anyone know of other solutions to this issue?
PS. My platform is an STM32 MCU and I'm using the GCC ARM compiler.
EDIT (to address some of the answers in comments)
As suggested in the comments, this can't be done with just static members.
So the question should probably actually be: how do I create a (non-static) class array member that is placed in read-only memory (not initialized at runtime), that occupies memory only if the class is actually used in the program, and that is preferably common to all instances of that class? The array itself is only used from within that class.
Some background info:
Let's say the array has a size of 256 and each C string is 40 chars. That's 1 kB for the array + 10 kB for the C strings (32-bit architecture). The class is part of a library that is used by different projects (programs). If the class is not used in a given project then I don't want it (and its array) to occupy even a single byte, because I need that FLASH space for other things; therefore compression is not an option.
If there are no other solutions then I will consider having the linker remove unused sections (although I was hoping for a simpler solution).
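For reference, that linker-based approach typically combines two standard GCC/ld options; a sketch of the relevant invocations (file names are illustrative):
arm-none-eabi-g++ -ffunction-sections -fdata-sections -c colors.cpp
arm-none-eabi-g++ -Wl,--gc-sections -o firmware.elf colors.o main.o
-ffunction-sections and -fdata-sections place each function and data object in its own section, and --gc-sections lets the linker discard any section that is never referenced.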
Thanks for all the suggestions.
Let's say there is a C++ functor:
class Dummy
{
public:
int operator() (const int a, const int b)
{
return a+b;
}
};
This functor doesn't use anything that can't execute on the GPU, but it can't be called from a CUDA kernel because there is no __device__ qualifier in front of operator(). I would like to create a factory class that converts such functors into device-compatible functors callable from within a CUDA kernel. For example:
Dummy d;
auto cuda_d = CudaFunctorFactory.get(d);
Can this be accomplished in any way? Feel free to add some constraints as long as it can be accomplished...
The one-word answer is no, this isn't possible.
There is no getting around the fact that in the CUDA compilation model, any method code contained in a class or structure that will execute on the GPU must be statically declared and defined at compile time. Somewhere in that code, there has to be a __device__ function available during compilation, otherwise compilation fails. That is a completely non-negotiable cornerstone of CUDA as it exists today.
A factory design pattern can't sidestep that requirement. Furthermore, I don't think it is possible to implement a factory for GPU instances in host code, because there still isn't any way of directly accessing __device__ function pointers from the host, and no way of directly instantiating a GPU class from the host, since the constructor would have to execute on the GPU. At the moment, the only program units the host can run on the GPU are __global__ functions (i.e. kernels), and these cannot be contained within classes. In CUDA, GPU classes passed by argument must be concretely defined; virtual methods aren't supported (and there is no RTTI). That eliminates all the paths I can think of for implementing a factory in CUDA C++ for the GPU.
In summary, I don't see any way to make magic that can convert host code to device code at runtime.
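For contrast, the compile-time alternative of annotating the functor yourself looks like this; a minimal sketch (apply and its parameters are illustrative, not from the question):
struct Dummy
{
    // __host__ __device__ makes the call operator usable in kernels too.
    __host__ __device__ int operator() (const int a, const int b) const
    {
        return a + b;
    }
};

// A trivial kernel template that accepts any such functor by value.
template <typename F>
__global__ void apply(F f, const int *a, const int *b, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = f(a[i], b[i]);
}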
I am trying to enhance a small C++ project with CUDA.
My project uses a custom library's classes and functions, for example Matrix3d, Vector3d, Plane2d, etc. They are mostly geometric objects.
When I try to use my code on the device (either in __host__ __device__ functions or in a kernel), all the library functions/objects are treated as host code and I get multiple warnings and errors, for example error: identifier "Plane3d::~Plane3d" is undefined in device code
Is there a way to use my library on the device as well? How is it done?
I don't have experience with CUDA and C++ (I have only used CUDA with simple C code without classes), so I don't grasp the strategy very well.
Is there a method that avoids changing the library's source code? It is possible to change it, but it would be better if I could avoid that.
Thanks a lot.
There is no particular problem with using C++ classes in CUDA. The object model is only slightly different to standard C++.
Any structure or class data members are automatically defined in whichever memory space (host or device) the class or structure is instantiated in. What is not automatic is code generation for member functions and operators within classes and structures. The programmer must explicitly define and compile those for whichever memory space the object will be instantiated in. This latter requirement means you must have both __device__ and __host__ definitions of each function you call within the object. This includes the constructor and destructor, the latter being the error you show in your question.
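As an illustration of the pattern (a sketch; Vec3 stands in for the library's geometric types):
struct Vec3
{
    double x, y, z;

    // Constructor and destructor defined for both memory spaces.
    __host__ __device__ Vec3(double x_ = 0.0, double y_ = 0.0, double z_ = 0.0)
        : x(x_), y(y_), z(z_) {}
    __host__ __device__ ~Vec3() {}

    // Any member function called in a kernel likewise needs both qualifiers.
    __host__ __device__ double dot(const Vec3 &o) const
    {
        return x * o.x + y * o.y + z * o.z;
    }
};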
You don't need to change the source code - what you need is to write an adapter.
CUDA kernels work with low-level structures, e.g. double*, double**, double*** or float*, float**, float***, as well as with the built-in CUDA types.
CUDA cannot work directly on memory allocated outside CUDA anyway (only on memory allocated on the graphics card, not regular RAM), so you will have to copy your data into graphics memory.
If you provide methods that give access to the buffers used by your types, you can copy them into graphics memory (contiguously if your types have contiguous memory, or in chunks if not) using the CUDA memory copy functions, and then process them with kernels as double*** using simple indexing.
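A minimal sketch of such an adapter (MatrixType, data(), and size() are illustrative assumptions about the library's interface):
template <typename MatrixType>
void withDeviceCopy(MatrixType &m)
{
    // Copy the host object's buffer to the device, work on it, copy back.
    double *d_buf = nullptr;
    size_t bytes = m.size() * sizeof(double);
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, m.data(), bytes, cudaMemcpyHostToDevice);
    // ... launch kernels that index d_buf here ...
    cudaMemcpy(m.data(), d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}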
I am using CUDA 5.0 and a Compute Capability 2.1 card.
The question is quite straightforward: Can a kernel be part of a class?
For example:
class Foo
{
private:
//...
public:
__global__ void kernel();
};
__global__ void Foo::kernel()
{
//implementation here
}
If not, is the solution to make a wrapper function that is a member of the class and calls the kernel internally?
And if yes, will it have access to the private attributes like a normal member function?
(I'm not just trying it to see what happens because my project has several other errors right now, and I also think it's a good reference question. It was difficult for me to find references on using CUDA with C++: basic functionality examples can be found, but not strategies for structured code.)
Let me leave CUDA dynamic parallelism out of the discussion for the moment (i.e. assume compute capability 3.0 or prior).
Remember that __global__ is used for CUDA functions that will (only) be called from the host (but execute on the device). If you instantiate this object on the device, it won't work. Furthermore, for device-accessible private data to be available to the member function, the object would have to be instantiated on the device.
So you could have a kernel invocation (i.e. mykernel<<<blocks,threads>>>(...);) embedded in a host object member function, but the kernel definition (i.e. the function definition with the __global__ decorator) would normally precede the object definition in your source code. And as stated already, such a methodology could not be used for an object instantiated on the device. It would also not have access to ordinary private data defined elsewhere in the object. (It may be possible to come up with a scheme for a host-only object that creates device data, using pointers to global memory, that would then be accessible on the device, but such a scheme seems quite convoluted to me at first glance.)
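A sketch of that wrapper pattern (names are illustrative):
__global__ void fooKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

class Foo
{
private:
    int *d_data;   // device pointer owned by the host-side object
    int n;
public:
    void runKernel()
    {
        // An ordinary host member function may read private members
        // and launch the kernel defined above.
        fooKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    }
};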
Normally, device-usable member functions would be preceded by the __device__ decorator. In this case, all the code in the device member function executes within the thread that called it.
This question gives an example (in my edited answer) of a C++ object with a member function callable from both the host and the device, with appropriate data copying between host and device objects.