Templated CUDA kernel with dynamic shared memory


I want to call different instantiations of a templated CUDA kernel with dynamically allocated shared memory in one program. My first naive approach was to write:
template<typename T>
__global__ void kernel(T* ptr)
{
extern __shared__ T smem[];
// calculations here ...
}
template<typename T>
void call_kernel( T* ptr, const int n )
{
dim3 dimBlock(n), dimGrid;
kernel<<<dimGrid, dimBlock, n*sizeof(T)>>>(ptr);
}
int main(int argc, char *argv[])
{
const int n = 32;
float *float_ptr;
double *double_ptr;
cudaMalloc( (void**)&float_ptr, n*sizeof(float) );
cudaMalloc( (void**)&double_ptr, n*sizeof(double) );
call_kernel( float_ptr, n );
call_kernel( double_ptr, n ); // problem, 2nd instantiation
cudaFree( (void*)float_ptr );
cudaFree( (void*)double_ptr );
return 0;
}
However, this code cannot be compiled. nvcc gives me the following error message:
main.cu(4): error: declaration is incompatible with previous "smem"
(4): here
detected during:
instantiation of "void kernel(T *) [with T=double]"
(12): here
instantiation of "void call_kernel(T *, int) [with T=double]"
(24): here
I understand that I am running into a name conflict because the shared memory is declared as extern. Nevertheless, as far as I know, there is no way around that if I want to define its size at runtime.
So, my question is: Is there an elegant way to obtain the desired behavior? By elegant I mean without code duplication etc.

Dynamically allocated shared memory is really just a size (in bytes) and a pointer being set up for the kernel. So something like this should work:
replace this:
extern __shared__ T smem[];
with this:
extern __shared__ __align__(sizeof(T)) unsigned char my_smem[];
T *smem = reinterpret_cast<T *>(my_smem);
You can see other examples of re-casting of dynamically allocated shared memory pointers in the programming guide which can serve other needs.
EDIT: updated my answer to reflect the comment by @njuffa.

(A variation on @RobertCrovella's answer)
NVCC is not willing to accept two extern __shared__ arrays of the same name but different types - even if they're never in each other's scope. We'll need to satisfy NVCC by having our template instances all use the same type for the shared memory under the hood, while letting the kernel code using them see the type it likes.
So we replace this instruction:
extern __shared__ T smem[];
with this one:
auto smem = shared_memory_proxy<T>();
where:
template <typename T>
__device__ T* shared_memory_proxy()
{
// do we need an __align__() here? I don't think so...
extern __shared__ unsigned char memory[];
return reinterpret_cast<T*>(memory);
}
is in some device-side code include file.
Advantages:
One-liner at the site of use.
Simpler syntax to remember.
Separation of concerns - whoever reads the kernel doesn't have to think about why s/he's seeing extern, or alignment specifiers, or a reinterpret cast etc.
Notes:
This is implemented as part of my CUDA kernel author's tools header-only library: shared_memory.cuh (where it's named shared_memory::dynamic::proxy() ).
I have not explored the question of alignment, when you use both dynamic and static shared memory.

Related

Dynamic type cast to the variable type

Suppose, I have a class, like:
struct A
{
uint8_t f1;
int16_t f2;
};
And I need to set its members' values from a memory buffer, like:
uint8_t * memory=device.getBufferedDataFromDevice();
A a;
a.f1=*((uint8_t*)&memory[someAddress]);
a.f2=*((int16_t*)&memory[someOtherAddress]);
But I'd like to make it more flexible, and avoid the explicit type cast, to have a possibility to change the type in the declaration without changing the rest of the code. Of course, I could achieve it with something like:
memcpy((void*)&a.f1, (void*)&memory[someAddress], sizeof(A::f1));
But I'd also like to avoid calling a function for simple types such as 1-4 byte integers (which is what I have), since a simple assignment can be compiled to a single CPU instruction. Please advise: what is the C++ way to implement this?
Thank you!

memcpy is fully understood by every modern C++ compiler, and there is not going to be an actual function call unless you take its address, store that in a pointer, then confuse the compiler enough that it no longer knows the pointer points at memcpy.
Or, you know, turn off optimizations.
memcpy((void*)&a.f1, (void*)&memory[someAddress], sizeof(A::f1));
there is neither reason to cast to void*, nor use dangerous C-style casts, here.
std::memcpy(&a.f1, &memory[someAddress], sizeof(a.f1));
this is a standards-compliant way to move memory that represents data of the same type as a.f1 over a.f1, assuming a.f1 is trivially copyable. (Note I used the same token sequence -- a.f1 -- for both the written-to stuff and the size.)
The compiler will optimize this into appropriate assembly, and there will be no function-call overhead.
Live example, you can see the generated assembly.
Now, you may object "but there is no guarantee!".
The C++ standard does not include a guarantee that a+b won't be implemented as a loop int r = 0; for (int i = 0; i < a; ++i){++r;} for (int i = 0; i < b; ++i){++r;}.
You cannot presume your C++ compiler is hostile.
Existing C++ compilers optimize calls to memcpy. Writing code assuming it won't happen is a waste of time.
You can also write a slightly safer memcpy
template<class Dest>
void memcpyT( Dest* dest, void const* src ) {
static_assert( std::is_trivially_copyable_v<Dest> );
memcpy( dest, src, sizeof(Dest) );
}
which I included as an alternative in the above live example.
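Applied to the struct from the question, the idea could look roughly like the sketch below. The names load_field and parse_a and the byte offsets 0 and 1 are made up for illustration; they are not part of the question or of any library. Because the size is deduced from the destination member, changing the type of f1 or f2 in the declaration does not require touching the copying code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

struct A {
    std::uint8_t f1;
    std::int16_t f2;
};

// Hypothetical helper: copies sizeof(Dest) bytes from buffer + offset into dest.
// The size follows the destination type, so no explicit cast is needed.
template <class Dest>
void load_field(Dest& dest, const std::uint8_t* buffer, std::size_t offset) {
    static_assert(std::is_trivially_copyable<Dest>::value,
                  "Dest must be trivially copyable");
    assert(buffer != nullptr);
    std::memcpy(&dest, buffer + offset, sizeof(Dest));
}

// Fills an A from raw bytes. The offsets 0 and 1 are assumptions for this sketch.
inline A parse_a(const std::uint8_t* buffer) {
    A a{};
    load_field(a.f1, buffer, 0);
    load_field(a.f2, buffer, 1);  // unaligned source is fine for memcpy
    return a;
}
```

With optimizations enabled, mainstream compilers typically turn each load_field call into a plain load and store, as described above.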

You can write code similar to this:
template<typename T>
void mymemcopy(T* a, void* b) {
memcpy((void*)a, b, sizeof(T));
}
template<typename T>
constexpr void mymemcopy(T** a, void* b) {
*a = static_cast<T*>(b);
}
constexpr void mymemcopy(int* a, void* b) {
*a = *(int*)b;
}
constexpr void mymemcopy(unsigned char* a, void* b) {
*a = *(unsigned char*)b;
}
int main()
{
int a, b =10;
mymemcopy(&a, &b);
double a1, b1 =10;
mymemcopy(&a1, &b1);
unsigned char a2, b2 =10;
mymemcopy(&a2, &b2);
unsigned char *a3, *b3 =nullptr;
mymemcopy(&a3, &b3);
}
I suspect your use case is embedded programming, where I'm not an expert. I know that in embedded programming you need to keep both memory usage and code size down, but what you are asking for will obviously increase code size.

How can I indicate to the compiler that a pointer parameter is aligned?

I'm writing the spectacular function:
void foo(void* a) {
if (check_something_at_runtime_only()) {
int* as_ints { static_cast<int*>(a) };
// do things with as_ints
}
else {
char* as_chars { static_cast<char*>(a) };
// do things with as_chars
}
}
Suppose we know that some work with as_ints would benefit from it being better-aligned; e.g. if the memory transaction size on my platform is N bytes, then I can read the first N/sizeof(int) elements with a single machine instruction (ignoring SIMD/vectorization here) - provided a is N-byte-aligned.
Now, I could indicate alignment by having foo always take an int * - at least on platforms for which larger types can only be read from aligned addresses - but I would rather keep the type void *, since it doesn't have to be an array of ints, really.
I would have liked to be able to write something like
void foo(alignas(sizeof(int)) void* a) { ... }
but, apparently, alignas doesn't apply to pointers, so I can't.
Is there another way to guarantee to the compiler that the argument address will be aligned?
Notes:
I'm interested both in what the C++ standard (any version) allows, and in compiler-specific extensions in GCC, clang and NVCC (the CUDA compiler).

In C++20 you can use std::assume_aligned:
#include <memory>
int *as_ints = std::assume_aligned<sizeof(int)>(static_cast<int *>(a));

In GCC/Clang you can do
int *as_ints = static_cast<int *>(__builtin_assume_aligned(a, sizeof(int)));
or, if a is a function parameter, just mark it directly with __attribute__((aligned(4))).
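A minimal sketch of the builtin in a complete function, assuming GCC or Clang. The function name sum_ints and the 16-byte figure kAlignment are assumptions chosen for illustration. The builtin is a promise, not a check, so the sketch verifies it with an assert in debug builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed alignment for this sketch; pick whatever your platform benefits from.
constexpr std::size_t kAlignment = 16;

// Hypothetical function: sums `count` ints starting at `a`.
// Relies on the GCC/Clang builtin __builtin_assume_aligned.
inline long sum_ints(void* a, std::size_t count) {
    // Breaking the promise is undefined behavior, so check it while debugging.
    assert(reinterpret_cast<std::uintptr_t>(a) % kAlignment == 0);

    // The builtin returns void*, so C++ needs an explicit cast.
    int* as_ints = static_cast<int*>(__builtin_assume_aligned(a, kAlignment));

    long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += as_ints[i];
    }
    return total;
}
```

The caller is responsible for passing storage that really is aligned, for example an array declared with alignas(16).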

Warning while allocating CUDA device memory using C++ templates

I have declared the following template to make code shorter:
template <typename T>
void allocateGPUSpace(T* ptr, int size){
cudaMalloc((void**)&ptr,size * sizeof(T));
}
Moreover, I use the template as follows:
float* alphaWiMinusOne;
allocateGPUSpace<float>( alphaWiMinusOne,numUnigrams);
However, when I compile the code, VS 2008 gives the warning
warning: variable "alphaWiMinusOne" is used before its value is set
and
uninitialized local variable 'alphaWiMinusOne' used
Does cuda not understand templates in C++? Gosh, that will be a MUST do for nvidia

Firstly, that warning doesn't come from CUDA, it comes from the host compiler (so Microsoft's C++ compiler or GCC depending on your platform), and it is a perfectly valid warning. You have made the same mistake you made here, and this code won't work as you are hoping, because you are passing the pointer to operate on by value, not by reference. Your code should be like this:
template <typename T>
void allocateGPUSpace(T ** ptr, int size){
cudaMalloc((void**)ptr, size * sizeof(T));
}
and the call like this:
float * alphaWiMinusOne;
allocateGPUSpace<float>(&alphaWiMinusOne, numUnigrams);
or perhaps
template <typename T>
T * allocateGPUSpace(int size){
T * ptr;
cudaMalloc((void**)&ptr, size * sizeof(T));
return ptr;
}
and then
float * alphaWiMinusOne = allocateGPUSpace<float>(numUnigrams);
Using either will eliminate the compiler warnings and the code will work. As a note on style, it would be a rather short-sighted helper function design that didn't include any error checking.

kernel function parameter as const

say I have a kernel
foo(int a, int b)
{
__shared__ int array[a];
}
It seems a has to be a constant value, so I added const in front of int. It still didn't work.
Any idea?
foo(const int a, const int b)
{
__shared__ int array[a];
}

While you can't have a dynamically-sized array because of the constraints of the C language (as mentioned in other answers), what you can do in CUDA is something like this:
extern __shared__ float fshared[];
__global__ void testShmem( float * result, unsigned int shmemSize ) {
// use fshared - shmemSize tells you how many bytes
// Note that the following is not a sensible use of shared memory!
for( int i = 0; i < shmemSize/sizeof(float); ++i ) {
fshared[i] = 0;
}
}
providing you tell CUDA how much shared memory you want during kernel invocation, like so:
testShmem<<<grid, block, 1024>>>( pdata, 1024 );

In ISO C++ the size of an array needs to be a so-called constant expression. This is stronger than a const-qualified variable. It basically means compile-time constant. So, the value has to be known at compile-time.
In ISO C90 this was also the case. C99 added VLAs, variable-length-arrays, that allow the size to be determined at runtime. The sizeof operator for these VLAs becomes a runtime operator.
I'm not familiar with CUDA or the __shared__ syntax. It's not clear to me why/how you use the term kernel. But I guess the rules are similar w.r.t. constant expressions and arrays.
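To make the distinction concrete, here is a small host-side C++ sketch (it does not use CUDA). The names kCount, fixed_buffer_bytes and runtime_bytes are made up for illustration. A constexpr value can size an array; a const function parameter cannot, because its value is only known when the function is called.

```cpp
#include <cassert>
#include <cstddef>

// A constant expression: the value is known at compile time.
constexpr std::size_t kCount = 8;

// Legal: the array bound is a constant expression.
inline std::size_t fixed_buffer_bytes() {
    int buffer[kCount] = {};
    return sizeof(buffer);  // evaluated at compile time
}

// 'a' is const-qualified but still a runtime value.
inline std::size_t runtime_bytes(const int a) {
    assert(a >= 0);
    // int buffer[a];  // ill-formed in ISO C++: 'a' is not a constant expression
    return static_cast<std::size_t>(a) * sizeof(int);
}
```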

I don't think CUDA or OpenCL let you dynamically allocate shared memory. Use #define macro instead.
If you need a dynamic sized array on a per program basis, you can supply it using -D MYMACRO (with OpenCL, I don't know for CUDA). See Bahbar's answer.

Here's how you can statically allocate a __shared__ array of n values in CUDA using C++ templates
template <int n>
__global__ void kernel(...)
{
__shared__ int array[n];
}
const int n = 128;
kernel<n><<<grid_size,block_size>>>(...);
Note that n must be a known constant at compile time for this to work. If n is not known at compile time then you must use the approach Edric suggests.

I suspect this is a C language question.
If it were C++, you could simply use std::vector.
void foo( int a, int b )
{
std::vector<int> array( a );
// ...
}
If it really is C++, then what C++ features you can use safely may depend on the environment. It's not clear what you mean by "kernel".

Calculating size of an array

I am using the following macro for calculating size of an array:
#define G_N_ELEMENTS(arr) ((sizeof(arr))/(sizeof(arr[0])))
However I see a discrepancy in the value computed by it when I evaluate the size of an array in a function (incorrect value computed) as opposed to where the function is called (correct value computed). Code + output below. Any thoughts, suggestions, tips et al. welcome.
DP
#include <stdio.h>
#define G_N_ELEMENTS(arr) ((sizeof(arr))/(sizeof(arr[0])))
void foo(int * arr) // Also tried foo(int arr[]), foo(int * & arr)
// - neither of which worked
{
printf("arr : %x\n", arr);
printf ("sizeof arr: %d\n", G_N_ELEMENTS(arr));
}
int main()
{
int arr[] = {1, 2, 3, 4};
printf("arr : %x\n", arr);
printf ("sizeof arr: %d\n", G_N_ELEMENTS(arr));
foo(arr);
}
Output:
arr : bffffa40
sizeof arr: 4
arr : bffffa40
sizeof arr: 1

That's because the size of an int * is the size of an int pointer (4 or 8 bytes on modern platforms that I use but it depends entirely on the platform). The sizeof is calculated at compile time, not run time, so even sizeof (arr[]) won't help because you may call the foo() function at runtime with many different-sized arrays.
The size of an int array is the size of an int array.
This is one of the tricky bits in C/C++: arrays and pointers are not always interchangeable. Arrays will, under a great many circumstances, decay to a pointer to the first element of that array.
There are at least two solutions, compatible with both C and C++:
pass the length in with the array (not that useful if the intent of the function is to actually work out the array size).
pass a sentinel value marking the end of the data, e.g., {1,2,3,4,-1}.
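Both options can be sketched as follows. This is written as C++ for consistency with the rest of the page, though the same idea carries over to C. The function names and the choice of -1 as the sentinel are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Option 1: the caller passes the length along with the pointer.
inline long sum_with_length(const int* arr, std::size_t length) {
    assert(arr != nullptr || length == 0);
    long total = 0;
    for (std::size_t i = 0; i < length; ++i) {
        total += arr[i];
    }
    return total;
}

// Option 2: the data ends with a sentinel value that never occurs in real data.
// Returns the number of elements before the sentinel.
inline std::size_t count_until_sentinel(const int* arr, int sentinel = -1) {
    assert(arr != nullptr);
    std::size_t n = 0;
    while (arr[n] != sentinel) {
        ++n;
    }
    return n;
}
```

The sentinel approach only works when some value can be reserved as the end marker, and a missing sentinel makes the loop read past the end of the array.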

This isn't working because sizeof is calculated at compile-time. The function has no information about the size of its parameter (it only knows that it points to a memory address).
Consider using an STL vector instead, or passing in array sizes as parameters to functions.

In C++, you can define G_N_ELEMENTS like this :
template<typename T, size_t N>
size_t G_N_ELEMENTS( T (&array)[N] )
{
return N;
}
If you wish to use array size at compile time, here's how :
// ArraySize
template<typename T>
struct ArraySize;
template<typename T, size_t N>
struct ArraySize<T[N]>
{
enum{ value = N };
};
Thanks j_random_hacker for correcting my mistakes and providing additional information.
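A short usage sketch of both forms, assuming a C++11 compiler for decltype and static_assert. The function is renamed element_count here to avoid clashing with the macro from the question; samples, mirror and last_sample are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Compile-time form, as in the answer above.
template <typename T>
struct ArraySize;

template <typename T, std::size_t N>
struct ArraySize<T[N]> {
    enum { value = N };
};

// Function form; only accepts real arrays, never pointers.
template <typename T, std::size_t N>
std::size_t element_count(T (&)[N]) {
    return N;
}

static int samples[] = {1, 2, 3, 4};

// The struct form can be used wherever a constant expression is required.
static_assert(ArraySize<decltype(samples)>::value == 4, "samples has 4 elements");
static int mirror[ArraySize<decltype(samples)>::value];

inline int last_sample() {
    assert(element_count(samples) ==
           static_cast<std::size_t>(ArraySize<decltype(samples)>::value));
    return samples[element_count(samples) - 1];
}
```

Passing an int* to element_count fails to compile, which is the safety the macro lacks.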

Note that even if you try to tell the C compiler the size of the array in the function, it doesn't take the hint (my DIM is equivalent to your G_N_ELEMENTS):
#include <stdio.h>
#define DIM(x) (sizeof(x)/sizeof(*(x)))
static void function(int array1[], int array2[4])
{
printf("array1: size = %u\n", (unsigned)DIM(array1));
printf("array2: size = %u\n", (unsigned)DIM(array2));
}
int main(void)
{
int a1[40];
int a2[4];
function(a1, a2);
return(0);
}
This prints:
array1: size = 1
array2: size = 1
If you want to know how big the array is inside a function, pass the size to the function. Or, in C++, use things like STL vector<int>.

Edit: C++11 was introduced since this answer was written, and it includes functions to do exactly what I show below: std::begin and std::end. Const versions std::cbegin and std::cend are also going into a future version of the standard (C++14?) and may be in your compiler already. Don't even consider using my functions below if you have access to the standard functions.
I'd like to build a little on Benoît's answer.
Rather than passing just the starting address of the array as a pointer, or a pointer plus the size as others have suggested, take a cue from the standard library and pass two pointers to the beginning and end of the array. Not only does this make your code more like modern C++, but you can use any of the standard library algorithms on your array!
template<typename T, int N>
T * BEGIN(T (& array)[N])
{
return &array[0];
}
template<typename T, int N>
T * END(T (& array)[N])
{
return &array[N];
}
template<typename T, int N>
const T * BEGIN_CONST(const T (& array)[N])
{
return &array[0];
}
template<typename T, int N>
const T * END_CONST(const T (& array)[N])
{
return &array[N];
}
void
foo(int * begin, int * end)
{
printf("arr : %x\n", begin);
printf ("sizeof arr: %d\n", end - begin);
}
int
main()
{
int arr[] = {1, 2, 3, 4};
printf("arr : %x\n", arr);
printf ("sizeof arr: %d\n", END(arr) - BEGIN(arr));
foo(BEGIN(arr), END(arr));
}
Here's an alternate definition for BEGIN and END, if the templates don't work.
#define BEGIN(array) array
#define END(array) (array + sizeof(array)/sizeof(array[0]))
Update: The above code with the templates works in MS VC++2005 and GCC 3.4.6, as it should. I need to get a new compiler.
I'm also rethinking the naming convention used here - template functions masquerading as macros just feels wrong. I'm sure I will use this in my own code sometime soon, and I think I'll use ArrayBegin, ArrayEnd, ArrayConstBegin, and ArrayConstEnd.

If you change the foo function a little it might make you feel a little more comfortable:
void foo(int * pointertofoo)
{
printf("pointertofoo : %x\n", pointertofoo);
printf ("sizeof pointertofoo: %d\n", G_N_ELEMENTS(pointertofoo));
}
That is what the compiler sees: a pointer, in a context completely different from that of the calling function.

foo(int * arr) //Also tried foo(int arr[]), foo(int * & arr)
{ // - neither of which worked
printf("arr : %x\n", arr);
printf ("sizeof arr: %d\n", G_N_ELEMENTS(arr));
}
sizeof(arr) is sizeof(int*), i.e. 4
Unless you have a very good reason for writing code like this, DON'T. We're in the 21st century now, use std::vector instead.
For more info, see the C++ FAQ: http://www.parashift.com/c++-faq-lite/containers.html
Remember: "Arrays are evil"

You should only call sizeof on the array. When you call sizeof on the pointer type the size will always be 4 (or 8, or whatever your system does).
MSFT's Hungarian notation may be ugly, but if you use it, you know not to call your macro on anything that starts with a 'p'.
Also check out the definition of the ARRAYSIZE() macro in WinNT.h. If you're using C++ you can do strange things with templates to get compile-time asserts if you do it that way.

Now that we have constexpr in C++11, the type safe (non-macro) version can also be used in a constant expression.
template<typename T, std::size_t size>
constexpr std::size_t array_size(T const (&)[size]) { return size; }
This will fail to compile where it does not work properly, unlike your macro solution (it won't work on pointers by accident). You can use it where a compile-time constant is required:
int new_array[array_size(some_other_array)];
That being said, you are better off using std::array for this if possible. Pay no attention to the people who say to use std::vector because it is better. std::vector is a different data structure with different strengths. std::array has no overhead compared to a C-style array, but unlike the C-style array it will not decay to a pointer at the slightest provocation. std::vector, on the other hand, requires all accesses to be indirect accesses (go through a pointer) and using it requires dynamic allocation. One thing to keep in mind if you are used to using C-style arrays is to be sure to pass std::array to a function like this:
void f(std::array<int, 100> const & array);
If you do not pass by reference, the data is copied. This follows the behavior of most well-designed types, but is different from C-style arrays when passed to a function (it's more like the behavior of a C-style array inside of a struct).
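The difference between the two ways of passing can be sketched like this. The function names and the array size of 4 are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Read-only access without copying: pass by const reference.
inline int sum_by_reference(std::array<int, 4> const& values) {
    int total = 0;
    for (int v : values) {
        total += v;
    }
    return total;
}

// Pass by value: the function works on its own copy of all four elements,
// so the caller's array is left untouched.
inline void zero_first_by_value(std::array<int, 4> values) {
    values[0] = 0;
}

// Pass by non-const reference: the caller's array is modified.
inline void zero_first_by_reference(std::array<int, 4>& values) {
    assert(!values.empty());
    values[0] = 0;
}
```

A C-style array parameter would behave like the last function, since it decays to a pointer to the caller's storage.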
