How to pass a function as a CUDA kernel parameter? - c++

I would like to create a generic CUDA kernel that takes a callable object as a parameter (like a lambda or a function) and invokes it.
I am having trouble passing a device function to a CUDA kernel as a parameter.
I have cuda 9.2 with compute capability 3.5. I use gcc 9.3 on Debian 10.
I tried this, compiled with nvcc -arch=sm_35 --expt-extended-lambda main.cu -o test:
#include <cstdio>

__host__ __device__ void say_hello()
{
    printf("Hello World from function!\n");
}

template<class Function>
__global__ void generic_kernel(Function f)
{
    f();
}

int main()
{
    // this is working
    generic_kernel<<<1,1>>>([] __device__ (){ printf("Hello World from lambda!\n"); });
    cudaDeviceSynchronize();

    // this is not working!
    generic_kernel<<<1,1>>>(say_hello);
    cudaDeviceSynchronize();

    return 0;
}
I expected to see both Hello World from function! and Hello World from lambda! but I only see the message from the lambda.

Debian is not a supported environment for any version of CUDA, and gcc 9.3 is not a supported compiler for CUDA 9.2.
There are quite a few questions covering these topics here on the cuda tag. This answer links to a number of them.
The short version is that it is fundamentally impossible to capture a __device__ function address in host code. A kernel launch (as you have it here) is written in host code; it is host code. Therefore the use of say_hello there is in host code, and it will refer to the __host__ function pointer/address. That function pointer/address is useless in device code. (Removing the __host__ decorator will not help.)
There are a number of possible solutions, one of which you've already explored: pass the function wrapped in an object of some sort. The __device__ lambda, used directly as you have, fits that description.
Another possible fix for the non-working function pointer approach is to capture the function pointer in device code. It then has to be copied to the host, where it can be passed back through a kernel launch to device code and dispatched there. The linked answer above gives a number of ways this can be accomplished.
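As a rough sketch of that approach, reusing the say_hello and generic_kernel definitions from the question (the pointer type alias and symbol name here are illustrative):

typedef void (*func_t)();

// Take the address of say_hello in device code, where it yields the
// __device__ function's address rather than the host one.
__device__ func_t d_say_hello = say_hello;

int main()
{
    // Copy the device-side function pointer back to the host...
    func_t h_ptr;
    cudaMemcpyFromSymbol(&h_ptr, d_say_hello, sizeof(func_t));

    // ...then pass it through the kernel launch; it is only dereferenced in device code.
    generic_kernel<<<1,1>>>(h_ptr);
    cudaDeviceSynchronize();
    return 0;
}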

Related

TensorFlow CPU and CUDA code sharing

I am writing an Op in C++ and CUDA for TensorFlow that has shared custom function code. Usually, when sharing code between CPU and CUDA implementations, one would define a macro that inserts the __device__ specifier into the function signature when compiling for CUDA. Is there a built-in way to share code in this manner in TensorFlow?
How does one define utility functions (usually inlined) that can run on both the CPU and GPU?
It turns out that the following macros in TensorFlow will do what I describe.
namespace tensorflow {
    EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
    void foo() {
        //
    }
}
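Those Eigen macros expand (roughly) to the usual __host__ __device__ annotations under nvcc. For reference, a hand-rolled equivalent of the macro approach the question describes might look like this (the macro and function names are purely illustrative):

#ifdef __CUDACC__
#define HOSTDEV __host__ __device__
#else
#define HOSTDEV
#endif

// Compiled as plain C++ for the CPU, and as __host__ __device__ when nvcc compiles it
HOSTDEV inline float clamp01(float x) {
    return x < 0.f ? 0.f : (x > 1.f ? 1.f : x);
}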

Is it possible to run a piece of pure C++ code on the GPU

I don't know OpenCL very well, but I know the C/C++ API requires the programmer to provide OpenCL code as a string. However, I recently discovered the ArrayFire library, which doesn't require string code to invoke some calculations. I wondered how that works (it is open source, but the code is a bit confusing). Would it be possible to write a parallel for with an OpenCL backend that invokes any piece of compiled (x86, for example) code, like the following:
template <typename F>
void parallel_for(int starts, int ends, F task) // API
{ /* some OpenCL magic */ }

// ...
parallel_for(0, 255, [&tab](int i){ tab[i] *= 0.7; }); // usage
PS: I know I am 99% too optimistic.
You cannot really call C++ host code from the device using standard OpenCL.
You can, however, use SYCL, the Khronos standard for single-source C++ programming. SYCL allows you to compile C++ directly into device code without requiring the OpenCL strings. You can call any C++ function from inside a SYCL kernel (as long as the source code is available). SYCL.tech has more links and updated information.
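A minimal sketch of what the questioner's parallel_for could look like in single-source SYCL (header path and buffer/accessor usage follow SYCL 2020; older implementations use <CL/sycl.hpp>):

#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> tab(256, 1.0f);
    sycl::queue q;
    {
        sycl::buffer<float, 1> buf(tab.data(), sycl::range<1>(tab.size()));
        q.submit([&](sycl::handler& h) {
            auto acc = buf.get_access<sycl::access::mode::read_write>(h);
            h.parallel_for(sycl::range<1>(tab.size()), [=](sycl::id<1> i) {
                acc[i] *= 0.7f;  // ordinary C++ inside the kernel
            });
        });
    }   // the buffer's destructor waits and copies the results back into tab
    return 0;
}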

Is it possible to invoke a kernel function within another kernel function in CUDA?

Is it possible to invoke a __global__ function from within another __global__ function (i.e., a kernel) in CUDA?
For example:
__global__ void func()
{
    // ...
}

__global__ void foo()
{
    // ...
    func(); // a call to "func", which is itself defined as a kernel
}

int main(void)
{
    // ...
    func<<<1, 1>>>();
    foo<<<1, 1>>>();
}
And is it possible to use any function from the Thrust library in a __global__ function?
Compute capability 3.5 and newer hardware supports what is called dynamic parallelism, which gives it the ability to launch kernels from running kernels on the GPU without requiring any host API calls.
Older hardware supports functions which can be called from kernels (these are denoted __device__ instead of __global__) and are executed at thread scope only, so no new kernel is launched.
Since the release of Thrust 1.8, a serial execution policy (thrust::seq) has been available, which allows Thrust algorithms to be called by threads within an already running kernel, much like __device__ functions. Thrust should also support dynamic parallelism via the thrust::device execution policy on supported hardware.
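A small sketch combining both points, assuming compute capability 3.5+ (the kernel names are illustrative; dynamic parallelism normally requires compiling with something like nvcc -arch=sm_35 -rdc=true -lcudadevrt):

#include <cstdio>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

__global__ void child()
{
    printf("hello from the child kernel\n");
}

__global__ void parent(int *data, int n)
{
    // Dynamic parallelism: launch another kernel from device code
    child<<<1, 1>>>();

    // Thrust with the sequential execution policy, run by this single thread
    thrust::sort(thrust::seq, data, data + n);
}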

Need to convert C++ template to C99 code

I am porting CUDA code to OpenCL. CUDA allows C++ constructs like templates, while OpenCL is strictly C99. So, what is the most painless way of porting templates to C?
I thought of using function pointers for the template parameters.
Before there were templates, there were preprocessor macros.
Search the web for "generic programming in C" for inspiration.
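A typical macro-based substitute for a function template looks something like this (the macro and function names are just for illustration); it stamps out one concrete function per type:

#define DEFINE_ADD(T) \
    T add_##T(T a, T b) { return a + b; }

DEFINE_ADD(int)    /* defines add_int(int, int)       */
DEFINE_ADD(float)  /* defines add_float(float, float) */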
Here is the technique I used for conversion of some of CUDA algorithms from Modern GPU code to my GPGPU VexCL library (with OpenCL support).
Each template function in the CUDA code is converted to two template functions in the OpenCL host code. The first host function (the 'name' function) returns the mangled name of the generated OpenCL function (so that functions with different template parameters have different names); the second host function (the 'source' function) returns the string representation of the generated OpenCL function's source code. These functions are then used to generate the main kernel code.
Take, for example, the CTAMergeSort CUDA function template. It gets converted to two overloads of the merge_sort function in the VexCL code. I call the 'source' function in order to add the function definition to the OpenCL kernel source here, and then use the 'name' function to add its call to the kernel here.
Note that the backend::source_generator in VexCL is used in order to generate either OpenCL or CUDA code transparently. In your case the code generation could be much simpler.
To make it all a bit clearer, here is the code that gets generated for the mergesort<256,11,int,float> template instance:
void mergesort_256_11_int_float
(
    int count,
    int tid,
    int * thread_keys0,
    local int * keys_shared0,
    float * thread_vals0,
    local float * vals_shared0
)
{
    if(11 * tid < count) odd_even_transpose_sort_11_int_float(thread_keys0, thread_vals0);
    thread_to_shared_11_int(thread_keys0, tid, keys_shared0);
    block_sort_loop_256_11_int_float(tid, count, keys_shared0, thread_vals0, vals_shared0);
}
Take a look at Boost.Compute. It provides a C++, STL-like API for OpenCL.

Use CPU function in CUDA

I would like to include a C++ function in a CUDA kernel, but this function is written for the CPU, like this:
inline float random(int rangeMin, int rangeMax) {
    return rand(rangeMin, rangeMax);
}
Assume that the rand() function uses either curand.h or the Thrust CUDA library.
I thought of using a kernel function (with only one GPU thread) that would include this function as inline, so the CUDA compiler would generate the binary for the GPU.
Is this possible? If so, I would like to include other inline functions written for the CPU in the CUDA kernel function.
Something like this:
-- InlineCpuFunction.h and InlineCpuFunction.cpp
-- CudaKernel.cuh and CudaKernel.cu (this one includes the above header and uses its function in the CUDA kernel)
If you need some more explanation (as this may look confusing), please ask.
You can tag the functions you want to use on both the device and the host with the __host__ __device__ decorators; that way the code is compiled for both your CPU and GPU.
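A minimal sketch of that pattern (the function names here are illustrative, not the questioner's actual code):

#include <cstdio>

// Compiled for both the host and the device, so it can be called from
// ordinary C++ code as well as from inside a CUDA kernel.
__host__ __device__ inline float scale(float x, float factor)
{
    return x * factor;
}

__global__ void kernel(float *out)
{
    out[threadIdx.x] = scale(1.0f, 0.7f);  // device-side call
}

int main()
{
    printf("%f\n", scale(1.0f, 0.7f));  // host-side call
    return 0;
}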