How to compute the exponential of a matrix inside a CUDA thread? - c++

I need to be able to compute the exponential of a matrix inside a CUDA kernel. Is there any library whose function for this task could be called from within a CUDA thread? Or would it be possible to implement this function from scratch as a __device__ function?
I am using Microsoft Visual Studio 2008 Express for host code compilation and the nvcc compiler from CUDA Toolkit 3.2.
GPU: NVIDIA GeForce GT640 (compute capability 3.0)

No, there is no such function in the CUDA libraries, but you might look at this code to help you design a solution in CUDA:
https://github.com/poliu2s/MKL/blob/master/matrix_exponential.cpp
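For reference, here is a minimal per-thread sketch of what a __device__ matrix exponential could look like, using a truncated Taylor series on a small fixed-size matrix (the dimension N, the number of terms, and the function names are illustrative assumptions, not taken from the linked code):

#define N 4            // illustrative matrix dimension
#define NUM_TERMS 16   // illustrative truncation of the Taylor series

__device__ void mat_mul(const float* a, const float* b, float* c)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float s = 0.0f;
            for (int k = 0; k < N; ++k)
                s += a[i * N + k] * b[k * N + j];
            c[i * N + j] = s;
        }
}

__device__ void mat_exp(const float* a, float* result)
{
    float term[N * N], tmp[N * N];
    // Start with result = I and term = I (the k = 0 term of the series).
    for (int i = 0; i < N * N; ++i) {
        result[i] = (i % (N + 1) == 0) ? 1.0f : 0.0f;
        term[i]   = result[i];
    }
    // Accumulate A^k / k! term by term.
    for (int k = 1; k <= NUM_TERMS; ++k) {
        mat_mul(term, a, tmp);
        for (int i = 0; i < N * N; ++i) {
            term[i]    = tmp[i] / (float)k;
            result[i] += term[i];
        }
    }
}

A truncated series like this is only adequate when the norm of A is small; a robust implementation would use scaling-and-squaring or a Padé approximant.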
If you are working on a compute capability 3.5 architecture, it could be easier to solve your problem with dynamic parallelism: you can launch a __global__ kernel from another __global__ kernel without returning to the host, so you can choose the launch configuration (threads and blocks) you want for it.
Basically:
__global__ void child( ... )
{
    ...
}

__global__ void parent( ... )
{
    child<<< ..., ... >>>( ... );
}
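Note that dynamic parallelism also requires compiling and linking with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true yourfile.cu -lcudadevrt (the file name is a placeholder).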
Hope this helps.

Related

How to pass a function as a cuda kernel parameter?

I would like to create a generic CUDA kernel that takes a callable object as a parameter (like a lambda or function) and invokes it.
I am having trouble passing a device function to a CUDA kernel as a parameter.
I have CUDA 9.2 with compute capability 3.5, and I use gcc 9.3 on Debian 10.
I tried this, compiled with nvcc -arch=sm_35 --expt-extended-lambda main.cu -o test:
#include <cstdio>

__host__ __device__ void say_hello()
{
    printf("Hello World from function!\n");
}

template<class Function>
__global__ void generic_kernel(Function f)
{
    f();
}

int main()
{
    // this is working
    generic_kernel<<<1,1>>>([]__device__(){ printf("Hello World from lambda!\n"); });
    cudaDeviceSynchronize();

    // this is not working!
    generic_kernel<<<1,1>>>(say_hello);
    cudaDeviceSynchronize();

    return 0;
}
I expected to see both Hello World from function! and Hello World from lambda! but I only see the message from the lambda.
Debian is not a supported environment for any version of CUDA, and gcc 9.3 is not a supported host compiler for CUDA 9.2.
There are quite a few questions covering these topics here on the cuda tag. This answer links to a number of them.
The short version is that it is fundamentally impossible to capture a __device__ function address in host code. A kernel launch (as you have it here) is written in host code; it is host code. Therefore the use of say_hello there is in host code, and it will refer to the __host__ function pointer/address. That function pointer/address is useless in device code. (Removing the __host__ decorator will not help.)
There are a number of possible solutions, one of which you've already explored: pass the callable wrapped in an object of some sort. A __device__ lambda, used directly as you have, fits that description.
Another possible fix for the function-pointer approach that is not working is to capture the function pointer in device code. It then has to be passed to the host, where it can be passed back through a kernel launch to device code and dispatched there. The linked answer above gives a number of ways this can be accomplished.
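A minimal sketch of that device-side capture, reusing the templated generic_kernel from the question (the d_say_hello variable name is illustrative):

#include <cstdio>

__device__ void say_hello()
{
    printf("Hello World from function!\n");
}

typedef void (*fp_t)();

// Capture the function's address in device code.
__device__ fp_t d_say_hello = say_hello;

template<class Function>
__global__ void generic_kernel(Function f)
{
    f();   // dispatched in device code, where the device address is valid
}

int main()
{
    fp_t h_fp;
    // Copy the device-captured address back to the host...
    cudaMemcpyFromSymbol(&h_fp, d_say_hello, sizeof(fp_t));
    // ...and pass it back to device code through the kernel launch.
    generic_kernel<<<1, 1>>>(h_fp);
    cudaDeviceSynchronize();
    return 0;
}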

TensorFlow CPU and CUDA code sharing

I am writing an Op in C++ and CUDA for TensorFlow that has shared custom function code. Usually when sharing code between CPU and CUDA implementations, one defines a macro that inserts the __device__ specifier into the function signature when compiling for CUDA. Is there a built-in way to share code in this manner in TensorFlow?
How does one define utility functions (usually inlined) that can run on both the CPU and the GPU?
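For illustration, the usual hand-rolled macro approach looks something like this (the macro name HOST_DEVICE and the helper function are just placeholders):

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__ inline
#else
#define HOST_DEVICE inline
#endif

// Shared utility, usable from both the CPU code and the CUDA kernel code.
HOST_DEVICE float clamp01(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}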
It turns out that the following macros in TensorFlow will do what I describe.
namespace tensorflow {

// EIGEN_DEVICE_FUNC expands to __host__ __device__ when compiled with nvcc
// and to nothing in a plain CPU build; EIGEN_STRONG_INLINE forces inlining.
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
void foo() {
    //
}

}  // namespace tensorflow

Is it possible to run a piece of pure C++ code on the GPU?

I don't know OpenCL very well, but I know its C/C++ API requires the programmer to provide OpenCL code as a string. However, I recently discovered the ArrayFire library, which doesn't require string code to invoke computations. I wondered how that works (it is open source, but the code is a bit confusing). Would it be possible to write a parallel_for with an OpenCL backend that invokes an arbitrary piece of compiled (x86, for example) code, like the following:
template <typename F>
void parallel_for(int starts, int ends, F task) // API
{ /* some OpenCL magic */ }

// ...
parallel_for(0, 255, [&tab](int i){ tab[i] *= 0.7; }); // usage
PS: I know I am 99% too optimistic.
You cannot really call compiled C++ host code from the device using standard OpenCL.
You can, however, use SYCL, the Khronos standard for single-source C++ programming. SYCL lets you compile C++ directly into device code without requiring OpenCL source strings, and you can call any C++ function from inside a SYCL kernel (as long as the source code is available). SYCL.tech has more links and updated information.
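Roughly, the single-source style looks like this (a sketch against the SYCL 2020 API; the data and sizes are illustrative):

#include <sycl/sycl.hpp>
#include <vector>

int main()
{
    std::vector<float> tab(256, 1.0f);
    {
        sycl::queue q;
        sycl::buffer<float, 1> buf(tab.data(), sycl::range<1>(tab.size()));
        q.submit([&](sycl::handler& h) {
            sycl::accessor acc(buf, h, sycl::read_write);
            // An ordinary C++ lambda, compiled for the device by the SYCL compiler.
            h.parallel_for(sycl::range<1>(tab.size()), [=](sycl::id<1> i) {
                acc[i] *= 0.7f;
            });
        });
    } // the buffer destructor waits for the kernel and copies the results back into tab
    return 0;
}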

Is it possible to invoke a kernel function within another kernel function in CUDA?

Is it possible to invoke a __global__ function from within another __global__ function (which is also a kernel) in CUDA?
for example:
__global__ void func()
{
    ...
}

__global__ void foo()
{
    ...
    func();   // calling "func", which is defined above as a kernel
}

int main(void)
{
    ...
    func<<<1, 1>>>();
    foo<<<1, 1>>>();
}
And is it possible to use functions from the Thrust library in a __global__ function?
Compute capability 3.5 and newer hardware supports what is called Dynamic Parallelism, which lets running kernels launch other kernels on the GPU without requiring any host API calls.
Older hardware supports functions which can be called from kernels (these are declared __device__ instead of __global__); they are executed at thread scope only, so no new kernel is launched.
Since the Thrust 1.8 release, a sequential execution policy (thrust::seq) has been available, which allows Thrust algorithms to be called by threads within an existing running kernel, much like __device__ functions. Thrust also supports dynamic parallelism via the thrust::device execution policy on supported hardware.
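For example, a minimal sketch of the sequential-policy usage (the data layout is illustrative):

#include <thrust/execution_policy.h>
#include <thrust/sort.h>

__global__ void sort_rows(float* data, int row_len)
{
    // Each thread sorts its own row sequentially, entirely within that thread.
    float* row = data + threadIdx.x * row_len;
    thrust::sort(thrust::seq, row, row + row_len);
}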

Use a CPU function in CUDA

I would like to include a C++ function in a CUDA kernel, but this function was written for the CPU, like this:
inline float random(int rangeMin, int rangeMax) {
    return rand(rangeMin, rangeMax);
}
Assume that the rand() function uses either curand.h or the Thrust CUDA library.
I thought of using a kernel function (with only one GPU thread) that would include this function as inline, so the CUDA compiler would generate the binary for the GPU.
Is this possible? If so, I would like to include other inline functions written for the CPU in the CUDA kernel function.
Something like this:
-- InlineCpuFunction.h and InlineCpuFunction.cpp
-- CudaKernel.cuh and CudaKernel.cu (this one includes the above header and uses its function in the CUDA kernel)
If you need some more explanation (as this may look confusing), please ask me.
You can tag the functions you want to use on both the device and the host with the __host__ __device__ decorators; that way they are compiled for both your CPU and your GPU.
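A minimal sketch of that approach (the helper here is illustrative and deliberately avoids rand(), which is not callable from device code; for real random numbers on the device you would use cuRAND):

__host__ __device__ inline float scale_to_range(float t, float rangeMin, float rangeMax)
{
    // Maps t in [0, 1] onto [rangeMin, rangeMax]; compiled for both CPU and GPU.
    return rangeMin + t * (rangeMax - rangeMin);
}

__global__ void kernel(float* out, float t, float lo, float hi)
{
    out[threadIdx.x] = scale_to_range(t, lo, hi);   // device-side call
}

int main()
{
    float x = scale_to_range(0.5f, 0.0f, 10.0f);    // host-side call, same source
    (void)x;
    return 0;
}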