TensorFlow CPU and CUDA code sharing - C++

I am writing an Op in C++ and CUDA for TensorFlow that has shared custom function code. Usually, when sharing code between CPU and CUDA implementations, one would define a macro that inserts the __device__ specifier into the function signature when compiling for CUDA. Is there a built-in way to share code in this manner in TensorFlow?
How does one define utility functions (usually inlined) that can run on both the CPU and the GPU?
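For reference, the macro trick mentioned above looks roughly like this (a minimal sketch; the HD macro name and the helper function are illustrative):

#ifdef __CUDACC__
#define HD __host__ __device__   // compiling with nvcc: callable on host and device
#else
#define HD                        // plain C++ compiler: expands to nothing
#endif

HD inline float clamp01(float x)
{
    return x < 0.f ? 0.f : (x > 1.f ? 1.f : x);
}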

It turns out that the following macros, which TensorFlow picks up from the bundled Eigen headers, will do what I describe.
namespace tensorflow {

EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
void foo() {
    // ...
}

}  // namespace tensorflow
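A minimal sketch of how such a shared helper might be used from both code paths (the kernel and function names are illustrative, not TensorFlow API; EIGEN_DEVICE_FUNC expands to __host__ __device__ when compiling with nvcc):

namespace tensorflow {

EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
float square(float x) { return x * x; }

// CUDA kernel (compiled by nvcc) calling the shared helper
__global__ void SquareKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(in[i]);
}

// CPU code path calling the same helper
void SquareOnCpu(const float* in, float* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = square(in[i]);
}

}  // namespace tensorflow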

Related

Is it possible to run a piece of pure C++ code on the GPU?

I don't know OpenCL very well, but I know that its C/C++ API requires the programmer to provide kernel code as a string. Lately, however, I discovered the ArrayFire library, which doesn't require string code to invoke some calculations. I wondered how that works (it is open source, but the code is a bit confusing). Would it be possible to write a parallel_for with an OpenCL backend that invokes an arbitrary piece of compiled (x86, for example) code, like the following:
template <typename F>
void parallel_for(int starts, int ends, F task)   // API
{
    /* some OpenCL magic */
}

// ...

parallel_for(0, 255, [&tab](int i){ tab[i] *= 0.7; });   // usage
PS: I know I am 99% too optimistic.
You cannot really call C++ host code from the device using standard OpenCL.
You can use SYCL, the Khronos standard for single-source C++ programming. SYCL lets you compile C++ directly into device code without requiring OpenCL kernel strings. You can call any C++ function from inside a SYCL kernel (as long as its source code is available). SYCL.tech has more links and up-to-date information.
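A minimal SYCL sketch of the parallel_for from the question (assuming a SYCL 2020 implementation such as DPC++ or AdaptiveCpp; the specifics are illustrative):

#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> tab(256, 1.0f);
    sycl::queue q;
    {
        sycl::buffer<float> buf(tab.data(), sycl::range<1>(tab.size()));
        q.submit([&](sycl::handler& h) {
            sycl::accessor acc(buf, h, sycl::read_write);
            h.parallel_for(sycl::range<1>(tab.size()), [=](sycl::id<1> i) {
                acc[i] *= 0.7f;   // plain C++ lambda body, compiled for the device
            });
        });
    }   // leaving the scope copies the buffer back into tab
}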

Is it possible to invoke a kernel function within another kernel function in CUDA?

Is it possible to invoke a __global__ function from within another __global__ function (which is also a kernel) in CUDA?
For example:
__global__ void func()
{
    ...
}

__global__ void foo()
{
    ...
    func();   // calling "func", which is itself defined as a kernel
}

int main(void)
{
    ...
    func<<<1, 1>>>();
    foo<<<1, 1>>>();
}
And would it be possible to use any function from the Thrust library in a __global__ function?
Compute capability 3.5 and newer hardware supports what is called dynamic parallelism, which gives kernels running on the GPU the ability to launch other kernels without requiring any host API calls.
Older hardware supports functions which can be called from kernels (these are declared __device__ instead of __global__); they execute at thread scope only, so no new kernel is launched.
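A minimal sketch of a __device__ function called from a kernel (names are illustrative):

__device__ int add_one(int x)
{
    return x + 1;   // runs in the calling thread; no new kernel is launched
}

__global__ void foo(int *out)
{
    out[threadIdx.x] = add_one(threadIdx.x);
}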
Since the Thrust 1.8 release, a serial execution policy (thrust::seq) has been available, which allows Thrust algorithms to be called by threads within an already running kernel, much like __device__ functions. Thrust should also support dynamic parallelism via the thrust::device execution policy on supported hardware.
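For example, a minimal sketch of calling a Thrust algorithm serially from inside a kernel (assuming Thrust 1.8 or newer; the kernel name is illustrative):

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

__global__ void sort_in_kernel(int *data, int n)
{
    // one thread sorts the whole array sequentially; no child kernel is launched
    if (threadIdx.x == 0 && blockIdx.x == 0)
        thrust::sort(thrust::seq, data, data + n);
}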

Need to convert C++ templates to C99 code

I am porting CUDA code to OpenCL. CUDA allows C++ constructs like templates, while OpenCL C is strictly C99-based. So, what is the most painless way of porting templates to C?
I thought of using function pointers for the template parameters.
Before there were templates, there were preprocessor macros.
Search the web for "generic programming in C" for inspiration.
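A minimal sketch of the macro approach (the macro and function names are illustrative):

/* Instantiate a typed "scale" function for each element type. */
#define DEFINE_SCALE(T)                                \
    void scale_##T(T *data, int n, T factor)           \
    {                                                  \
        for (int i = 0; i < n; ++i) data[i] *= factor; \
    }

DEFINE_SCALE(float)   /* generates scale_float  */
DEFINE_SCALE(double)  /* generates scale_double */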
Here is the technique I used to convert some of the CUDA algorithms from the Modern GPU code base for my GPGPU library VexCL (which has OpenCL support).
Each template function in the CUDA code is converted to two template functions in the OpenCL host code. The first host function (the 'name' function) returns the mangled name of the generated OpenCL function (so that instances with different template parameters get different names); the second host function (the 'source' function) returns the string representation of the generated OpenCL function's source code. These two functions are then used when generating the main kernel code.
Take, for example, the CTAMergeSort CUDA function template. It gets converted to two overloads of the merge_sort function in the VexCL code. I call the 'source' function to add the function definition to the OpenCL kernel source, and then use the 'name' function to insert a call to it into the kernel.
Note that backend::source_generator in VexCL is used to generate either OpenCL or CUDA code transparently. In your case the code generation could be much simpler.
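As a rough, simplified illustration of the pattern (hypothetical code, not the actual VexCL implementation):

#include <sstream>
#include <string>

// 'name' function: mangles the template parameters into the OpenCL function name.
template <int NT, int VT>
std::string merge_sort_name(const std::string &key_type, const std::string &val_type) {
    std::ostringstream s;
    s << "mergesort_" << NT << "_" << VT << "_" << key_type << "_" << val_type;
    return s.str();
}

// 'source' function: returns the generated OpenCL function source as a string.
template <int NT, int VT>
std::string merge_sort_source(const std::string &key_type, const std::string &val_type) {
    std::ostringstream s;
    s << "void " << merge_sort_name<NT, VT>(key_type, val_type) << "(...)\n"
         "{\n"
         "    /* generated body */\n"
         "}\n";
    return s.str();
}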
To make it all a bit more clear, here is the code that gets generated for the mergesort<256,11,int,float> template instance:
void mergesort_256_11_int_float
    (
    int count,
    int tid,
    int * thread_keys0,
    local int * keys_shared0,
    float * thread_vals0,
    local float * vals_shared0
    )
{
    if (11 * tid < count) odd_even_transpose_sort_11_int_float(thread_keys0, thread_vals0);
    thread_to_shared_11_int(thread_keys0, tid, keys_shared0);
    block_sort_loop_256_11_int_float(tid, count, keys_shared0, thread_vals0, vals_shared0);
}
Take a look at Boost.Compute. It provides a C++, STL-like API for OpenCL.
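For instance, a minimal Boost.Compute sketch along the lines of its tutorial examples (the specifics here are illustrative):

#include <vector>
#include <boost/compute/core.hpp>
#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/algorithm/transform.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/functional/math.hpp>

namespace compute = boost::compute;

int main() {
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    std::vector<float> host_data(1024, 4.0f);
    compute::vector<float> device_data(host_data.size(), context);

    // copy to the device, take the square root of every element, copy back
    compute::copy(host_data.begin(), host_data.end(), device_data.begin(), queue);
    compute::transform(device_data.begin(), device_data.end(),
                       device_data.begin(), compute::sqrt<float>(), queue);
    compute::copy(device_data.begin(), device_data.end(), host_data.begin(), queue);
}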

How to compute the exponential of a matrix inside a CUDA thread?

I need to be able to compute the matrix exponential inside a CUDA kernel. Is there any library whose function for this task could be called from within a CUDA thread? Or would it be possible to implement this function from scratch as a __device__ function?
I am using Microsoft Visual Studio 2008 Express for host code compilation and the nvcc compiler from toolkit v3.2.
GPU: NVIDIA GeForce GT640 (compute capability 3.0)
No, there is no such thing in the CUDA libraries, but you might look at this code to help you design a solution in CUDA:
https://github.com/poliu2s/MKL/blob/master/matrix_exponential.cpp
If you are working on a compute capability 3.5 architecture, it could be easier to solve your problem with dynamic parallelism, by calling a __global__ kernel from another __global__ kernel without returning to the host, so that you can choose the launch configuration (threads and blocks) you want.
Basically:
__global__ void child( ... )
{
    ...
}

__global__ void parent( ... )
{
    child<<< ..., ... >>>( ... );
}
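One detail worth remembering (the exact flags depend on the toolkit version): dynamic parallelism needs relocatable device code and the device runtime library, so the build looks roughly like the following, with the file name purely illustrative:

nvcc -arch=sm_35 -rdc=true parent_child.cu -o parent_child -lcudadevrt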
Hope this can help

Use a CPU function in CUDA

I would like to include a C++ function in a CUDA kernel, but this function is written for the CPU, like this:
inline float random(int rangeMin, int rangeMax) {
    return rand(rangeMin, rangeMax);
}
Assume that the rand() function uses either curand.h or the Thrust CUDA library.
I thought of using a kernel function (with only one GPU thread) that would include this function inline, so that the CUDA compiler would generate the binary for the GPU.
Is this possible? If so, I would like to include other inline functions written for the CPU in the CUDA kernel function.
Something like this:
-- InlineCpuFunction.h and InlineCpuFunction.cpp
-- CudaKernel.cuh and CudaKernel.cu (this one includes the above header and uses its function in the CUDA kernel)
If you need more explanation (as this may look confusing), please ask.
You can tag the functions you want to use on both the device and the host with both the __host__ and __device__ qualifiers; that way they are compiled for both your CPU and your GPU.
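A minimal sketch of that approach (the function and kernel names are illustrative):

__host__ __device__ inline float scale(float x, float factor)
{
    return x * factor;
}

__global__ void scale_kernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i], factor);   // device-side call
}

void scale_on_cpu(float *data, int n, float factor)
{
    for (int i = 0; i < n; ++i) data[i] = scale(data[i], factor);   // host-side call
}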