I would like to include a C++ function in a CUDA kernel, but the function is written for the CPU, like this:
inline float random(int rangeMin, int rangeMax) {
    return rand(rangeMin, rangeMax);
}
Assume that the rand() function uses either curand.h or the Thrust CUDA library.
I thought of using a kernel function (with only one GPU thread) that would include this function inline, so that the CUDA compiler would generate the binary for the GPU.
Is this possible? If so, I would also like to include other inline functions written for the CPU in the CUDA kernel function.
Something like this:
-- InlineCpuFunction.h and InlineCpuFunction.cpp
-- CudaKernel.cuh and CudaKernel.cu (this one includes the above header and uses its functions in the CUDA kernel)
If you need some more explanation (as this may look confusing) please ask me.
You can tag the functions you want to use on both the device and the host with the __host__ __device__ qualifiers; that way they are compiled for both your CPU and your GPU.
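For example, a minimal sketch (the scale helper, kernel and host loop below are invented for illustration, not taken from the question):

__host__ __device__ inline float scale(float x, float factor)
{
    return x * factor;   // identical code compiled for CPU and GPU
}

__global__ void scaleKernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i], factor);   // device-side call
}

void scaleOnHost(float* data, int n, float factor)
{
    for (int i = 0; i < n; ++i) data[i] = scale(data[i], factor);   // host-side call
}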
I am writing an Op in C++ and CUDA for TensorFlow that has shared custom function code. Usually when sharing code between CPU and CUDA implementations, one would define a macro to insert the __device__ specifier into the function signature, if compiling for CUDA. Is there an inbuilt way to share code in this manner in TensorFlow?
How does one define utility functions (usually inlined) that can run on both the CPU and the GPU?
It turns out that the following macros in TensorFlow will do what I describe.
namespace tensorflow {

EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
void foo() {
    // shared CPU/GPU implementation goes here
}

}  // namespace tensorflow
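For comparison, here is a minimal sketch of the hand-rolled macro approach mentioned in the question (CPU_GPU_FUNC and clamp01 are invented names): the device qualifiers are inserted only when the file is compiled by the CUDA compiler.

#ifdef __CUDACC__
#define CPU_GPU_FUNC __host__ __device__
#else
#define CPU_GPU_FUNC
#endif

// Shared utility callable from both the CPU code and CUDA kernels.
CPU_GPU_FUNC inline float clamp01(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}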
I don't know OpenCL very well, but I know that its C/C++ API requires the programmer to provide OpenCL code as a string. Lately, though, I discovered the ArrayFire library, which doesn't require string code to invoke some calculations. I wondered how it works (it is open source, but the code is a bit confusing). Would it be possible to write a parallel_for with an OpenCL backend that invokes any piece of compiled (x86, for example) code, like the following:
template <typename F>
void parallel_for(int starts, int ends, F task)  // API
{ /* some OpenCL magic */ }

// ...

parallel_for(0, 255, [&tab](int i){ tab[i] *= 0.7; });  // usage
PS: I know there is a 99% chance that I am being too optimistic.
You cannot really call C++ host code from the device using standard OpenCL.
You can use SYCL, the Khronos standard for single-source C++ programming. SYCL allows you to compile C++ directly into device code, without requiring OpenCL strings. You can call any C++ function from inside a SYCL kernel (as long as the source code is available). SYCL.tech has more links and updated information.
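As a rough sketch of what the parallel_for from the question could look like in single-source SYCL (the exact header path and accessor spelling vary between SYCL versions and implementations):

#include <sycl/sycl.hpp>
#include <vector>

int main()
{
    std::vector<float> tab(256, 1.0f);
    sycl::queue q;

    {
        // The buffer borrows tab's storage for the duration of this scope.
        sycl::buffer<float, 1> buf(tab.data(), sycl::range<1>(tab.size()));

        q.submit([&](sycl::handler& h) {
            auto acc = buf.get_access<sycl::access::mode::read_write>(h);
            // An ordinary C++ lambda, compiled to device code by the SYCL compiler.
            h.parallel_for(sycl::range<1>(tab.size()),
                           [=](sycl::id<1> i) { acc[i] *= 0.7f; });
        });
    }   // leaving the scope waits for the kernel and writes the results back to tab
}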
In the LLVM source tree we can see the intrinsics cvta_shared_yes, cvta_shared_yes_64, cvta_to_shared_yes_64, and similar ones for other memory types like global, local, constant, etc. What is the purpose of these? Are they defining the behavior of the memory types? If so, can we add a new intrinsic?
These intrinsics are used by the NVPTX backend to emit the PTX special operations that convert pointers to global, local, shared or constant memory into the generic address space and back.
This is specific to the NVPTX backend and reflects the memory hierarchy of an Nvidia (CUDA) GPU.
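To see where these conversions come from on the CUDA side, here is a small made-up kernel: a pointer into __shared__ memory has to be converted to the generic address space before it can be passed to a function that takes an ordinary pointer, and that conversion is what the cvta.shared-style operations express in the generated PTX.

__device__ float sum32(const float* p)      // parameter is a generic-space pointer
{
    float s = 0.0f;
    for (int i = 0; i < 32; ++i) s += p[i];
    return s;
}

__global__ void kernel(float* out)
{
    __shared__ float buf[32];               // lives in the shared address space
    buf[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    // Passing buf converts the shared-space address to a generic one,
    // which is where the cvta.shared conversions show up.
    out[threadIdx.x] = sum32(buf);
}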
If you want to add intrinsics to LLVM, have a look at the llvm/include/llvm/IR/Intrinsics*.td TableGen files. These files are used to generate everything necessary for an intrinsic.
For example:
def int_memcpy : Intrinsic<[],
                           [llvm_anyptr_ty, llvm_anyptr_ty, llvm_anyint_ty,
                            llvm_i32_ty, llvm_i1_ty],
                           [IntrReadWriteArgMem, NoCapture<0>, NoCapture<1>,
                            ReadOnly<1>]>;
will generate the llvm.memcpy intrinsic, which can be used by a backend to generate calls to the memcpy functions of a specific system.
However, keep in mind that the backend has to support your new intrinsics somehow. You can look at ./llvm/lib/Target/X86/X86ISelLowering.cpp to see how the X86 backend handles the llvm.memcpy intrinsic.
I am porting CUDA code to OpenCL - CUDA allows C++ constructs like templates, while OpenCL is strictly C99. So, what is the most painless way of porting templates to C?
I thought of using function pointers for the template parameters.
Before there were templates, there were preprocessor macros.
Search the web for "generic programming in C" for inspiration.
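For instance, a minimal sketch of the token-pasting idiom (DEFINE_SCALE and the scale_* functions are invented names): the element type is a macro parameter and is pasted into the function name, so several "instantiations" can coexist in one C99/OpenCL program.

/* Expands to a scale_<T> function; ## pastes the type into the name. */
#define DEFINE_SCALE(T)                                 \
    void scale_##T(T *data, int n, T factor)            \
    {                                                   \
        for (int i = 0; i < n; ++i) data[i] *= factor;  \
    }

DEFINE_SCALE(float)   /* defines scale_float */
DEFINE_SCALE(int)     /* defines scale_int   */

In an OpenCL kernel file the same macro works once the appropriate address-space qualifier is added to the pointer parameter.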
Here is the technique I used to convert some of the CUDA algorithms from the Modern GPU code to my GPGPU VexCL library (which has an OpenCL backend).
Each template function in the CUDA code is converted to two template functions in the OpenCL host code. The first host function (the 'name' function) returns the mangled name of the generated OpenCL function (so that functions with different template parameters get different names); the second host function (the 'source' function) returns the string representation of the generated OpenCL function's source code. These functions are then used to generate the main kernel code.
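This is not the actual VexCL code, but a bare-bones sketch of what such a 'name'/'source' pair might look like for a made-up scale<T, N> template:

#include <sstream>
#include <string>

// Invented helper: maps a C++ type to its OpenCL spelling.
template <typename T> std::string type_name();
template <> inline std::string type_name<int>()   { return "int"; }
template <> inline std::string type_name<float>() { return "float"; }

// 'name' function: mangled name, unique per template instantiation.
template <typename T, int N>
std::string scale_name()
{
    std::ostringstream s;
    s << "scale_" << N << "_" << type_name<T>();
    return s.str();
}

// 'source' function: OpenCL source code for that instantiation.
template <typename T, int N>
std::string scale_source()
{
    std::ostringstream s;
    s << "void " << scale_name<T, N>() << "(global " << type_name<T>() << " *v)\n"
      << "{\n"
      << "    for(int i = 0; i < " << N << "; ++i) v[i] *= 2;\n"
      << "}\n";
    return s.str();
}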
Take, for example, the CTAMergeSort CUDA function template. It gets converted to the two overloads of the merge_sort function in the VexCL code. I call the 'source' function in order to add the function definition to the OpenCL kernel source here, and then use the 'name' function to add its call to the kernel here.
Note that the backend::source_generator in VexCL is used in order to generate either OpenCL or CUDA code transparently. In your case the code generation could be much simpler.
To make it all a bit more clear, here is the code that gets generated for the mergesort<256,11,int,float> template instance:
void mergesort_256_11_int_float
    (
        int count,
        int tid,
        int * thread_keys0,
        local int * keys_shared0,
        float * thread_vals0,
        local float * vals_shared0
    )
{
    if (11 * tid < count) odd_even_transpose_sort_11_int_float(thread_keys0, thread_vals0);
    thread_to_shared_11_int(thread_keys0, tid, keys_shared0);
    block_sort_loop_256_11_int_float(tid, count, keys_shared0, thread_vals0, vals_shared0);
}
Take a look at Boost.Compute. It provides a C++, STL-like API for OpenCL.
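A rough sketch of what that looks like in practice (based on my reading of the Boost.Compute documentation; check the current API before relying on the details):

#include <vector>
#include <boost/compute.hpp>

namespace compute = boost::compute;
using compute::lambda::_1;

int main()
{
    compute::command_queue queue = compute::system::default_queue();

    std::vector<float> host(256, 1.0f);
    compute::vector<float> dev(host.begin(), host.end(), queue);

    // STL-like algorithm executed on the OpenCL device.
    compute::transform(dev.begin(), dev.end(), dev.begin(), _1 * 0.7f, queue);

    compute::copy(dev.begin(), dev.end(), host.begin(), queue);
}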
I'm working on a number crunching app using the CUDA framework. I have some static data that should be accessible to all threads, so I've put it in constant memory like this:
__device__ __constant__ CaseParams deviceCaseParams;
I use the call cudaMemcpyToSymbol to transfer these params from the host to the device:
void copyMetaData(CaseParams* caseParams)
{
    cudaMemcpyToSymbol("deviceCaseParams", caseParams, sizeof(CaseParams));
}
which works.
Anyway, it seems (by trial and error, and also from reading posts on the net) that, for some sick reason, the declaration of deviceCaseParams and the copy operation on it (the call to cudaMemcpyToSymbol) must be in the same file. At the moment I have these two in a .cu file, but I really want to have the parameter struct in a .cuh file so that any implementation could see it if it wants to. That means that I also have to have the copyMetaData function in a header file, but this messes up linking (symbol already defined), since both .cpp and .cu files include this header (and thus both the MS C++ compiler and nvcc compile it).
Does anyone have any advice on design here?
Update: See the comments
With an up-to-date CUDA (e.g. 3.2) you should be able to do the memcpy from within a different translation unit if you're looking up the symbol at runtime (i.e. by passing a string as the first arg to cudaMemcpyToSymbol as you are in your example).
Also, with Fermi-class devices you can just malloc the memory (cudaMalloc), copy to the device memory, and then pass the argument as a const pointer. The compiler will recognise if you are accessing the data uniformly across the warps and if so will use the constant cache. See the CUDA Programming Guide for more info. Note: you would need to compile with -arch=sm_20.
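A sketch of that second approach (the kernel and struct layout are just placeholders, and error checking is omitted): allocate with cudaMalloc in whichever translation unit is convenient, copy the parameters across, and pass them to the kernel as a pointer-to-const.

#include <cuda_runtime.h>

struct CaseParams { int n; float scale; };

__global__ void crunch(const CaseParams* params, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < params->n) out[i] *= params->scale;   // uniform, read-only access
}

void launch(const CaseParams& hostParams, float* devOut, int blocks, int threads)
{
    CaseParams* devParams = 0;
    cudaMalloc((void**)&devParams, sizeof(CaseParams));
    cudaMemcpy(devParams, &hostParams, sizeof(CaseParams), cudaMemcpyHostToDevice);

    crunch<<<blocks, threads>>>(devParams, devOut);

    cudaDeviceSynchronize();   // make sure the kernel has finished before freeing
    cudaFree(devParams);
}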
If you're using pre-Fermi CUDA, you will have found out by now that this problem doesn't just apply to constant memory; it applies to anything you want on the CUDA side of things. The only two ways I have found around this are to either:
Write everything CUDA in a single file (.cu), or
If you need to break out code into separate files, restrict yourself to headers which your single .cu file then includes.
If you need to share code between CUDA and C/C++, or have some common code you share between projects, option 2 is the only choice. It seems very unnatural to start with, but it solves the problem. You still get to structure your code, just not in a typically C-like way. The main overhead is that every time you do a build, you compile everything. The plus side (which I think is possibly why it works this way) is that the CUDA compiler has access to all the source code in one hit, which is good for optimisation.
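As a concrete illustration of option 2 (all file names invented), the layout might look like this, with nvcc only ever compiling the single kernels.cu:

// kernels.cu -- the one CUDA translation unit
#include "case_params.h"        // plain struct definitions, also included by the .cpp side
#include "device_constants.cuh" // __constant__ declarations such as deviceCaseParams
#include "copy_meta_data.cuh"   // the cudaMemcpyToSymbol wrapper
#include "kernels.cuh"          // the __global__ / __device__ function bodies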