I don't know OpenCL very well, but I know its C/C++ API requires the programmer to provide OpenCL code as a string. Lately, however, I discovered the ArrayFire library, which doesn't require string code to invoke some calculations. I wondered how that works (it is open source, but the code is a bit confusing). Would it be possible to write a parallel for with an OpenCL backend that invokes any piece of compiled (x86, for example) code, like the following:
template <typename F>
void parallel_for(int starts, int ends, F task) //API
{ /*some OpenCL magic */ }
//...
parallel_for(0, 255, [&tab](int i){ tab[i] *= 0.7; } ); //using
PS: I know there is a 99% chance I am being too optimistic
You cannot really call C++ host code from the device using standard OpenCL.
You can use SYCL, the Khronos standard for single-source C++ programming. SYCL lets you compile C++ directly into device code without requiring OpenCL strings. You can call any C++ function from inside a SYCL kernel (as long as its source code is available). SYCL.tech has more links and up-to-date information.
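For illustration, here is a minimal sketch of the questioner's parallel_for written in SYCL (assuming a SYCL 1.2-style implementation; the kernel name and sizes are just placeholders):

#include <CL/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> tab(256, 1.0f);
    {
        cl::sycl::queue q;
        cl::sycl::buffer<float, 1> buf(tab.data(), cl::sycl::range<1>(tab.size()));
        q.submit([&](cl::sycl::handler &cgh) {
            auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
            // the body of the C++ lambda becomes the device code
            cgh.parallel_for<class scale_kernel>(cl::sycl::range<1>(tab.size()),
                [=](cl::sycl::id<1> i) { acc[i] *= 0.7f; });
        });
    } // the buffer destructor copies the results back into tab
}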
I am writing an Op in C++ and CUDA for TensorFlow that has shared custom function code. Usually, when sharing code between CPU and CUDA implementations, one defines a macro that inserts the __device__ specifier into the function signature when compiling for CUDA. Is there a built-in way to share code in this manner in TensorFlow?
How does one define utility functions (usually inlined) that can run on both the CPU and the GPU?
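For reference, the macro pattern the question alludes to usually looks something like this (CUDA_CALLABLE is just a hypothetical name):

#ifdef __CUDACC__
#define CUDA_CALLABLE __host__ __device__
#else
#define CUDA_CALLABLE
#endif

// compiled for both CPU and GPU when the file is built with nvcc
CUDA_CALLABLE inline float clamp01(float x) {
    return x < 0.f ? 0.f : (x > 1.f ? 1.f : x);
}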
It turns out that the following macros in TensorFlow will do what I describe.
namespace tensorflow {

EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
void foo() {
    // shared CPU/GPU utility code goes here
}

}  // namespace tensorflow
I'm writing some C code using IOKit, and need to use IOMemoryDescriptor methods. Unfortunately, I can only compile pure C sources, and that is a C++ class. So, I'm asking if there is some C interface that lets me perform the same operations.
Specifically, I want a function that does pretty much this, but that can be compiled as C:
#include <IOKit/IOMemoryDescriptor.h>

extern "C" void CopyOut(mach_vm_address_t src, void *dst, size_t size)
{
    IOMemoryDescriptor *memDesc;

    memDesc = IOMemoryDescriptor::withAddressRange(src, size, kIODirectionOut, current_task());
    // Error checking removed for brevity
    memDesc->prepare();
    memDesc->readBytes(0, dst, size);
    memDesc->complete();
    memDesc->release();
}
Being based on BSD, xnu has inherited some of BSD's kernel APIs, including the copyin and copyout functions. They are declared in libkern.h, and they do pretty much what you're using an IOMemoryDescriptor for, but nothing else.
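A minimal C sketch of the equivalent call, assuming the declarations from libkern.h mentioned above (the cast is only there because the question's signature uses mach_vm_address_t):

#include <libkern/libkern.h>
#include <mach/mach_types.h>

int CopyOut(mach_vm_address_t src, void *dst, size_t size)
{
    /* copyin() copies from the current task's address space into a
       kernel buffer; it returns 0 on success or an errno-style code */
    return copyin((user_addr_t)src, dst, size);
}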
You do mention you're using IOKit - if you need anything beyond this out of IOKit's functionality, you'll pretty much have to go with a C++ compiler, or use C to call mangled names directly.
If you're new to using an unusual compiler for building kexts, I'll just warn you that kernel code for x86_64 must not use the red zone of the stack, as it can't exist due to interrupt handling. If your compiler assumes a red zone is present, you'll get bizarre crashes. Clang and gcc have corresponding flags for disabling the red zone (-mno-red-zone, if I remember correctly; it is activated automatically via the kernel-mode flag). Even if you're using a non-official compiler, linking against an object file built with clang's C++ compiler at the last stage should work fine for wrapping any other C++ APIs.
The current OpenCL C++ bindings in CL/cl.hpp are a very thin wrapper over the C OpenCL API. I understand there are reasons why it was done this way, although honestly I don't really see them.
Are there any existing alternative wrappers that rely on exceptions for error handling, allowing one to just write code like this:
auto platform_list = cl::Platform::get();
because, well, RVO and readability and such, instead of the current
std::vector<cl::Platform> platform_list;
auto error = cl::Platform::get(&platform_list);
if(error != CL_SUCCESS) { /* handle the error */ }
Or if one opts in on exception handling (by defining __CL_ENABLE_EXCEPTIONS):
std::vector<cl::Platform> platform_list;
cl::Platform::get(&platform_list);
Note the actual error handling code is not shown, although in the non-exceptions case this can get quite messy.
I'm sure such bindings would not be terribly hard to write, but edge cases remain edge cases and I'd prefer a solid pre-written wrapper. Call me spoiled, but if C++ bindings do not offer a real C++ interface, I don't really see the point of them.
Check out the Boost.Compute library. It is header-only and provides a high-level C++ API for GPGPU/parallel-computing based on OpenCL.
Getting the list of platforms looks like this:
for(auto platform : boost::compute::system::platforms()){
std::cout << platform.vendor() << std::endl;
}
And it uses exceptions for error handling (which vastly reduces the amount of explicit checking required and gives much nicer error messages on failure):
try {
// attempt to compile the program
program.build();
}
catch(boost::compute::opencl_error &e){
// program failed to compile, print out the build log
std::cout << program.build_log() << std::endl;
}
On top of all that, it also offers an STL-like interface with containers like vector<T> and array<T, N> as well as algorithms like sort() and transform() (along with other features like random number generation and lambda expression support).
For example, to sort a vector of floats on the device you just:
// vector of floats on the device
boost::compute::vector<float> vec = ...;
// sort the vector
boost::compute::sort(vec.begin(), vec.end(), queue);
// copy the sorted vector back to the host
boost::compute::copy(vec.begin(), vec.end(), host_vec.begin(), queue);
There are more tutorials and examples in the documentation.
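As a small end-to-end sketch (device selection and sizes are illustrative), copying data to the device, transforming it with a lambda placeholder, and copying it back looks like this:

#include <boost/compute.hpp>
#include <vector>
#include <iostream>

namespace compute = boost::compute;

int main() {
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    std::vector<float> host = { 4.f, 1.f, 3.f, 2.f };
    compute::vector<float> dev(host.size(), context);

    // host -> device
    compute::copy(host.begin(), host.end(), dev.begin(), queue);

    // multiply every element by 2 on the device
    using compute::lambda::_1;
    compute::transform(dev.begin(), dev.end(), dev.begin(), _1 * 2.f, queue);

    // device -> host
    compute::copy(dev.begin(), dev.end(), host.begin(), queue);

    for (float x : host) std::cout << x << " ";
}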
The C++ wrappers are designed to be just a thin layer on top of OpenCL so they can be included just as a header file. There are some C++/OpenCL libraries that offer various kinds of support for C++, such as AMD Bolt.
There is a proposal for a layer/library for C++, SYCL. It is slightly more complex than a wrapper, as it requires a device compiler to produce OpenCL kernels, but provides (IMHO) nice abstractions and exception handling.
The provisional specification is already available, and there is a (work-in-progress) open source implementation.
I am porting CUDA code to OpenCL. CUDA allows C++ constructs like templates, while OpenCL C is strictly C99. So, what is the most painless way of porting templates to C?
I thought of using function pointers for the template parameters.
Before there were templates, there were preprocessor macros.
Search the web for "generic programming in C" for inspiration.
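As a trivial sketch of that macro style (the names here are made up, but the same token-pasting trick works inside OpenCL C kernels):

/* the type is a macro parameter; token pasting mangles it into the name */
#define DEFINE_MAX(T)                \
    T max_##T(T a, T b) {            \
        return a > b ? a : b;        \
    }

DEFINE_MAX(float)  /* generates max_float() */
DEFINE_MAX(int)    /* generates max_int()   */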
Here is the technique I used to convert some of the CUDA algorithms from the Modern GPU code to my GPGPU VexCL library (with OpenCL support).
Each template function in the CUDA code is converted to two template functions in the OpenCL host code. The first host function (the 'name' function) returns the mangled name of the generated OpenCL function (so that functions with different template parameters have different names); the second host function (the 'source' function) returns the string representation of the generated OpenCL function's source code. These functions are then used to generate the main kernel code.
Take, for example, the CTAMergeSort CUDA function template. It gets converted to the two overloads of merge_sort function in VexCL code. I call the 'source' function in order to add the function definition to the OpenCL kernel source here and then use the 'name' function to add its call to the kernel here.
Note that the backend::source_generator in VexCL is used in order to generate either OpenCL or CUDA code transparently. In your case the code generation could be much simpler.
To make it all a bit more clear, here is the code that gets generated for the mergesort<256,11,int,float> template instance:
void mergesort_256_11_int_float
(
    int count,
    int tid,
    int * thread_keys0,
    local int * keys_shared0,
    float * thread_vals0,
    local float * vals_shared0
)
{
    if (11 * tid < count) odd_even_transpose_sort_11_int_float(thread_keys0, thread_vals0);
    thread_to_shared_11_int(thread_keys0, tid, keys_shared0);
    block_sort_loop_256_11_int_float(tid, count, keys_shared0, thread_vals0, vals_shared0);
}
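As a rough, hypothetical sketch of the 'name'/'source' pair described above (this is not VexCL's actual API; the helpers and signatures are illustrative):

#include <string>
#include <sstream>

template <typename T> const char *type_name();              // illustrative helper
template <> const char *type_name<int>()   { return "int"; }
template <> const char *type_name<float>() { return "float"; }

// 'name' function: mangles the template parameters into the OpenCL function name
template <int NT, int VT, typename K, typename V>
std::string merge_sort_name() {
    std::ostringstream s;
    s << "mergesort_" << NT << "_" << VT << "_" << type_name<K>() << "_" << type_name<V>();
    return s.str();
}

// 'source' function: emits the OpenCL C source for that particular instance
template <int NT, int VT, typename K, typename V>
std::string merge_sort_source() {
    std::ostringstream s;
    s << "void " << merge_sort_name<NT, VT, K, V>() << "(int count, int tid, ...)\n"
         "{\n"
         "    /* body built from NT, VT and the type names */\n"
         "}\n";
    return s.str();
}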
Take a look at Boost.Compute. It provides a C++, STL-like API for OpenCL.
I would like to include a C++ function in a CUDA kernel, but the function is written for the CPU, like this:
inline float random(int rangeMin, int rangeMax) {
    return rand(rangeMin, rangeMax);
}
Assume that the rand() function uses either curand.h or the Thrust CUDA library.
I thought of using a kernel function (with only one GPU thread) that would include this function inline, so the CUDA compiler would generate the binary for the GPU.
Is this possible? If so, I would also like to include other inline functions written for the CPU in the CUDA kernel function.
Something like this:
-- InlineCpuFunction.h and InlineCpuFunction.cpp
-- CudaKernel.cuh and CudaKernel.cu (this one includes the above header and uses its function in the CUDA kernel)
If you need some more explanation (as this may look confusing) please ask me.
You can tag the functions you want to use on both the device and the host with the __host__ __device__ qualifiers; that way they are compiled for both your CPU and your GPU.
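A minimal sketch of that pattern (compiled with nvcc; note the question's random() would still need a device-side RNG such as cuRAND rather than the host rand()):

// the same inline function is usable from host code and from kernels
__host__ __device__ inline float scale(float x, float factor) {
    return x * factor;
}

__global__ void kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i], 0.7f);
}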