Tensorflow GPU new op memory allocation - c++

I am trying to create a new tensorflow GPU op following the instructions on their website.
Looking at their example, it seems they feed a C++ pointer directly into the CUDA kernel without allocating device memory and copying the contents of the host pointer to the device pointer.
From what I understand of CUDA you always have to allocate memory on the device and then use device pointers inside the kernels.
What am I missing? I checked that input_tensor.flat<T>().data() should return a regular C++ pointer. Here is a copy of the code I am referring to:
// kernel_example.cu.cc
#ifdef GOOGLE_CUDA
#include "example.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"
using namespace tensorflow;
using GPUDevice = Eigen::GpuDevice;
// Define the CUDA kernel.
template <typename T>
__global__ void ExampleCudaKernel(const int size, const T* in, T* out) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
       i += blockDim.x * gridDim.x) {
    out[i] = 2 * ldg(in + i);
  }
}
// Define the GPU implementation that launches the CUDA kernel.
template <typename T>
void ExampleFunctor<GPUDevice, T>::operator()(
    const GPUDevice& d, int size, const T* in, T* out) {
  // Launch the cuda kernel.
  // See core/util/cuda_kernel_helper.h for example of computing
  // block count and thread_per_block count.
  int block_count = 1024;
  int thread_per_block = 20;
  ExampleCudaKernel<T>
      <<<block_count, thread_per_block, 0, d.stream()>>>(size, in, out);
}
// Explicitly instantiate functors for the types of OpKernels registered.
template struct ExampleFunctor<GPUDevice, float>;
template struct ExampleFunctor<GPUDevice, int32>;
#endif // GOOGLE_CUDA

If you look at the code on https://www.tensorflow.org/extend/adding_an_op, you will see that the allocation is done in kernel_example.cc:
void Compute(OpKernelContext* context) override {
  // Grab the input tensor
  const Tensor& input_tensor = context->input(0);
  // Create an output tensor
  Tensor* output_tensor = NULL;
  OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
                                                   &output_tensor));
  // Do the computation.
  OP_REQUIRES(context, input_tensor.NumElements() <= tensorflow::kint32max,
              errors::InvalidArgument("Too many elements in tensor"));
  ExampleFunctor<Device, T>()(
      context->eigen_device<Device>(),
      static_cast<int>(input_tensor.NumElements()),
      input_tensor.flat<T>().data(),
      output_tensor->flat<T>().data());
}
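In context->allocate_output(...), they hand over a pointer to the output Tensor pointer, and the tensor is then allocated. The context knows whether it is running on the GPU or the CPU and allocates the tensor on the device or on the host accordingly. The same holds for the inputs of a kernel registered for the GPU: by default they already live in device memory. The pointer handed to the CUDA kernel therefore is a regular C++ pointer, but it points to the actual data within the Tensor class, which resides on the device.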


CL_INVALID_ARG_SIZE when setting kernel arg

I'm making an attempt to break away from CUDA and learn OpenCL. I thought an n-body simulation might be a good place to start. I've been using the c++ wrapper, and following the tutorial provided here to get a basic idea of how things should work.
The program loads 2 source files, one for each kernel function. Each is compiled, and built into a separate kernel. On my first attempt, they were in the same kernel. This was an attempt to fix the issue by doing something different.
typedef struct body_t {
__kernel void execute(__global struct body_t* bodies, const float G, const int n){
typedef struct body_t {
__kernel void execute(__global struct body_t* bodies, const float dt, const int n){
The buffers are allocated as such:
// create memory buffers
bGravity = new cl::Buffer(*context, CL_MEM_READ_ONLY, sizeof(float));
bBodies = new cl::Buffer(*context, CL_MEM_READ_WRITE,
capacity * sizeof(body_t));
bTimestep = new cl::Buffer(*context, CL_MEM_READ_ONLY, sizeof(float));
bSize = new cl::Buffer(*context, CL_MEM_READ_ONLY, sizeof(int));
After the data is copied to the buffers, and it comes time to run the simulation, I set the kernel arguments as such:
fKernel->setArg(0, *bBodies);
fKernel->setArg(1, *bGravity);
fKernel->setArg(2, *bSize);
pKernel->setArg(0, *bBodies);
pKernel->setArg(1, *bTimestep);
pKernel->setArg(2, *bSize);
cl::NDRange global(capacity);
cl::NDRange local(1);
for (int step = 0; step < steps; step++) {
    queue->enqueueNDRangeKernel(*fKernel, cl::NullRange, global, local);
    queue->enqueueNDRangeKernel(*pKernel, cl::NullRange, global, local);
}
But on execution of the second line of the simulation function (fKernel->setArg(1, *bGravity)), the program terminates with CL_INVALID_ARG_SIZE. It seemed like it should be a trivial error to solve, but try as I might, I can't find anything that would cause it.
I've tried passing different data types, including the types provided by opencl (cl_float, etc), but the problem remains. I'm sure I've just done something dumb, but I've been banging my head against the wall for the past few days to no avail.
In my attempt to keep this post short, if there's any critical code I neglected to include, everything can be found on the git repo here.
You have a mismatch between your kernel's expected arguments:
__kernel void execute(__global struct body_t* bodies, const float G, const int n)
(a buffer containing an array of struct body_t, and 2 scalar values)
and what you actually pass to it:
bGravity = new cl::Buffer(*context, CL_MEM_READ_ONLY, sizeof(float));
bBodies = new cl::Buffer(*context, CL_MEM_READ_WRITE,
capacity * sizeof(body_t));
bSize = new cl::Buffer(*context, CL_MEM_READ_ONLY, sizeof(int));
fKernel->setArg(0, *bBodies);
fKernel->setArg(1, *bGravity);
fKernel->setArg(2, *bSize);
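G and n are scalar kernel parameters, so they should not be passed as buffers. Passing a cl::Buffer makes setArg submit an argument of size sizeof(cl_mem), which does not match the sizeof(float) the kernel expects for G, hence CL_INVALID_ARG_SIZE. Instead, pass the values directly; the following should do the trick: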
const float gravity = …;
const int32_t size = …;
fKernel->setArg(1, gravity);
fKernel->setArg(2, size);

Device pointer in a device class (Cuda C++)

I would like to implement a device side vector class which encapsulates a pointer to the elements of the container.
After I instantiate an object of this class I have no access to the inside pointer. It always says 'Access violation writing location some device memory address'.
My code is the following:
#include <iostream>
#include <cuda_runtime.h>
template <typename T>
class DeviceVector
{
private:
    T* m_bValues;
    std::size_t m_bSize;
public:
    void* operator new(std::size_t size)
    {
        DeviceVector<T>* object = nullptr;
        cudaMalloc((void**)&object, size);
        return object;
    }
    void operator delete(void* object)
    {
        cudaFree(object);
    }
    DeviceVector(std::size_t size = 1)
    {
        cudaMemcpy(&m_bSize, &size, sizeof(std::size_t), cudaMemcpyHostToDevice);
        // At this cudaMalloc I get Access violation writing location...
        cudaMalloc((void**)&m_bValues, size * sizeof(T));
        // It's an alternative solution here
        T* ptr;
        cudaMalloc((void**)&ptr, size * sizeof(T));
        cudaMemcpy(&m_bValues, &ptr, sizeof(T*), cudaMemcpyHostToDevice);
        // The memory is allocated
        // But I can't access it through m_bValues pointer
        // It is also Access violation writing location...
        cudaMemset(m_bValues, 0, size * sizeof(T));
    }
    ~DeviceVector()
    {
        // Access violation here if I use the second solution in the constructor
        cudaFree(m_bValues);
    }
};
int main()
{
    DeviceVector<int>* vec = new DeviceVector<int>();
    delete vec;
    return 0;
}
I have access to the size attribute.
So my questions are:
How to allocate memory for this class to get access to the pointer inside?
Is this even possible to encapsulate a pointer into a class on the device?
This line is illegal:
cudaMalloc((void**)&m_bValues, size * sizeof(T));
because your new operator allocated the object on the device:
cudaMalloc((void**)&object, size);
return object;
and the constructor was called to operate on that allocation. Therefore &m_bValues is taking the address of a device variable in host code which is illegal in CUDA. If you do that, and then attempt to use it in host code (i.e. the cudaMalloc operation), you're going to get a seg fault. cudaMalloc creates a device allocation of a particular size, and then stores the device pointer to that allocation in a variable that is expected to be resident on the host. If you pass it a device address to store that pointer into instead, cudaMalloc will segfault trying to write the pointer value.
Your alternative solution is a somewhat better approach, and is the general idea when it's necessary to copy a pointer to a device allocation to a variable resident on the device.
But you've still basically made the allocation that m_bValues points to inaccessible from the host. (ptr, being a temporary variable, won't help, and creating another variable in the class to hold a value like ptr won't help either, because the entire class is allocated and resident on the device.) For the same reason that you're not allowed to use &m_bValues in the previous cudaMalloc operation, you won't be able to use it directly in any other host code (except as the target for a cudaMemcpy host->device when copying the pointer value itself).
I don't think there are any simple fixes for this. I suggest re-crafting the object to live on the host, and provide appropriate host- and device-side allocations for corresponding pointers and parameters (like size).
It also seems like you're re-inventing the wheel. You might want to investigate thrust device vectors (which are easily usable with ordinary CUDA code.)
Anyway, this was the closest I could come up with:
#include <iostream>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
template <typename T>
class DeviceVector
{
private:
    T* m_bValues;          // device pointer, stored in a host-resident object
    std::size_t m_bSize;
    std::size_t eleSize;
public:
    void* operator new(std::size_t size)
    {
        // The object itself lives on the host; size is already in bytes.
        DeviceVector<T>* object = NULL;
        object = (DeviceVector<T> *)malloc(size);
        return object;
    }
    void operator delete(void* object)
    {
        free(object);
    }
    DeviceVector(std::size_t size = 1)
    {
        m_bSize = size;
        eleSize = sizeof(T);
        cudaMalloc(&m_bValues, m_bSize*sizeof(T));
        cudaCheckErrors("constructor cudaMalloc fail");
        cudaMemset(m_bValues, 0, m_bSize*sizeof(T));
        cudaCheckErrors("constructor cudaMemset fail");
    }
    ~DeviceVector()
    {
        cudaFree(m_bValues);
        cudaCheckErrors("destructor cudaFree fail");
    }
    T* getDevPtr(){
        return m_bValues;}
    std::size_t getSize(){
        return m_bSize;}
    std::size_t geteleSize(){
        return eleSize;}
};
int main()
{
    DeviceVector<int>* vec = new DeviceVector<int>();
    cudaMemset(vec->getDevPtr(), 0xFF, vec->getSize()*vec->geteleSize());
    cudaCheckErrors("vector fill fail");
    delete vec;
    return 0;
}
You've shown very little about how you want to interact with an object of this class, so I'm just guessing here.

How to access a class from one cuda kernel in the next kernel

I have a dev variable which I used to allocate space on the device using a class header.
Neu *dev_NN;
cudaStatus = cudaMalloc((void**)&dev_NN, sizeof(Neu));
Then I call a kernel which initialises the class on the GPU.
KGNN<<<1, threadsPerBlock>>>(dev_LaySze, dev_NN);
in the kernel
__global__ void KGNN(int * dev_LaySze, Neu * NN)
{
    NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
}
After the return of this kernel I want to use another kernel to input data to class methods and retrieve output data (the allocators and copies are already done and work), such as
__global__ void KGFF(double *dev_inp, double *dev_outp, int *DataSize)
{
    int i = threadIdx.x;
    NN.Analyse(dev_inp, dev_outp, DataSize );
}
The second kernel knows nothing about the class that was created. As you would expect NN is unrecognised. How do I access the first NN without re-creating the class and re-initialising it? The second kernel has to be called several times, remembering the changes it made to the class variables earlier. I don't want to use the class with the CPU, only the GPU, and I don't want to pass it back and forth each time.
I don't think this has anything to do with CUDA, actually. I believe a similar problem would be observed if you tried this in ordinary C++ (assuming the pointer to NN is not a global variable).
The key aspect of the solution as pointed out by Park Young-Bae is simply to pass the pointer to the allocated space for NN to both kernels. There were a few other changes that I think needed to be made to what you have shown, according to my understanding of what you are trying to do (since you haven't posted a complete code.) Here's a fully worked example:
$ cat t635.cu
#include <stdio.h>
class MC {
  int md;
public:
  __host__ __device__ int get_md() { return md;}
  __host__ __device__ MC(int val) { md = val; }
};
__global__ void kernel1(MC *d){
  *d = MC(3);
}
__global__ void kernel2(MC *d){
  printf("val = %d\n", d->get_md());
}
int main(){
  MC *d_obj;
  cudaMalloc(&d_obj, sizeof(MC));
  kernel1<<<1,1>>>(d_obj);
  kernel2<<<1,1>>>(d_obj);
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -arch=sm_20 -o t635 t635.cu
$ ./t635
val = 3
The other changes I suggest:
in your first kernel, you're passing a pointer (NN) (which presumably you have made a device allocation for), and then you are creating an object and copying that object to the allocated space. In that case I think you need:
*NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
in your second kernel, if NN is a pointer, we must use:
NN->Analyse(dev_inp, dev_outp, DataSize );
I have made those two changes to my posted example. Again, I think this is all just C++ mechanics, not anything specific to CUDA.

C++ data structures and CUDA

I have a structure which can be
struct type1{ double a,b,c;};
or it can be
struct type2{ double a,b,c,d,e;};
In my host function of CUDA code I have something like
void compute(){
    // some code
    // data on devices (up to 10)
    type *xxx[10]; // this is where I want either type1 or type2 structures
                   // the "type" is not known at compile time but I want to
                   // determine it at runtime based on some input variable. This
                   // part is not real code; rather, it is what I want to achieve.
    int DevUsed; // some code to give value to int DevUsed
    for(int iDev=0;iDev<DevUsed;iDev++){
        // set cuda device
        if ( cudaMalloc(&xxx[iDev], sizeof(type)) != cudaSuccess ) {
            // print error message
        }
        cudaMemcpy(xxx[iDev], pIF1, sizeof(type), cudaMemcpyHostToDevice);
        function2<<<grid, block>>>(xxx[iDev]); // where function2 is the kernel
    }
}
My question: what is a way to select between the type1 and type2 data structs with generic code like "type *xxx[10];"?
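C++ templates are designed for this situation. Each instantiation is fixed at compile time, so the runtime choice reduces to a branch that calls either compute<type1>() or compute<type2>().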
template <class T>
void compute(){
    // some code
    // data on devices (up to 10)
    T *xxx[10]; // device pointers to either type1 or type2 structures;
                // T is fixed per instantiation, and the caller picks the
                // instantiation at runtime based on some input variable.
    int DevUsed; // some code to give value to int DevUsed
    for(int iDev=0;iDev<DevUsed;iDev++){
        // set cuda device
        if ( cudaMalloc(&xxx[iDev], sizeof(T)) != cudaSuccess ) {
            // print error message
        }
        cudaMemcpy(xxx[iDev], pIF1, sizeof(T), cudaMemcpyHostToDevice);
        function2<<<grid, block>>>(xxx[iDev]); // where function2 is the kernel
    }
}
Please note that you will also need a kernel template for these two types, like
template <class T>
__global__ void function2(T* x);

CUDA function call-able by either the device or host

I have a re-useable function in some CUDA code that needs to be called from both the device and the host. Is there an appropriate qualifier for this?
e.g. what's the correct definition for func1 in this case:
int func1 (int a, int b) {
    return a+b;
}
__global__ void devicecode (float *A) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    A[i] = func1(i,i);
}
int main() {
    // Normal cuda memory set-up
    // Call func1 from inside main:
    int j = func1(2,4);
    // Normal cuda memory copy / program run / retrieve data
}
So far I can only get this to work by having the function twice: once explicitly for the device and once for the host. Is there a better way?
From the CUDA Programming Guide:
"The __device__ and __host__ qualifiers can be used together however, in which case the function is compiled for both the host and the device."
So func1 only needs to be written once, declared as __host__ __device__ int func1(int a, int b).