Can't use my template class in cuda kernel - c++

I thought I knew how to write some clean CUDA code, until I tried to make a simple template class and use it in a simple kernel.
I've been troubleshooting for days. Every single thread I've visited has made me feel a little more stupid.
For error checking I used the usual gpuErrchk macro.
Here is my class.h:
#pragma once
template <typename T>
class MyArray
{
public:
const int size;
T *data;
__host__ MyArray(int size); //gpuErrchk(cudaMalloc(&data, size * sizeof(T)));
__device__ __host__ T GetValue(int); //return data[i]
__device__ __host__ void SetValue(T, int); //data[i] = val;
__device__ __host__ T& operator()(int); //return data[i];
~MyArray(); //gpuErrchk(cudaFree(data));
};
template class MyArray<double>;
The relevant content of class.cu is in the comments. If you think the whole thing is relevant, I'd be happy to add it.
Now for the main code:
__global__ void test(MyArray<double> array, double *data, int size)
{
int j = threadIdx.x;
//array.SetValue(1, j); //doesn't work
//array(j) = 1; //doesn't work
//array.data[j] = 1; //doesn't work
data[j] = 1; //This does work !
printf("Reach this code\n");
}
int main(int argc, char **argv)
{
MyArray<double> x(20);
test<<<1, 20>>>(x, x.data, 20);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
}
When I say "doesn't work", I mean that the program stops there (before reaching the printf) without outputting any error. Plus I get the following error both from cudaDeviceSynchronize and from cudaFree:
an illegal memory access was encountered
What I can't understand is that there should be no issue with memory management, since sending the array directly to the kernel works fine. So why doesn't it work when I send a class and try to access the class's data? And why do I receive no warning or error message when my code clearly ran into some error?
Here is the output of nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

(Editorial note: there is quite a bit of disinformation in the comments on this question, so I have assembled an answer as a community wiki entry.)
There is no particular reason why a template class cannot be passed as an argument to a kernel. There are some limitations which need to be clearly understood before doing so:
1. CUDA kernel arguments are, for all intents and purposes, always passed by value. Pass by reference is supported only under an extremely limited set of circumstances (the argument in question must be stored in managed memory). That does not apply here.
2. As a result of (1), POD arguments just work, because they are trivially copyable and rely on no special behaviour.
3. Classes are different, in that when you pass a class by value you are implicitly invoking copy construction or move construction semantics. That means that classes passed by value as kernel arguments must be trivially copy constructible. There is no way to run non-trivial copy constructors on the device as part of a kernel launch.
4. CUDA further requires that classes passed as kernel arguments do not contain virtual members.
5. Although the <<< >>> kernel launch syntax looks like a simple function call, it isn't. There are several layers of abstraction boilerplate and an API call between what you write in host code and what is actually emitted by the toolchain on the host side. This means that there are several copy construction operations between your code and the GPU. If you do something like put a cudaFree call in your destructor, you should assume that it will get called as part of the function call sequence which launches a kernel, when one of those copies falls out of scope. You do not want that.
You did not show how the class member functions were actually implemented, so it is impossible to say exactly why each of the permutations hinted at in your code comments did or did not work. What can be said is that passing the raw pointer to the kernel works because it is a trivially copyable POD value, while the class almost certainly was not (a quick compile-time check for this is sketched below).
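As an aside (my own addition, not part of the original answer), an approximate compile-time check for that requirement is a static_assert on the standard type traits, placed after the class definition. Applied to the fixed class in the example below it passes; applied to the original design, whose user-defined destructor makes it non-trivially copyable, it fails:
#include <type_traits>
// Approximate check: a type that fails this assertion is not safe to pass by
// value as a kernel argument (CUDA additionally forbids virtual members).
static_assert(std::is_trivially_copyable<MyArray<double>>::value,
              "MyArray<double> cannot safely be passed by value to a kernel");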
Here is a simple, complete example showing how to make this work:
$ cat classy.cu
#include <vector>
#include <iostream>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
template <typename T>
class MyArray
{
public:
int len;
T *data;
__device__ __host__ void SetValue(T val, int i) { data[i] = val; };
__device__ __host__ int size() { return sizeof(T) * len; };
__host__ void DevAlloc(int N) {
len = N;
gpuErrchk(cudaMalloc(&data, size()));
};
__host__ void DevFree() {
gpuErrchk(cudaFree(data));
len = -1;
};
};
__global__ void test(MyArray<double> array, double val)
{
int j = threadIdx.x;
if (j < array.len)
array.SetValue(val, j);
}
int main(int argc, char **argv)
{
const int N = 20;
const double val = 5432.1;
gpuErrchk(cudaSetDevice(0));
gpuErrchk(cudaFree(0));
MyArray<double> x;
x.DevAlloc(N);
test<<<1, 32>>>(x, val);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
std::vector<double> y(N);
gpuErrchk(cudaMemcpy(&y[0], x.data, x.size(), cudaMemcpyDeviceToHost));
x.DevFree();
for(int i=0; i<N; ++i) std::cout << i << " = " << y[i] << std::endl;
return 0;
}
which compiles and runs like so:
$ nvcc -std=c++11 -arch=sm_53 -o classy classy.cu
$ cuda-memcheck ./classy
========= CUDA-MEMCHECK
0 = 5432.1
1 = 5432.1
2 = 5432.1
3 = 5432.1
4 = 5432.1
5 = 5432.1
6 = 5432.1
7 = 5432.1
8 = 5432.1
9 = 5432.1
10 = 5432.1
11 = 5432.1
12 = 5432.1
13 = 5432.1
14 = 5432.1
15 = 5432.1
16 = 5432.1
17 = 5432.1
18 = 5432.1
19 = 5432.1
========= ERROR SUMMARY: 0 errors
(CUDA 10.2/gcc 7.5 on a Jetson Nano)
Note that I have included host side functions for allocation and deallocation which do not interact with the constructor and destructor. Otherwise the class is extremely similar to your design and has the same properties.

Related

CUDA Illegal memory access when using virtual class [duplicate]

This is my first question on Stack Overflow, and it's quite a long question. The tl;dr version is: How do I work with a thrust::device_vector<BaseClass> if I want it to store objects of different types DerivedClass1, DerivedClass2, etc, simultaneously?
I want to take advantage of polymorphism with CUDA Thrust. I'm compiling for an -arch=sm_30 GPU (GeForce GTX 670).
Let us take a look at the following problem: Suppose there are 80 families in town. 60 of them are married couples, 20 of them are single-parent households. Each family has, therefore, a different number of members. It's census time and households have to state the parents' ages and the number of children they have. Therefore, an array of Family objects is constructed by the government, namely thrust::device_vector<Family> familiesInTown(80), such that information of families familiesInTown[0] to familiesInTown[59] corresponds to married couples, the rest (familiesInTown[60] to familiesInTown[79]) being single-parent households.
Family is the base class - the number of parents in the household (1 for single parents and 2 for couples) and the number of children they have are stored here as members.
SingleParent, derived from Family, includes a new member - the single parent's age, unsigned int ageOfParent.
MarriedCouple, also derived from Family, however, introduces two new members - both parents' ages, unsigned int ageOfParent1 and unsigned int ageOfParent2.
#include <iostream>
#include <stdio.h>
#include <thrust/device_vector.h>
class Family
{
protected:
unsigned int numParents;
unsigned int numChildren;
public:
__host__ __device__ Family() {};
__host__ __device__ Family(const unsigned int& nPars, const unsigned int& nChil) : numParents(nPars), numChildren(nChil) {};
__host__ __device__ virtual ~Family() {};
__host__ __device__ unsigned int showNumOfParents() {return numParents;}
__host__ __device__ unsigned int showNumOfChildren() {return numChildren;}
};
class SingleParent : public Family
{
protected:
unsigned int ageOfParent;
public:
__host__ __device__ SingleParent() {};
__host__ __device__ SingleParent(const unsigned int& nChil, const unsigned int& age) : Family(1, nChil), ageOfParent(age) {};
__host__ __device__ unsigned int showAgeOfParent() {return ageOfParent;}
};
class MarriedCouple : public Family
{
protected:
unsigned int ageOfParent1;
unsigned int ageOfParent2;
public:
__host__ __device__ MarriedCouple() {};
__host__ __device__ MarriedCouple(const unsigned int& nChil, const unsigned int& age1, const unsigned int& age2) : Family(2, nChil), ageOfParent1(age1), ageOfParent2(age2) {};
__host__ __device__ unsigned int showAgeOfParent1() {return ageOfParent1;}
__host__ __device__ unsigned int showAgeOfParent2() {return ageOfParent2;}
};
If I were to naïvely initialize the objects in my thrust::device_vector<Family> with the following functors:
struct initSlicedCouples : public thrust::unary_function<unsigned int, MarriedCouple>
{
__device__ MarriedCouple operator()(const unsigned int& idx) const
// I use a thrust::counting_iterator to get idx
{
return MarriedCouple(idx % 3, 20 + idx, 19 + idx);
// Couple 0: Ages 20 and 19, no children
// Couple 1: Ages 21 and 20, 1 child
// Couple 2: Ages 22 and 21, 2 children
// Couple 3: Ages 23 and 22, no children
// etc
}
};
struct initSlicedSingles : public thrust::unary_function<unsigned int, SingleParent>
{
__device__ SingleParent operator()(const unsigned int& idx) const
{
return SingleParent(idx % 3, 25 + idx);
}
};
int main()
{
unsigned int Num_couples = 60;
unsigned int Num_single_parents = 20;
thrust::device_vector<Family> familiesInTown(Num_couples + Num_single_parents);
// Families [0] to [59] are couples. Families [60] to [79] are single-parent households.
thrust::transform(thrust::counting_iterator<unsigned int>(0),
thrust::counting_iterator<unsigned int>(Num_couples),
familiesInTown.begin(),
initSlicedCouples());
thrust::transform(thrust::counting_iterator<unsigned int>(Num_couples),
thrust::counting_iterator<unsigned int>(Num_couples + Num_single_parents),
familiesInTown.begin() + Num_couples,
initSlicedSingles());
return 0;
}
I would definitely be guilty of some classic object slicing...
So, I asked myself, what about a vector of pointers that may give me some sweet polymorphism? Smart pointers in C++ are a thing, and thrust iterators can do some really impressive things, so let's give it a shot, I figured. The following code compiles.
struct initCouples : public thrust::unary_function<unsigned int, MarriedCouple*>
{
__device__ MarriedCouple* operator()(const unsigned int& idx) const
{
return new MarriedCouple(idx % 3, 20 + idx, 19 + idx); // Memory issues?
}
};
struct initSingles : public thrust::unary_function<unsigned int, SingleParent*>
{
__device__ SingleParent* operator()(const unsigned int& idx) const
{
return new SingleParent(idx % 3, 25 + idx);
}
};
int main()
{
unsigned int Num_couples = 60;
unsigned int Num_single_parents = 20;
thrust::device_vector<Family*> familiesInTown(Num_couples + Num_single_parents);
// Families [0] to [59] are couples. Families [60] to [79] are single-parent households.
thrust::transform(thrust::counting_iterator<unsigned int>(0),
thrust::counting_iterator<unsigned int>(Num_couples),
familiesInTown.begin(),
initCouples());
thrust::transform(thrust::counting_iterator<unsigned int>(Num_couples),
thrust::counting_iterator<unsigned int>(Num_couples + Num_single_parents),
familiesInTown.begin() + Num_couples,
initSingles());
Family A = *(familiesInTown[2]); // Compiles, but object slicing takes place (in theory)
std::cout << A.showNumOfParents() << "\n"; // Segmentation fault
return 0;
}
Seems like I've hit a wall here. Am I understanding memory management correctly? (VTables, etc). Are my objects being instantiated and populated on the device? Am I leaking memory like there is no tomorrow?
For what it's worth, in order to avoid object slicing, I tried with a dynamic_cast<DerivedPointer*>(basePointer). That's why I made my Family destructor virtual.
Family *pA = familiesInTown[2];
MarriedCouple *pB = dynamic_cast<MarriedCouple*>(pA);
The following lines compile, but, unfortunately, a segfault is thrown again. CUDA-Memcheck won't tell me why.
std::cout << "Ages " << (pB -> showAgeOfParent1()) << ", " << (pB -> showAgeOfParent2()) << "\n";
and
MarriedCouple B = *pB;
std::cout << "Ages " << B.showAgeOfParent1() << ", " << B.showAgeOfParent2() << "\n";
In short, what I need is a class interface for objects that will have different properties, with different numbers of members among each other, but that I can store in one common vector (that's why I want a base class) that I can manipulate on the GPU. My intention is to work with them both in thrust transformations and in CUDA kernels via thrust::raw_pointer_casting, which has worked flawlessly for me until I've needed to branch out my classes into a base one and several derived ones. What is the standard procedure for that?
Thanks in advance!
I am not going to attempt to answer everything in this question; it is just too large. Having said that, here are some observations about the code you posted which might help:
The GPU side new operator allocates memory from a private runtime heap. As of CUDA 6, that memory cannot be accessed by the host side CUDA APIs. You can access the memory from within kernels and device functions, but that memory cannot be accessed by the host. So using new inside a thrust device functor is a broken design that can never work. That is why your "vector of pointers" model fails.
Thrust is fundamentally intended to allow data parallel versions of typical STL algorithms to be applied to POD types. Building a codebase using complex polymorphic objects and trying to cram those through Thrust containers and algorithms might be made to work, but it isn't what Thrust was designed for, and I wouldn't recommend it. Don't be surprised if you break thrust in unexpected ways if you do.
CUDA supports a lot of C++ features, but the compilation and object models are much simpler than even the C++98 standard upon which they are based. CUDA lacks several key features (RTTI for example) which make complex polymorphic object designs workable in C++. My suggestion is use C++ features sparingly. Just because you can do something in CUDA doesn't mean you should. The GPU is a simple architecture and simple data structures and code are almost always more performant than functionally similar complex objects.
Having skim read the code you posted, my overall recommendation is to go back to the drawing board. If you want to look at some very elegant CUDA/C++ designs, spend some time reading the code bases of CUB and CUSP. They are both very different, but there is a lot to learn from both (and CUSP is built on top of Thrust, which makes it even more relevant to your usage case, I suspect).
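To illustrate the "simple data structures" recommendation above (this sketch is my own addition, not part of the answer), the Family hierarchy from the question can be flattened into plain structure-of-arrays data that Thrust handles naturally; the names below are illustrative:
#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/sequence.h>
// Structure-of-arrays replacement for the polymorphic Family hierarchy.
// Single-parent households simply leave ageOfParent2 at zero.
struct FamiliesSoA {
    thrust::device_vector<unsigned int> numParents;
    thrust::device_vector<unsigned int> numChildren;
    thrust::device_vector<unsigned int> ageOfParent1;
    thrust::device_vector<unsigned int> ageOfParent2;
    FamiliesSoA(int n) : numParents(n), numChildren(n), ageOfParent1(n), ageOfParent2(n, 0) {}
};
int main() {
    const int numCouples = 60, numSingles = 20;
    FamiliesSoA f(numCouples + numSingles);
    // Couples occupy [0, 60): two parents with ages 20+i and 19+i.
    thrust::fill(f.numParents.begin(), f.numParents.begin() + numCouples, 2);
    thrust::sequence(f.ageOfParent1.begin(), f.ageOfParent1.begin() + numCouples, 20);
    thrust::sequence(f.ageOfParent2.begin(), f.ageOfParent2.begin() + numCouples, 19);
    // Single parents occupy [60, 80): one parent with age 25+i (i being the global index).
    thrust::fill(f.numParents.begin() + numCouples, f.numParents.end(), 1);
    thrust::sequence(f.ageOfParent1.begin() + numCouples, f.ageOfParent1.end(), 25 + numCouples);
    // numChildren (idx % 3 in the question) would be filled with an ordinary transform.
    return 0;
}
Flat arrays like these can be passed to kernels via thrust::raw_pointer_cast without any of the vtable or slicing issues discussed above.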
I completely agree with @talonmies' answer. (e.g. I don't know that thrust has been extensively tested with polymorphism.) Furthermore, I have not fully parsed your code. I post this answer to add additional info, in particular that I believe some level of polymorphism can be made to work with thrust.
A key observation I would make is that you are not allowed to pass an object of a class with virtual functions as an argument to a __global__ function. This means that polymorphic objects created on the host cannot be passed to the device (via thrust, or in ordinary CUDA C++). (One basis for this limitation is the requirement for virtual function tables in the objects, which will necessarily be different between host and device, coupled with the fact that it is illegal to directly take the address of a device function in host code.)
However, polymorphism can work in device code, including thrust device functions.
The following example demonstrates this idea, restricting ourselves to objects created on the device although we can certainly initialize them with host data. I have created two classes, Triangle and Rectangle, derived from a base class Polygon which includes a virtual function area. Triangle and Rectangle inherit the function set_values from the base class but replace the virtual area function.
We can then manipulate objects of those classes polymorphically as demonstrated here:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/sequence.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/copy.h>
#define N 4
class Polygon {
protected:
int width, height;
public:
__host__ __device__ void set_values (int a, int b)
{ width=a; height=b; }
__host__ __device__ virtual int area ()
{ return 0; }
};
class Rectangle: public Polygon {
public:
__host__ __device__ int area ()
{ return width * height; }
};
class Triangle: public Polygon {
public:
__host__ __device__ int area ()
{ return (width * height / 2); }
};
struct init_f {
template <typename Tuple>
__host__ __device__ void operator()(const Tuple &arg) {
(thrust::get<0>(arg)).set_values(thrust::get<1>(arg), thrust::get<2>(arg));}
};
struct setup_f {
template <typename Tuple>
__host__ __device__ void operator()(const Tuple &arg) {
if (thrust::get<0>(arg) == 0)
thrust::get<1>(arg) = &(thrust::get<2>(arg));
else
thrust::get<1>(arg) = &(thrust::get<3>(arg));}
};
struct area_f {
template <typename Tuple>
__host__ __device__ void operator()(const Tuple &arg) {
thrust::get<1>(arg) = (thrust::get<0>(arg))->area();}
};
int main () {
thrust::device_vector<int> widths(N);
thrust::device_vector<int> heights(N);
thrust::sequence( widths.begin(), widths.end(), 2);
thrust::sequence(heights.begin(), heights.end(), 3);
thrust::device_vector<Rectangle> rects(N);
thrust::device_vector<Triangle> trgls(N);
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(rects.begin(), widths.begin(), heights.begin())), thrust::make_zip_iterator(thrust::make_tuple(rects.end(), widths.end(), heights.end())), init_f());
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(trgls.begin(), widths.begin(), heights.begin())), thrust::make_zip_iterator(thrust::make_tuple(trgls.end(), widths.end(), heights.end())), init_f());
thrust::device_vector<Polygon *> polys(N);
thrust::device_vector<int> selector(N);
for (int i = 0; i<N; i++) selector[i] = i%2;
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(selector.begin(), polys.begin(), rects.begin(), trgls.begin())), thrust::make_zip_iterator(thrust::make_tuple(selector.end(), polys.end(), rects.end(), trgls.end())), setup_f());
thrust::device_vector<int> areas(N);
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(polys.begin(), areas.begin())), thrust::make_zip_iterator(thrust::make_tuple(polys.end(), areas.end())), area_f());
thrust::copy(areas.begin(), areas.end(), std::ostream_iterator<int>(std::cout, "\n"));
return 0;
}
I suggest compiling the above code for a cc2.0 or newer architecture. I tested with CUDA 6 on RHEL 5.5.
(The polymorphic example idea, and some of the code, was taken from here.)

Tensorflow GPU new op memory allocation

I am trying to create a new tensorflow GPU op following the instructions on their website.
Looking at their example, it seems they feed a C++ pointer directly into the CUDA kernel without allocating device memory and copying the contents of the host pointer to the device pointer.
From what I understand of CUDA you always have to allocate memory on the device and then use device pointers inside the kernels.
What am I missing? I checked that input_tensor.flat<T>().data() should return a regular C++ pointer. Here is a copy of the code I am referring to:
// kernel_example.cu.cc
#ifdef GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "example.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"
using namespace tensorflow;
using GPUDevice = Eigen::GpuDevice;
// Define the CUDA kernel.
template <typename T>
__global__ void ExampleCudaKernel(const int size, const T* in, T* out) {
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
i += blockDim.x * gridDim.x) {
out[i] = 2 * ldg(in + i);
}
}
// Define the GPU implementation that launches the CUDA kernel.
template <typename T>
void ExampleFunctor<GPUDevice, T>::operator()(
const GPUDevice& d, int size, const T* in, T* out) {
// Launch the cuda kernel.
//
// See core/util/cuda_kernel_helper.h for example of computing
// block count and thread_per_block count.
int block_count = 1024;
int thread_per_block = 20;
ExampleCudaKernel<T>
<<<block_count, thread_per_block, 0, d.stream()>>>(size, in, out);
}
// Explicitly instantiate functors for the types of OpKernels registered.
template struct ExampleFunctor<GPUDevice, float>;
template struct ExampleFunctor<GPUDevice, int32>;
#endif // GOOGLE_CUDA
When you look at https://www.tensorflow.org/extend/adding_an_op, you will see in these code lines that the allocation is done in kernel_example.cc:
void Compute(OpKernelContext* context) override {
// Grab the input tensor
const Tensor& input_tensor = context->input(0);
// Create an output tensor
Tensor* output_tensor = NULL;
OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
&output_tensor));
// Do the computation.
OP_REQUIRES(context, input_tensor.NumElements() <= tensorflow::kint32max,
errors::InvalidArgument("Too many elements in tensor"));
ExampleFunctor<Device, T>()(
context->eigen_device<Device>(),
static_cast<int>(input_tensor.NumElements()),
input_tensor.flat<T>().data(),
output_tensor->flat<T>().data());
}
In context->allocate_output(....) they hand over a reference to the output Tensor, which is then allocated. The context knows whether it is running on the GPU or the CPU and allocates the tensor accordingly, either on the host or on the device. The pointer handed over to CUDA then just points to the actual data within the Tensor class.
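For comparison (this sketch is my own, not part of the TensorFlow example), if the data lived only in host memory the op would have to do the explicit allocate/copy/launch/copy-back work itself; the kernel name doubleIt and the sizes below are purely illustrative:
#include <cstdio>
#include <vector>
// Stand-alone equivalent of the manual flow: the host owns the data, so device
// buffers must be allocated and filled before the kernel can touch them. With
// context->allocate_output the tensor already lives in GPU memory, so none of
// this is needed inside the op.
__global__ void doubleIt(int size, const float *in, float *out) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size; i += blockDim.x * gridDim.x)
        out[i] = 2.0f * in[i];
}
int main() {
    const int size = 16;
    std::vector<float> h_in(size, 1.0f), h_out(size);
    float *d_in, *d_out;
    cudaMalloc(&d_in, size * sizeof(float));
    cudaMalloc(&d_out, size * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), size * sizeof(float), cudaMemcpyHostToDevice);
    doubleIt<<<4, 4>>>(size, d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    printf("h_out[0] = %f\n", h_out[0]);
    return 0;
}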

CUDA C++11, array of lambdas, function by index, not working

I am having trouble trying to make a CUDA program manage an array of lambdas by their index. Here is example code that reproduces the problem:
#include <cuda.h>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <cassert>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
if (code != cudaSuccess) {
fprintf(stderr,"GPUassert: %s %s %d\n",
cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
template<typename Lambda>
__global__ void kernel(Lambda f){
int t = blockIdx.x * blockDim.x + threadIdx.x;
printf("device: thread %i: ", t);
printf("f() = %i\n", f() );
}
int main(int argc, char **argv){
// arguments
if(argc != 2){
fprintf(stderr, "run as ./prog i\nwhere 'i' is function index");
exit(EXIT_FAILURE);
}
int i = atoi(argv[1]);
// lambdas
auto lam0 = [] __host__ __device__ (){ return 333; };
auto lam1 = [] __host__ __device__ (){ return 777; };
// make vector of functions
std::vector<int(*)()> v;
v.push_back(lam0);
v.push_back(lam1);
// host: calling a function by index
printf("host: f() = %i\n", (*v[i])() );
// device: calling a function by index
kernel<<< 1, 1 >>>( v[i] ); // does not work
//kernel<<< 1, 1 >>>( lam0 ); // does work
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );
return EXIT_SUCCESS;
}
Compiling with
nvcc -arch sm_60 -std=c++11 --expt-extended-lambda main.cu -o prog
The error I get when running is
➜ cuda-lambda ./prog 0
host: f() = 333
device: GPUassert: invalid program counter main.cu 53
It seems that CUDA cannot manage the int(*)() function pointer form (while host C++ does work properly). On the other hand, each lambda is managed as a different data type, even if they are identical in code and have the same contract. Then, how can we achieve function by index in CUDA?
There are a few considerations here.
Although you suggest wanting to "manage an array of lambdas", you are actually relying on the graceful conversion of a lambda to a function pointer (possible when the lambda does not capture).
When you mark something as __host__ __device__, you are declaring to the compiler that two copies of said item need to be compiled (with two obviously different entry points): one for the CPU, and one for the GPU.
When we take a __host__ __device__ lambda and ask it to degrade to a function pointer, we are then left with the question "which function pointer (entry point) to choose?" The compiler no longer has the option of carrying the (experimental) lambda object around, so it must choose one or the other (host or device, CPU or GPU) for your vector. Whichever one it chooses, the vector could (will) break if used in the wrong environment.
One takeaway from this is that your two test cases are not the same. In one case (broken) you are passing a function pointer to the kernel (so the kernel is templated to accept a function pointer argument) and in the other case (working) you are passing a lambda to the kernel (so the kernel is templated to accept a lambda argument).
The problem here, in my view, does not arise simply from the use of a container, but from the type of container you are using. I can demonstrate this in a simple way (see below) by converting your vector to a vector of the actual lambda type. In that case, we can make the code "work" (sort of), but since every lambda has a unique type, this is an uninteresting demonstration. We can create a multi-element vector, but the only element we can store in it is one of your two lambdas (not both at the same time).
If we use a container that can handle dissimilar types (e.g. std::tuple), perhaps we can make some progress here, but I know of no direct method to index through the elements of such a container. Even if we could, the template kernel accepting lambda as argument/template type would have to be instantiated for each lambda.
In my view, function pointers avoid this particular type "messiness".
Therefore, as an answer to this question:
Then, how can we achieve function by index in CUDA?
I would suggest for the time being that function by index in host code be separated (e.g. two separate containers) from function by index in device code, and for function by index in device code, you use any of the techniques (which don't use or depend on lambdas) covered in other questions, such as this one.
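As a sketch of one such device-side technique (my own addition, not code from this answer), a statically initialized table of __device__ function pointers can be indexed from inside the kernel; the names f0, f1 and d_table are illustrative:
#include <cstdio>
#include <cstdlib>
__device__ int f0() { return 333; }
__device__ int f1() { return 777; }
typedef int (*func_t)();
// The table is initialized at device scope, so the stored addresses are valid
// device-side entry points.
__device__ func_t d_table[2] = { f0, f1 };
__global__ void kernel(int i) {
    func_t f = d_table[i];   // select the function by index on the device
    printf("device: f() = %d\n", f());
}
int main(int argc, char **argv) {
    int i = (argc > 1) ? atoi(argv[1]) : 0;
    kernel<<<1, 1>>>(i);
    cudaDeviceSynchronize();
    return 0;
}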
Here is a worked example (I think) demonstrating the earlier note that we can create a vector of the lambda "type" and use the resulting element(s) from that vector as lambdas in both host and device code:
$ cat t64.cu
#include <cuda.h>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <cassert>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
if (code != cudaSuccess) {
fprintf(stderr,"GPUassert: %s %s %d\n",
cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
template<typename Lambda>
__global__ void kernel(Lambda f){
int t = blockIdx.x * blockDim.x + threadIdx.x;
printf("device: thread %i: ", t);
printf("f() = %i\n", f() );
}
template <typename T>
std::vector<T> fill(T L0, T L1){
std::vector<T> v;
v.push_back(L0);
v.push_back(L1);
return v;
}
int main(int argc, char **argv){
// arguments
if(argc != 2){
fprintf(stderr, "run as ./prog i\nwhere 'i' is function index");
exit(EXIT_FAILURE);
}
int i = atoi(argv[1]);
// lambdas
auto lam0 = [] __host__ __device__ (){ return 333; };
auto lam1 = [] __host__ __device__ (){ return 777; };
auto v = fill(lam0, lam0);
// make vector of functions
// std::vector< int(*)()> v;
// v.push_back(lam0);
// v.push_back(lam1);
// host: calling a function by index
printf("host: f() = %i\n", (*v[i])() );
// device: calling a function by index
kernel<<< 1, 1 >>>( v[i] ); // does not work
//kernel<<< 1, 1 >>>( lam0 ); // does work
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );
return EXIT_SUCCESS;
}
$ nvcc -arch sm_61 -std=c++11 --expt-extended-lambda t64.cu -o t64
$ cuda-memcheck ./t64 0
========= CUDA-MEMCHECK
host: f() = 333
device: thread 0: f() = 333
========= ERROR SUMMARY: 0 errors
$ cuda-memcheck ./t64 1
========= CUDA-MEMCHECK
host: f() = 333
device: thread 0: f() = 333
========= ERROR SUMMARY: 0 errors
$
As already mentioned, this code is not sensible code. It is presented merely to prove a particular point.

Implicit constructor in CUDA kernel call

I'm trying to pass some POD to a kernel whose parameters are non-POD types with non-explicit constructors. The idea behind that is: allocate some memory on the host, pass the memory to the kernel, and have it encapsulated in the objects without the user explicitly doing that step.
The constructors are marked as __device__ code, but they are not called when passing the parameters, and I can't figure out why.
My question is not really related about how should I do the thing, but trying to understand what's happening behind the scenes.
Here is an example (I'm using CUDA 5 with a GPU of compute capability 2.1, hence the printf).
#include <stdio.h>
struct Test {
__device__ Test() {
printf("Default\n"),
_n = 0;
}
__device__ Test(int n) {
printf("Construct %d\n", n);
_n = n;
}
__device__ Test(const Test &t) {
printf("Copy constr %d\n", t._n);
_n = t._n;
}
__device__ Test &operator=(const Test &t) {
printf("Assignment %d\n", t._n);
_n = t._n;
return *this;
}
__device__ int calc() const {
printf("Calculating %d\n", threadIdx.x + 10 * _n);
return threadIdx.x + 10 * _n;
}
int _n;
};
__global__ void dosome(Test a, Test b) {
printf("Kernel data %d %d\n", a._n, b._n);
a.calc();
b.calc();
}
int main(int argc, char **argv) {
dosome<<<1, 2>>>(2, 3);
cudaError_t cudaerr = cudaDeviceSynchronize();
if (cudaerr != cudaSuccess)
printf("kernel launch failed with error:\n\t%s\n",cudaGetErrorString(cudaerr));
return 0;
}
EDIT: I forgot to say that none of the constructor messages are printed, but the calc and kernel messages are.
EDIT2: Is it guaranteed that CUDA will initialize a Test object before copying it to the device?
You have to think of a constructor just like a normal method. If you qualify it with __host__, you'll be able to call it host-side. If you qualify it with __device__, you'll be able to call it device-side. If you qualify it with both, you'll be able to call it on both sides.
What happens when you do dosome<<<1, 2>>>(2, 3); is that the two objects are implicitly constructed (because your constructor is not explicit, so maybe that's confusing you too) host side and then memcpy'd to the device. There is no copy constructor involved in the process.
Let's illustrate this:
__global__ void dosome(Test a, Test b) {
a.calc();
b.calc();
}
int main(int argc, char **argv) {
dosome<<<1, 2>>>(2, 3); // Constructors must be at least __host__
return 0;
}
// Outputs:
Construct 2 (from the host side)
Construct 3 (from the host side)
Now if you change your kernel to take ints instead of Test:
__global__ void dosome(int arga, int argb) {
// Constructors must be at least __device__
Test a(arga);
Test b(argb);
a.calc();
b.calc();
}
int main(int argc, char **argv) {
dosome<<<1, 2>>>(2, 3);
return 0;
}
// Outputs:
Construct 2 (from the device side)
Construct 3 (from the device side)
OK, I found that it works (the constructors are called) if I add both __host__ and __device__ qualifiers to the constructors. The construction of the objects happens host side, and then they are copied to the device (stack?). This is why the constructors weren't called before: they were device-only code (but then what was being called on the host side?!?)
Using both __host__ and __device__ on the constructors allowed the class to be used without problems.
EDIT: Still, I'm not sure if the construction always happens before the copy to device.
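A minimal sketch of that fix (my own condensed version of the Test struct above, not the asker's exact code): with both qualifiers, the implicit construction of the kernel arguments can happen on the host, and the objects are then copied to the device as plain data.
#include <cstdio>
struct Test {
    // Both qualifiers: callable on the host when the kernel arguments are
    // built, and callable in device code as well.
    __host__ __device__ Test(int n) : _n(n) {}
    __host__ __device__ Test(const Test &t) : _n(t._n) {}
    __device__ int calc() const { return threadIdx.x + 10 * _n; }
    int _n;
};
__global__ void dosome(Test a, Test b) {
    printf("Kernel data %d %d\n", a.calc(), b.calc());
}
int main() {
    dosome<<<1, 2>>>(2, 3);   // Test(int) now runs host side for each argument
    cudaDeviceSynchronize();
    return 0;
}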

CUDA function call-able by either the device or host

I have a reusable function in some CUDA code that needs to be called from both the device and the host. Is there an appropriate qualifier for this?
e.g. what's the correct definition for func1 in this case:
int func1 (int a, int b) {
return a+b;
}
__global__ void devicecode (float *A) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
A[i] = func1(i,i);
}
int main() {
// Normal cuda memory set-up
// Call func1 from inside main:
int j = func1(2,4);
// Normal cuda memory copy / program run / retrieve data
}
So far I can only get this to work by having the function twice: once explicitly for the device and once for the host. Is there a better way?
From the CUDA Programming Guide:
The __device__ and __host__ qualifiers can be used together however, in
which case the function is compiled for both the host and the device.
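A minimal sketch applying that to the question's func1 (the memory setup in main is illustrative, not taken from the question):
#include <cstdio>
// Compiled twice: once as a host function and once as a device function.
__host__ __device__ int func1(int a, int b) {
    return a + b;
}
__global__ void devicecode(float *A) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    A[i] = func1(i, i);      // calls the device copy
}
int main() {
    int j = func1(2, 4);     // calls the host copy
    printf("host: func1(2,4) = %d\n", j);
    float *A;
    cudaMalloc(&A, 8 * sizeof(float));
    devicecode<<<1, 8>>>(A); // the kernel fills A on the device
    cudaDeviceSynchronize();
    cudaFree(A);
    return 0;
}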