Temporary CUDA Device Arrays (C++)

I've been playing around with this grand CUDA experiment for a few months now, and I find myself experimenting more and trying to pull away from the tutorial examples.
My question is this: if I just want arrays on the GPU for temporary storage, without copying them back to the host for display or output, can I simply create a device array with __device__ double array[numpoints];? Then, for anything I do want to take back from the GPU, I need the whole cudaMalloc/cudaMemcpy spiel, right? Additionally, is there any difference between one method and the other? I thought they both create arrays in global memory.

See this description of the __device__ qualifier. If you declare the array with __device__, you cannot access it from the host through cudaMemcpy, but there are other functions for that mentioned in the link, such as cudaMemcpyToSymbol.
Instead, you can declare a global pointer (i.e., without __device__) in host code and allocate it with cudaMalloc. You can then use the same pointer with cudaMemcpy to copy the result back to the host.
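For completeness, a statically declared __device__ array can still be reached from the host, but only through the symbol-based copy functions. A minimal sketch (the array name and size here are just illustrative):

#include <stdio.h>

#define NUMPOINTS 16

// Statically allocated global memory array, resident on the device.
__device__ double devArray[NUMPOINTS];

__global__ void square_in_place() {
    devArray[threadIdx.x] *= devArray[threadIdx.x];
}

int main() {
    double h_data[NUMPOINTS];
    for (int i = 0; i < NUMPOINTS; i++) h_data[i] = (double)i;
    // Symbol-based copies replace plain cudaMemcpy for __device__ variables.
    cudaMemcpyToSymbol(devArray, h_data, sizeof(h_data));
    square_in_place<<<1, NUMPOINTS>>>();
    cudaMemcpyFromSymbol(h_data, devArray, sizeof(h_data));
    printf("h_data[3] = %f\n", h_data[3]); // expect 9.0
    return 0;
}

For purely temporary device-side scratch space there is no functional difference between the two: both live in global memory. The cudaMalloc route is just more flexible (the size can be chosen at runtime, and the pointer can be passed around freely).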

You can create, fill and use global memory arrays without using cudaMemcpy to copy initialization data from the host, if that is what you are asking. In the simple example below, I create a global memory array that is initialized directly on the device and then released when it is no longer needed.
#include <stdio.h>

// Initializes the temporary array directly on the device.
__global__ void init_temp_data(float* temp_data) {
    temp_data[threadIdx.x] = 3.f;
}

// Copies the temporary array into the output array, entirely in device memory.
__global__ void copy_global_data(float* temp_data, float* d_data) {
    d_data[threadIdx.x] = temp_data[threadIdx.x];
}

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

int main() {
    float* data = (float*)malloc(16 * sizeof(float));

    float* d_data;    gpuErrchk(cudaMalloc((void**)&d_data, 16 * sizeof(float)));
    float* temp_data; gpuErrchk(cudaMalloc((void**)&temp_data, 16 * sizeof(float)));

    // Fill the temporary array on the device; no host-to-device copy is needed.
    init_temp_data<<<1,16>>>(temp_data);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    copy_global_data<<<1,16>>>(temp_data, d_data);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    // The temporary array is released without ever touching the host.
    gpuErrchk(cudaFree(temp_data));

    gpuErrchk(cudaMemcpy(data, d_data, 16 * sizeof(float), cudaMemcpyDeviceToHost));
    for (int i = 0; i < 16; i++) printf("Element number %i is equal to %f\n", i, data[i]);

    getchar();
    return 0;
}

Related

Copy huge structure of arrays to GPU

I need to transform an existing SPH (Smoothed Particle Hydrodynamics) code into one that can run on a GPU.
Unfortunately, it has a lot of data structures that I need to copy from the CPU to the GPU. I already looked around on the web and thought I was doing the right thing in my copying code, but unfortunately I get an error (something about an unhandled exception).
In the debugger I saw that no data reaches the variables that should be copied to the GPU. It just says "The memory could not be read".
So here is an example of one data structure that needs to be copied to the GPU:
__device__ struct d_particle_data
{
    float Pos[3];       /*!< particle position at its current time */
    float PosMap[3];    /*!< initial boundary particle positions */
    float Mass;         /*!< particle mass */
    float Vel[3];       /*!< particle velocity at its current time */
    float GravAccel[3]; /*!< particle acceleration due to gravity */
} *d_P;
and I pass it on the GPU with the following:
cudaMalloc((void**)&d_P, N*sizeof(sph_particle_data));
cudaMemcpy(d_P, P, N*sizeof(d_sph_particle_data), cudaMemcpyHostToDevice);
The host data structure P looks the same as the data structure d_P. Can anybody help me?
EDIT
So, here's a pretty small part of that code:
First, the headers I have to use in the code:
Allvars.h: Variables that I need on the host
struct particle_data
{
    float a;
    float b;
} *P;
proto.h: Header with all the functions
extern void main_GPU(int N, int Ntask);
Allvars_gpu.h: all the variables that have to be on the GPU
__device__ struct d_particle_data
{
    float a;
    float b;
} *d_P;
So now I call the .cu file from the .cpp file:
hydra.cpp:
#include <stdio.h>
#include <cuda_runtime.h>
extern "C" {
#include "proto.h"
}

int main(void) {
    int N_gas = 100; // Number of particles
    int NTask = 1;   // Number of CPUs (the code has MPI stuff included)
    main_GPU(N_gas, NTask);
    return 0;
}
Now, the action takes place in the .cu-File:
hydro_gpu.cu:
#include <cuda_runtime.h>
#include <stdio.h>
extern "C" {
#include "Allvars_gpu.h"
#include "allvars.h"
#include "proto.h"
}
__device__ void hydro_evaluate(int target, int mode, struct d_particle_data *P) {
    int c = 5;
    float a, b;
    a = P[target].a;
    b = P[target].b;
    P[target].a = a + c;
    P[target].b = b + c;
}

__global__ void hydro_particle(struct d_particle_data *P) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    hydro_evaluate(i, 0, P);
}

void main_GPU(int N, int Ntask) {
    int Blocks;
    cudaMalloc((void**)&d_P, N*sizeof(d_particle_data));
    cudaMemcpy(d_P, P, N*sizeof(d_particle_data), cudaMemcpyHostToDevice);
    Blocks = (N+N-1)/N;
    hydro_particle<<<Blocks,N>>>(d_P);
    cudaMemcpy(P, d_P, N*sizeof(d_particle_data), cudaMemcpyDeviceToHost);
    cudaFree(d_P);
}
The really short answer is: don't declare *d_P as a static __device__ symbol. Such symbols cannot be passed as device pointer arguments to cudaMalloc, cudaMemcpy, or kernel launches, and your use of __device__ here is both unnecessary and incorrect.
If you make that change, the code should start working. Note that I haven't attempted to compile your MCVE code, and there may well be other problems; this answer addresses the immediate error.
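For illustration, a minimal, self-contained sketch of the suggested fix, with d_P as an ordinary host-side pointer variable (names and initialization simplified from the question):

#include <stdio.h>
#include <cuda_runtime.h>

// Plain struct: no __device__ qualifier on the type or the pointers.
struct particle_data { float a, b; };
particle_data *P;   // host copy
particle_data *d_P; // device allocation; the pointer variable itself lives on the host

__global__ void hydro_particle(particle_data *p) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    p[i].a += 5.f;
    p[i].b += 5.f;
}

int main() {
    const int N = 100;
    P = (particle_data*)malloc(N * sizeof(particle_data));
    for (int i = 0; i < N; i++) { P[i].a = (float)i; P[i].b = 2.f * i; }

    cudaMalloc((void**)&d_P, N * sizeof(particle_data));
    cudaMemcpy(d_P, P, N * sizeof(particle_data), cudaMemcpyHostToDevice);
    hydro_particle<<<1, N>>>(d_P);
    cudaMemcpy(P, d_P, N * sizeof(particle_data), cudaMemcpyDeviceToHost);
    cudaFree(d_P);

    printf("P[10].a = %f\n", P[10].a); // expect 15.0
    free(P);
    return 0;
}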

Device pointer in a device class (Cuda C++)

I would like to implement a device side vector class which encapsulates a pointer to the elements of the container.
After I instantiate an object of this class, I have no access to the pointer inside it. It always says 'Access violation writing location (some device memory address)'.
My code is the following:
#include <iostream>
#include <cuda_runtime.h>

template <typename T>
class DeviceVector
{
private:
    T* m_bValues;
    std::size_t m_bSize;
public:
    __host__
    void* operator new(std::size_t size)
    {
        DeviceVector<T>* object = nullptr;
        cudaMalloc((void**)&object, size);
        return object;
    }

    __host__
    void operator delete(void* object)
    {
        cudaFree(object);
    }

    __host__
    DeviceVector(std::size_t size = 1)
    {
        cudaMemcpy(&m_bSize, &size, sizeof(std::size_t), cudaMemcpyHostToDevice);
        // At this cudaMalloc I get Access violation writing location...
        cudaMalloc((void**)&m_bValues, size * sizeof(T));

        // It's an alternative solution here
        T* ptr;
        cudaMalloc((void**)&ptr, size * sizeof(T));
        cudaMemcpy(&m_bValues, &ptr, sizeof(T*), cudaMemcpyHostToDevice);
        // The memory is allocated
        // But I can't access it through m_bValues pointer
        // It is also Access violation writing location...
    }

    __host__
    ~DeviceVector()
    {
        // Access violation here if I use the second solution in the constructor
        cudaFree(m_bValues);
    }
};

int main()
{
    DeviceVector<int>* vec = new DeviceVector<int>();
    delete vec;
    return 0;
}
Note:
I have access to the size attribute.
So my questions are:
How to allocate memory for this class to get access to the pointer inside?
Is this even possible to encapsulate a pointer into a class on the device?
This line is illegal:
cudaMalloc((void**)&m_bValues, size * sizeof(T));
because your new operator allocated the object on the device:
cudaMalloc((void**)&object, size);
return object;
and the constructor was called to operate on that allocation. Therefore &m_bValues takes the address of a device variable in host code, which is illegal in CUDA. If you do that and then attempt to use the result in host code (i.e., in the cudaMalloc operation), you're going to get a segfault. cudaMalloc creates a device allocation of a particular size and then stores the device pointer to that allocation in a variable that is expected to be resident on the host. If you instead pass it a device address to store that pointer into, cudaMalloc will segfault trying to write the pointer value.
Your alternative solution is a somewhat better approach, and is the general idea when it's necessary to copy a pointer to a device allocation to a variable resident on the device.
But you've still basically made the allocation that m_bValues points to inaccessible from the host. (ptr, being a temporary variable, won't help, and creating another variable in the class to hold a value like ptr won't help either because the entire class is allocated and resident on the device.) For the same reason that you're not allowed to use &m_bValues in the previous cudaMalloc operation, you won't be able to use it directly in any other host code (except as the target for cudaMempcy host->device when copying the pointer value itself).
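When a pointer member inside a device-resident object really is needed, the usual pattern is to patch that member from the host. A sketch with a hypothetical standard-layout struct (DevArray and its members are illustrative):

#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical struct living entirely in device memory.
struct DevArray {
    float *data;
    size_t size;
};

__global__ void fill(DevArray *a) {
    a->data[threadIdx.x] = 1.f; // the patched pointer is usable in kernels
}

int main() {
    DevArray *d_arr;
    cudaMalloc(&d_arr, sizeof(DevArray));   // the object itself
    float *d_buf;
    cudaMalloc(&d_buf, 16 * sizeof(float)); // the buffer it should own

    // &d_arr->data only computes an address (object base plus member offset),
    // so forming it on the host is fine; only the cudaMemcpy touches the device.
    size_t n = 16;
    cudaMemcpy(&d_arr->data, &d_buf, sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(&d_arr->size, &n, sizeof(size_t), cudaMemcpyHostToDevice);

    fill<<<1, 16>>>(d_arr);
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    cudaFree(d_arr);
    return 0;
}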
I don't think there are any simple fixes for this. I suggest re-crafting the object to live on the host, and provide appropriate host- and device-side allocations for corresponding pointers and parameters (like size).
It also seems like you're re-inventing the wheel. You might want to investigate thrust device vectors (which are easily usable with ordinary CUDA code.)
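For example, a thrust::device_vector already owns a device allocation, handles the copy semantics, and exposes a raw pointer for use in ordinary kernels. A minimal sketch:

#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <stdio.h>

__global__ void times_two(int *d, int n) {
    int i = threadIdx.x;
    if (i < n) d[i] *= 2;
}

int main() {
    thrust::device_vector<int> vec(16);       // device allocation owned by the vector
    thrust::fill(vec.begin(), vec.end(), 21); // filled on the device
    int *raw = thrust::raw_pointer_cast(vec.data());
    times_two<<<1, 16>>>(raw, (int)vec.size());
    cudaDeviceSynchronize();
    int first = vec[0];                       // implicit device-to-host copy
    printf("first element = %d\n", first);    // expect 42
    return 0;
}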
Anyway, this was the closest I could come up with:
#include <iostream>
#include <cuda_runtime.h>
#include <stdio.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

template <typename T>
class DeviceVector
{
private:
    T* m_bValues;
    std::size_t m_bSize;
    std::size_t eleSize;
public:
    __host__
    void* operator new(std::size_t size)
    {
        DeviceVector<T>* object = NULL;
        object = (DeviceVector<T> *)malloc(size * sizeof(DeviceVector<T>));
        return object;
    }

    __host__
    void operator delete(void* object)
    {
        free(object);
    }

    __host__
    DeviceVector(std::size_t size = 1)
    {
        m_bSize = size;
        eleSize = sizeof(T);
        cudaMalloc(&m_bValues, m_bSize * sizeof(T));
        cudaCheckErrors("constructor cudaMalloc fail");
        cudaMemset(m_bValues, 0, m_bSize * sizeof(T));
    }

    __host__
    ~DeviceVector()
    {
        cudaFree(m_bValues);
        cudaCheckErrors("destructor cudaFree fail");
    }

    __host__
    T* getDevPtr() { return m_bValues; }

    __host__
    std::size_t getSize() { return m_bSize; }

    __host__
    std::size_t geteleSize() { return eleSize; }
};

int main()
{
    DeviceVector<int>* vec = new DeviceVector<int>();
    cudaMemset(vec->getDevPtr(), 0xFF, vec->getSize() * vec->geteleSize());
    cudaCheckErrors("vector fill fail");
    delete vec;
    return 0;
}
You've shown very little about how you want to interact with an object of this class, so I'm just guessing here.

CUDA curand "An illegal memory access was encountered"

I've been spending a lot of time trying to figure out the cause of this problem. The following code attempts to generate a sequence of normally distributed random variables using curand on the device. It seems to generate a few successfully, but then crashes with an "an illegal memory access was encountered" error. Any help is much appreciated.
main.cu
#include <stdio.h>
#include <cuda.h>
#include <curand_kernel.h>

class A {
public:
    __device__ A(const size_t& seed) {
        printf("\nA()");
        curandState state;
        curand_init(seed, 0, 0, &state);
        for (size_t i = 0; i < 1000; ++i)
            printf("\n%f", curand_normal(&state));
    }
    __device__ ~A() { printf("\n~A()"); }
};

/// Kernel
__global__ void kernel(const size_t& seed) {
    printf("\nHello from Kernel...");
    A a(seed);
    return;
}

int main(void) {
    kernel<<<1,1>>>(1);
    cudaError_t cudaerr = cudaDeviceSynchronize();
    if (cudaerr != CUDA_SUCCESS)
        printf("kernel launch failed with error \"%s\".\n",
               cudaGetErrorString(cudaerr));
    return 0;
}
Output
Hello from Kernel...
A()
0.292537
-0.718359
0.958011
0.633711kernel launch failed with error "an illegal memory access was encountered".
I have run this both on my machine (CUDA 7.0) and on a supercomputing cluster (CUDA 6.5), and the same result unfolds.
Get rid of the pass-by-reference on the kernel parameter (&).
You are not allowed to write GPU kernels that have pass-by-reference parameters: the reference carries a host address, which the GPU kernel cannot dereference. (This ignores Unified Memory, zero-copy, and related mechanisms, which are not at issue here.)
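A minimal sketch of the fix, with the reference removed so the kernel receives its own copy of the seed (the class and long print loop from the question are trimmed down here):

#include <stdio.h>
#include <curand_kernel.h>

__global__ void kernel(const size_t seed) { // by value, not const size_t&
    curandState state;
    curand_init(seed, 0, 0, &state);
    for (size_t i = 0; i < 4; ++i)
        printf("%f\n", curand_normal(&state));
}

int main(void) {
    kernel<<<1,1>>>(1);
    cudaError_t cudaerr = cudaDeviceSynchronize();
    if (cudaerr != cudaSuccess)
        printf("kernel launch failed with error \"%s\".\n", cudaGetErrorString(cudaerr));
    return 0;
}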

How to access a class from one cuda kernel in the next kernel

I have a device pointer which I use to allocate space on the device for a class defined in a header.
Neu *dev_NN;
cudaStatus = cudaMalloc((void**)&dev_NN, sizeof(Neu));
Then I call a kernel which initialises the class on the GPU.
KGNN<<<1, threadsPerBlock>>>(dev_LaySze, dev_NN);
In the kernel:
__global__ void KGNN(int * dev_LaySze, Neu * NN)
{
    ...
    NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
}
After the return of this kernel I want to use another kernel to input data to class methods and retrieve output data (the allocators and copies are already done and work), such as
__global__ void KGFF(double *dev_inp, double *dev_outp, int *DataSize)
{
    int i = threadIdx.x;
    ...
    NN.Analyse(dev_inp, dev_outp, DataSize);
}
The second kernel knows nothing about the class that was created. As you would expect NN is unrecognised. How do I access the first NN without re-creating the class and re-initialising it? The second kernel has to be called several times, remembering the changes it made to the class variables earlier. I don't want to use the class with the CPU, only the GPU, and I don't want to pass it back and forth each time.
I don't think this has anything to do with CUDA, actually. I believe a similar problem would be observed if you tried this in ordinary C++ (assuming the pointer to NN is not a global variable).
The key aspect of the solution, as pointed out by Park Young-Bae, is simply to pass the pointer to the allocated space for NN to both kernels. There were a few other changes that I think needed to be made to what you have shown, according to my understanding of what you are trying to do (since you haven't posted complete code). Here's a fully worked example:
$ cat t635.cu
#include <stdio.h>

class MC {
    int md;
public:
    __host__ __device__ int get_md() { return md; }
    __host__ __device__ MC(int val) { md = val; }
};

__global__ void kernel1(MC *d) {
    *d = MC(3);
}

__global__ void kernel2(MC *d) {
    printf("val = %d\n", d->get_md());
}

int main() {
    MC *d_obj;
    cudaMalloc(&d_obj, sizeof(MC));
    kernel1<<<1,1>>>(d_obj);
    kernel2<<<1,1>>>(d_obj);
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -arch=sm_20 -o t635 t635.cu
$ ./t635
val = 3
$
The other changes I suggest:
in your first kernel, you're passing a pointer (NN) (for which you have presumably made a device allocation), and then creating an object and copying it to the allocated space. In that case I think you need:
*NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
in your second kernel, if NN is a pointer, we must use:
NN->Analyse(dev_inp, dev_outp, DataSize );
I have made those two changes to my posted example. Again, I think this is all just C++ mechanics, not anything specific to CUDA.
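Applied back to the kernels in the question (assuming Neu's constructor and Analyse have the signatures shown there, so this is only a sketch), the two changes would look like:

__global__ void KGNN(int *dev_LaySze, Neu *NN)
{
    // construct into the pre-allocated device memory that NN points to
    *NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
}

__global__ void KGFF(Neu *NN, double *dev_inp, double *dev_outp, int *DataSize)
{
    // NN is a pointer, so member access uses ->
    NN->Analyse(dev_inp, dev_outp, DataSize);
}

Both launches then receive the same dev_NN pointer, so the second kernel sees whatever state the first one wrote.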

Implicit constructor in CUDA kernel call

I'm trying to pass some POD values to a kernel whose parameters are non-POD types with non-explicit constructors. The idea behind this is: allocate some memory on the host, pass the memory to the kernel, and have it encapsulated in objects without the user doing that step explicitly.
The constructors are marked as __device__ code, but they are not called when the parameters are passed, and I can't figure out why.
My question is not really related about how should I do the thing, but trying to understand what's happening behind the scenes.
Here is an example (I'm using CUDA 5 with a GPU of compute capability 2.1, hence the printf).
#include <stdio.h>

struct Test {
    __device__ Test() {
        printf("Default\n");
        _n = 0;
    }
    __device__ Test(int n) {
        printf("Construct %d\n", n);
        _n = n;
    }
    __device__ Test(const Test &t) {
        printf("Copy constr %d\n", t._n);
        _n = t._n;
    }
    __device__ Test &operator=(const Test &t) {
        printf("Assignment %d\n", t._n);
        _n = t._n;
        return *this;
    }
    __device__ int calc() const {
        printf("Calculating %d\n", threadIdx.x + 10 * _n);
        return threadIdx.x + 10 * _n;
    }
    int _n;
};

__global__ void dosome(Test a, Test b) {
    printf("Kernel data %d %d\n", a._n, b._n);
    a.calc();
    b.calc();
}

int main(int argc, char **argv) {
    dosome<<<1, 2>>>(2, 3);
    cudaError_t cudaerr = cudaDeviceSynchronize();
    if (cudaerr != cudaSuccess)
        printf("kernel launch failed with error:\n\t%s\n", cudaGetErrorString(cudaerr));
    return 0;
}
EDIT: I forgot to say that none of the constructor messages is printed, but the calc and kernel messages are.
EDIT2: Is it guaranteed that CUDA will initialize a Test object before copying it on the device?
You have to look at a constructor just like any other method. If you qualify it with __host__, you'll be able to call it host side. If you qualify it with __device__, you'll be able to call it device side. If you qualify it with both, you'll be able to call it on both sides.
What happens when you do dosome<<<1, 2>>>(2, 3); is that the two objects are implicitly constructed on the host side (because your constructor is not explicit, which may be confusing you too) and then memcpy'd to the device. There is no copy constructor involved in the process.
Let's illustrate this:
__global__ void dosome(Test a, Test b) {
    a.calc();
    b.calc();
}

int main(int argc, char **argv) {
    dosome<<<1, 2>>>(2, 3); // Constructors must be at least __host__
    return 0;
}

// Outputs:
Construct 2 (from the host side)
Construct 3 (from the host side)
Now if you change your kernel to take ints instead of Test:
__global__ void dosome(int arga, int argb) {
    // Constructors must be at least __device__
    Test a(arga);
    Test b(argb);
    a.calc();
    b.calc();
}

int main(int argc, char **argv) {
    dosome<<<1, 2>>>(2, 3);
    return 0;
}

// Outputs:
Construct 2 (from the device side)
Construct 3 (from the device side)
OK, I found that it works (the constructors are called) if I add both __host__ and __device__ qualifiers to the constructors. The objects are constructed on the host side and then copied to the device (stack?). That's why the constructors weren't called before: they were device-only code (but then what was being called on the host side?!?).
Using both __host__ and __device__ in the constructors allowed the class to be used without problems.
EDIT: Still, I'm not sure whether construction always happens before the copy to the device.
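For reference, a minimal sketch of the working version, with both qualifiers on the constructor (the print statements from the question are trimmed):

#include <stdio.h>

struct Test {
    __host__ __device__ Test(int n) : _n(n) { }
    __device__ int calc() const { return threadIdx.x + 10 * _n; }
    int _n;
};

__global__ void dosome(Test a, Test b) {
    printf("Kernel data %d %d\n", a.calc(), b.calc());
}

int main() {
    dosome<<<1, 2>>>(2, 3); // Test(int) is now callable host side for the implicit conversions
    cudaDeviceSynchronize();
    return 0;
}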