Copying structure containing 2d pointer to device

Copying structure containing 2d pointer to device - c++

I have a question-related to copying structure containing 2D pointer to the device from the host, my code is as follow
struct mymatrix
{
matrix m;
int x;
};
size_t pitch;
mymatrix m_h[5];
for(int i=0; i<5;i++){
m_h[i].m = (float**) malloc(4 * sizeof(float*));
for (int idx = 0; idx < 4; ++idx)
{
m_h[i].m[idx] = (float*)malloc(4 * sizeof(float));
}
}
mymatrix *m_hh = (mymatrix*)malloc(5*sizeof(mymatrix));
memcpy(m_hh,m_h,5*sizeof(mymatrix));
for(int i=0 ; i<5 ;i++)
{
cudaMallocPitch((void**)&(m_hh[i].m),&pitch,4*sizeof(float),4);
cudaMemcpy2D(m_hh[i].m, pitch, m_h[i].m, 4*sizeof(float), 4*sizeof(float),4,cudaMemcpyHostToDevice);
}
mymatrix *m_d;
cudaMalloc((void**)&m_d,5*sizeof(mymatrix));
cudaMemcpy(m_d,m_hh,5*sizeof(mymatrix),cudaMemcpyHostToDevice);
distance_calculation_begins<<<1,16>>>(m_d,pitch);
Problem
With this code I am unable to access 2D pointer elements of the structure, but I can access x from that structure in device. e.g. such as I have receive m_d with pointer mymatrix* m if I initialize
m[0].m[0][0] = 5;
and printing this value such as
cuPrintf("The value is %f",m[0].m[0][0]);
in the device, I get no output. Means I am unable to use 2D pointer, but if I try to access
m[0].x = 5;
then I am able to print this. I think my initializations are correct, but I am unable to figure out the problem. Help from anyone will be greatly appreciated.

In addition to the issues that #RobertCrovella noted on your code, also note:
You are only getting a shallow copy of your structure with the memcpy that copies m_h to m_hh.
You are assuming that pitch is the same in all calls to cudaMemcpy2D() (you overwrite the pitch and use only the latest copy at the end). I think that might be safe assumption for now but it could change in the future.
You are using cudaMemcpyHostToDevice() with cudaMemcpyHostToDevice to copy to m_hh, which is on the host, not the device.
Using many small buffers and tables of pointers is not efficient in CUDA. The small allocations and deallocations can end up taking a lot of time. Also, using tables of pointers cause extra memory transactions because the pointers must be retrieved from memory before they can be used as bases for indexing. So, if you consider a construct such as this:
a[10][20][30] = 3
The pointer at a[10] must first be retrieved from memory, causing your warp to be put on hold for a long time (up to around 600 cycles on Fermi). Then, the same thing happens for the second pointer, adding another 600 cycles. In addition, these requests are unlikely to be coalesced causing even more memory transactions.
As Robert mentioned, the solution is to flatten your memory structures. I've included an example for this, which you may be able to use as a basis for your program. As you can see, the code is overall much simpler. The part that does become a bit more complex is the index calculations. Also, this approach assumes that your matrixes are all of the same size.
I have added error checking as well. If you had added error checking in your code, you would have found at least a couple of the bugs without any extra effort.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
typedef float* mymatrix;
const int n_matrixes(5);
const int w(4);
const int h(4);
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void test(mymatrix m_d, size_t pitch_floats)
{
// Print the value at [2][3][4].
printf("%f ", m_d[3 + (2 * h + 4) * pitch_floats]);
}
int main()
{
mymatrix m_h;
gpuErrchk(cudaMallocHost(&m_h, n_matrixes * w * sizeof(float) * h));
// Set the value at [2][3][4].
m_h[2 * (w * h) + 3 + 4 * w] = 5.0f;
// Create a device copy of the matrix.
mymatrix m_d;
size_t pitch;
gpuErrchk(cudaMallocPitch((void**)&m_d, &pitch, w * sizeof(float), n_matrixes * h));
gpuErrchk(cudaMemcpy2D(m_d, pitch, m_h, w * sizeof(float), w * sizeof(float), n_matrixes * h, cudaMemcpyHostToDevice));
test<<<1,1>>>(m_d, pitch / sizeof(float));
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
}

Your matrix m class/struct member appears to be some sort of double pointer based on how you are initializing it on the host:
m_h[i].m = (float**) malloc(4 * sizeof(float*));
Copying an array of structures with embedded pointers between host and device is somewhat compilicated. Copying a data structure that is pointed to by a double pointer is also complicated.
For an array of structures with embedded pointers, refer to this posting.
For copying a 2D array (double pointer, i.e. **), refer to this posting. We don't use cudaMallocPitch/cudaMemcpy2D to accomplish this. (Note that cudaMemcpy2D takes single pointer * arguments, you are passing it double pointer ** arguments e.g. m_h[i].m)
Instead of the above approaches, it's recommended that you flatten your data so that it can all be referenced with single pointer referencing, with no embedded pointers.

Related

correct way of copying and printing 2dim array on CUDA device [duplicate]

I just started CUDA programming, and was trying to execute the code shown below. The idea is to copy a 2dimensional array to the device, calculate the sum of all elements and to retrieve the sum afterwards (I know that this algorithm is not parallelized. In fact it is doing more work, then necessary. This is however just intended as practice for memcopy).
#include<stdio.h>
#include<cuda.h>
#include <iostream>
#include <cutil_inline.h>
#define height 50
#define width 50
using namespace std;
// Device code
__global__ void kernel(float* devPtr, int pitch,int* sum)
{
int tempsum = 0;
for (int r = 0; r < height; ++r) {
int* row = (int*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c) {
int element = row[c];
tempsum = tempsum + element;
}
}
*sum = tempsum;
}
//Host Code
int main()
{
int testarray[2][8] = {{4,4,4,4,4,4,4,4},{4,4,4,4,4,4,4,4}};
int* sum =0;
int* sumhost = 0;
sumhost = (int*)malloc(sizeof(int));
cout << *sumhost << endl;
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(int), height);
cudaMemcpy2D(devPtr,pitch,testarray,0,8* sizeof(int),4,cudaMemcpyHostToDevice);
cudaMalloc((void**)&sum, sizeof(int));
kernel<<<1, 4>>>(devPtr, pitch, sum);
cutilCheckMsg("kernel launch failure");
cudaMemcpy(sumhost, sum, sizeof(int), cudaMemcpyDeviceToHost);
cout << *sumhost << endl;
return 0;
}
This code compiles just fine (on the 4.0 sdk release candidate). However as soon as I try to execute, I get
0
cpexample.cu(43) : cutilCheckMsg() CUTIL CUDA error : kernel launch failure : invalid pitch argument.
Which is unfortunate, since I have no idea how to fix it ;-(. As far as I know, the pitch is an offset in memory to allow faster copying of data. However such a pitch is only used in the device memory, not in the host memory, isn't it? Therefore the pitch of my host memory should be 0, shouldn't it?
Moreover I would also like to ask two other questions:
If i declare a variable like int* sumhost (see above), where does this pointer point to? At first to the host memory and after cudaMalloc to the device memory?
cutilCheckMsg was very handy in this case. Are there similar functions for debugging i should know of?

In this line of your code:
cudaMemcpy2D(devPtr,pitch,testarray,0,8* sizeof(int),4,cudaMemcpyHostToDevice);
you're saying the source-pitch value for testarray is equal to 0, but how can that be possible when the formula for pitch is T* elem = (T*)((char*)base_address + row * pitch) + column? If we substituted a value of 0 for pitch in that formula, we will not get the right values when looking up an address at some 2-dimensional (x,y) ordered pair offset. One thing to consider is that the rule for the pitch value is pitch = width + padding. On the host, the padding is often equal to 0, but the width is not 0 unless there is nothing in your array. On the hardware side there may be extra padding, which is why the value for pitch may not equal the declared width of the array. Therefore you can conclude that pitch >= width depending on the padding value. So even on the host-side, the value for the source pitch should be at least the size of each row in bytes, meaning in the case of testarray, it should be 8*sizeof(int). Finally, the height of your 2D array in the host is also only 2 rows, not 4.
As an answer to your question about what happens with allocated pointers, if you allocate a pointer with malloc(), then the pointer is given an address value that resides in host memory. So you can dereference it on the host-side, but not on the device side. On the other-hand, a pointer allocated with cudaMalloc() is given a pointer to memory residing on the device. Therefore if you dereference it on the host, it's not pointing to allocated memory on the host, and unpredictable results will ensue. It is okay though to pass this pointer address to the kernel on the device, since when it's dereferenced on the device-side, it's pointing to memory locally accessible to the device. Overall the CUDA runtime keeps these two memory locations separate, providing memory copy functions that will copy back and forth between the device and host, and use the address values from these pointers as the source and-or destination for the copy depending on the desired direction (host-to-device or device-to-host). Now if you took the same int*, and first allocated it with malloc(), and then (after hopefully calling free() on the pointer) with cudaMalloc(), your pointer would first have an address that pointed to host memory, and then device memory. You would have to keep track of its state in-order to avoid unpredictable results from dereferencing an address that was on the device or host depending on whether it was dereferenced in host code or device code.

CUDA cudaMemCpy doesn't appear to copy despite CudaSuccess

I'm just starting with CUDA and this is my very first project. I've done a search for this issue and while I've noticed other people have had similar problems, none of the suggestions seemed relevant to my specific issue or have helped in my case.
As an exercise, I'm trying to write an n-body simulation using CUDA. At this stage I'm not interested whether my specific implementation is efficient or not, I'm just looking for something that works and I can refine it later. I'll also need to update the code later, once it's working, to work on my SLI configuration.
Here's a brief outline of the process:
Create X and Y position, velocity, acceleration vectors.
Create same vectors on GPU and copy values across
In a loop: (i) calculate acceleration for the iteration, (ii) apply acceleration to velocities and positions, and (iii) copy positions back to host for display.
(Display not implemented yet. I'll do this later)
Don't worry about the acceleration calculation function for now, here is the update function:
__global__ void apply_acc(double* pos_x, double* pos_y, double* vel_x, double* vel_y, double* acc_x, double* acc_y, int N)
{
int i = threadIdx.x;
if (i < N);
{
vel_x[i] += acc_x[i];
vel_y[i] += acc_y[i];
pos_x[i] += vel_x[i];
pos_y[i] += vel_y[i];
}
}
And here's some of the code in the main method:
cudaError t;
t = cudaMalloc(&d_pos_x, N * sizeof(double));
t = cudaMalloc(&d_pos_y, N * sizeof(double));
t = cudaMalloc(&d_vel_x, N * sizeof(double));
t = cudaMalloc(&d_vel_y, N * sizeof(double));
t = cudaMalloc(&d_acc_x, N * sizeof(double));
t = cudaMalloc(&d_acc_y, N * sizeof(double));
t = cudaMemcpy(d_pos_x, pos_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_pos_y, pos_y, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_vel_x, vel_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_vel_y, vel_y, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_acc_x, acc_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_acc_y, acc_y, N * sizeof(double), cudaMemcpyHostToDevice);
while (true)
{
calc_acc<<<1, N>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);
apply_acc<<<1, N>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);
t = cudaMemcpy(pos_x, d_pos_x, N * sizeof(double), cudaMemcpyDeviceToHost);
t = cudaMemcpy(pos_y, d_pos_y, N * sizeof(double), cudaMemcpyDeviceToHost);
std::cout << pos_x[0] << std::endl;
}
Every loop, cout writes the same value, whatever random value it was set to when the position arrays were original created. If I change the code in apply_acc to something like:
__global__ void apply_acc(double* pos_x, double* pos_y, double* vel_x, double* vel_y, double* acc_x, double* acc_y, int N)
{
int i = threadIdx.x;
if (i < N);
{
pos_x[i] += 1.0;
pos_y[i] += 1.0;
}
}
then it still gives the same value, so either apply_acc isn't being called or the cudaMemcpy isn't copying the data back.
All the cudaMalloc and cudaMemcpy calls return cudaScuccess.
Here's a PasteBin link to the complete code. It should be fairly simple to follow as there's a lot of repetition for the various arrays.
Like I said, I've never written CUDA code before, and I wrote this based on the #2 CUDA example video from NVidia where the guy writes the parallel array addition code. I'm not sure if it makes any difference, but I'm using 2x GTX970's with the latest NVidia drivers and CUDA 7.0 RC, and I chose not to install the bundled drivers when installing CUDA as they were older than what I had.

This won't work:
const int N = 100000;
...
calc_acc<<<1, N>>>(...);
apply_acc<<<1, N>>>(...);
The second parameter of a kernel launch config (<<<...>>>) is the threads per block parameter. It is limited to either 512 or 1024 depending on how you are compiling. These kernels will not launch, and the type of error this produces needs to be caught by using correct CUDA error checking. Simply looking at the return values of subsequent CUDA API functions will not indicate the presence of this type of error (which is why you are seeing cudaSuccess subsequently).
Regarding the concept itself, I suggest you learn more about CUDA thread and block hierarchy. To launch a large number of threads, you need to use both parameters (i.e. niether of the first two parameters should be 1) of the kernel launch config. This is usually advisable from a performance perspective as well.

How to allocate Array of Pointers and preserve them for multiple kernel calls in cuda

I am trying to implement an algorithm in cuda and I need to allocate an Array of Pointers that point to an Array of Structs. My struct is, lets say:
typedef struct {
float x, y;
} point;
I know that If I want to preserve the arrays for multiple kernel calls I have to control them from the host, is that right? The initialization of the pointers must be done from within the kernel. To be more specific, the Array of Struct P will contain random order of cartesian points while the dev_S_x will be a sorted version as to x coordinate of the points in P.
I have tried with:
__global__ void test( point *dev_P, point **dev_S_x) {
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
dev_P[tid].x = 3.141516;
dev_P[tid].y = 3.141516;
dev_S_x[tid] = &dev_P[tid];
...
}
and:
int main( void ) {
point *P, *dev_P, **S_x, *dev_S_x;
P = (point*) malloc (N * sizeof (point) );
S_x = (point**) malloc (N * sizeof (point*));
// allocate the memory on the GPU
cudaMalloc( (void**) &dev_P, N * sizeof(point) );
cudaMalloc( (void***) &dev_S_x, N * sizeof(point*));
// copy the array P to the GPU
cudaMemcpy( dev_P, P, N * sizeof(point), cudaMemcpyHostToDevice);
cudaMemcpy( dev_S_x,S_x,N * sizeof(point*), cudaMemcpyHostToDevice);
test <<<1, 1 >>>( dev_P, &dev_S_x);
...
return 0;
}
which leads to many
First-chance exception at 0x000007fefcc89e5d (KernelBase.dll) in Test_project_cuda.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0020f920..
Critical error detected c0000374
Am I doing something wrong in the cudamalloc of the array of pointers or is it something else? Is the usage of (void***) correct? I would like to use for example dev_S_x[tid]->x or dev_S_x[tid]->y from within the kernels pointing to device memory addresses. Is that feasible?
Thanks in advance

dev_S_x should be declared as point ** and should be passed to the kernel as a value (i.e. test <<<1, 1 >>>(dev_P, dev_S_x);).
Putting that to one side, what you describe sounds like a natural fit for Thrust, which will give you a simpler memory management strategy and access to fast sort routines.

Access vector of pointers to other vectors on a GPU

so this is a followup to a question i had, at the moment in a CPU version of some Code, i have many things that look like the following:
for(int i =0;i<N;i++){
dgemm(A[i], B[i],C[i], Size[i][0], Size[i][1], Size[i][2], Size[i][3], 'N','T');
}
where A[i] will be a 2D matrix of some size.
I would like to be able to do this on a GPU using CULA (I'm not just doing multiplies, so i need the Linear ALgebra operations in CULA), so for example:
for(int i =0;i<N;i++){
status = culaDeviceDgemm('T', 'N', Size[i][0], Size[i][0], Size[i][0], alpha, GlobalMat_d[i], Size[i][0], NG_d[i], Size[i][0], beta, GG_d[i], Size[i][0]);
}
but I would like to store my B's on the GPU in advance at the start of the program as they dont change, so I need to have a vector that contains pointers to the set of vectors that make up my B's.
i currently have the following code that compiles:
double **GlobalFVecs_d;
double **GlobalFPVecs_d;
extern "C" void copyFNFVecs_(double **FNFVecs, int numpulsars, int numcoeff){
cudaError_t err;
GlobalFPVecs_d = (double **)malloc(numpulsars * sizeof(double*));
err = cudaMalloc( (void ***)&GlobalFVecs_d, numpulsars*sizeof(double*) );
checkCudaError(err);
for(int i =0; i < numpulsars;i++){
err = cudaMalloc( (void **) &(GlobalFPVecs_d[i]), numcoeff*numcoeff*sizeof(double) );
checkCudaError(err);
err = cudaMemcpy( GlobalFPVecs_d[i], FNFVecs[i], sizeof(double)*numcoeff*numcoeff, cudaMemcpyHostToDevice );
checkCudaError(err);
}
err = cudaMemcpy( GlobalFVecs_d, GlobalFPVecs_d, sizeof(double*)*numpulsars, cudaMemcpyHostToDevice );
checkCudaError(err);
}
but if i now try and access it with:
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid;//((G + dimBlock.x - 1) / dimBlock.x,(N + dimBlock.y - 1) / dimBlock.y);
dimGrid.x=(numcoeff + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (numcoeff + dimBlock.y - 1)/dimBlock.y;
for(int i =0; i < numpulsars; i++){
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
}
it seg faults here, is this not how to get at the data?
The kernal function that i'm calling is just:
__global__ void CopyPPFNF(double *FNF_d, double *PPFNF_d, int numpulsars, int numcoeff, int thispulsar) {
// Each thread computes one element of C
// by accumulating results into Cvalue
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int subrow=row-thispulsar*numcoeff;
int subcol=row-thispulsar*numcoeff;
__syncthreads();
if(row >= (thispulsar+1)*numcoeff || col >= (thispulsar+1)*numcoeff) return;
if(row < thispulsar*numcoeff || col < thispulsar*numcoeff) return;
FNF_d[row * numpulsars*numcoeff + col] += PPFNF_d[subrow*numcoeff+subcol];
}
What am i not doing right? Note eventually I would also like to do as the first example, calling cula functions on each GlobalFVecs_d[i], but for now not even this works.
Do you think this is the best way to go about doing this? If it were possible to just pass CULA functions a slice of a large continuous vector I could do that to, but i don't know if it supports that.
Cheers
Lindley

change this:
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
to this:
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFPVecs_d[i], numpulsars, numcoeff, i);
and I believe it will work.
Your methodology of handling pointers is mostly correct. However, when you put GlobalFVecs_d[i] in the parameter list, you are forcing the kernel setup code (running on the host) to take GlobalFVecs_d (a device pointer, created with cudaMalloc), add an appropriately scaled i to the pointer value, and then dereference the resultant pointer to retrieve the value to pass as a parameter to the kernel. But we are not allowed to dereference device pointers in host code.
However, because your methodology was mostly correct, you have a convenient parallel array of the same pointers that resides on the host. This array (GlobalFPVecs_d) is something that we are allowed to dereference into, in host code, to retrieve the resultant device pointer, to pass to the kernel.
It's an interesting bug because normally kernels do not seg fault (although they may throw an error), so a seg fault on a kernel invocation line is unusual. But in this case, the seg fault is occurring in the kernel setup code, not the kernel itself.

CUDA - memcpy2d - wrong pitch

I just started CUDA programming, and was trying to execute the code shown below. The idea is to copy a 2dimensional array to the device, calculate the sum of all elements and to retrieve the sum afterwards (I know that this algorithm is not parallelized. In fact it is doing more work, then necessary. This is however just intended as practice for memcopy).
#include<stdio.h>
#include<cuda.h>
#include <iostream>
#include <cutil_inline.h>
#define height 50
#define width 50
using namespace std;
// Device code
__global__ void kernel(float* devPtr, int pitch,int* sum)
{
int tempsum = 0;
for (int r = 0; r < height; ++r) {
int* row = (int*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c) {
int element = row[c];
tempsum = tempsum + element;
}
}
*sum = tempsum;
}
//Host Code
int main()
{
int testarray[2][8] = {{4,4,4,4,4,4,4,4},{4,4,4,4,4,4,4,4}};
int* sum =0;
int* sumhost = 0;
sumhost = (int*)malloc(sizeof(int));
cout << *sumhost << endl;
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(int), height);
cudaMemcpy2D(devPtr,pitch,testarray,0,8* sizeof(int),4,cudaMemcpyHostToDevice);
cudaMalloc((void**)&sum, sizeof(int));
kernel<<<1, 4>>>(devPtr, pitch, sum);
cutilCheckMsg("kernel launch failure");
cudaMemcpy(sumhost, sum, sizeof(int), cudaMemcpyDeviceToHost);
cout << *sumhost << endl;
return 0;
}
This code compiles just fine (on the 4.0 sdk release candidate). However as soon as I try to execute, I get
0
cpexample.cu(43) : cutilCheckMsg() CUTIL CUDA error : kernel launch failure : invalid pitch argument.
Which is unfortunate, since I have no idea how to fix it ;-(. As far as I know, the pitch is an offset in memory to allow faster copying of data. However such a pitch is only used in the device memory, not in the host memory, isn't it? Therefore the pitch of my host memory should be 0, shouldn't it?
Moreover I would also like to ask two other questions:
If i declare a variable like int* sumhost (see above), where does this pointer point to? At first to the host memory and after cudaMalloc to the device memory?
cutilCheckMsg was very handy in this case. Are there similar functions for debugging i should know of?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js