CUDA - memcpy2d - wrong pitch - c++

I just started CUDA programming and was trying to execute the code shown below. The idea is to copy a two-dimensional array to the device, calculate the sum of all elements, and retrieve the sum afterwards. (I know that this algorithm is not parallelized; in fact, it does more work than necessary. It is only intended as practice for memcpy.)
#include<stdio.h>
#include<cuda.h>
#include <iostream>
#include <cutil_inline.h>
#define height 50
#define width 50
using namespace std;
// Device code
__global__ void kernel(float* devPtr, int pitch,int* sum)
{
int tempsum = 0;
for (int r = 0; r < height; ++r) {
int* row = (int*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c) {
int element = row[c];
tempsum = tempsum + element;
}
}
*sum = tempsum;
}
//Host Code
int main()
{
int testarray[2][8] = {{4,4,4,4,4,4,4,4},{4,4,4,4,4,4,4,4}};
int* sum =0;
int* sumhost = 0;
sumhost = (int*)malloc(sizeof(int));
cout << *sumhost << endl;
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(int), height);
cudaMemcpy2D(devPtr,pitch,testarray,0,8* sizeof(int),4,cudaMemcpyHostToDevice);
cudaMalloc((void**)&sum, sizeof(int));
kernel<<<1, 4>>>(devPtr, pitch, sum);
cutilCheckMsg("kernel launch failure");
cudaMemcpy(sumhost, sum, sizeof(int), cudaMemcpyDeviceToHost);
cout << *sumhost << endl;
return 0;
}
This code compiles just fine (on the 4.0 sdk release candidate). However as soon as I try to execute, I get
0
cpexample.cu(43) : cutilCheckMsg() CUTIL CUDA error : kernel launch failure : invalid pitch argument.
Which is unfortunate, since I have no idea how to fix it ;-(. As far as I know, the pitch is an offset in memory to allow faster copying of data. However such a pitch is only used in the device memory, not in the host memory, isn't it? Therefore the pitch of my host memory should be 0, shouldn't it?
Moreover I would also like to ask two other questions:
If I declare a variable like int* sumhost (see above), where does this pointer point to? At first to host memory, and after cudaMalloc to device memory?
cutilCheckMsg was very handy in this case. Are there similar functions for debugging I should know of?

In this line of your code:
cudaMemcpy2D(devPtr,pitch,testarray,0,8* sizeof(int),4,cudaMemcpyHostToDevice);
you're saying the source pitch for testarray is equal to 0, but that cannot be right given the formula that defines pitch: T* elem = (T*)((char*)base_address + row * pitch) + column. With a pitch of 0, every row would resolve to the same address, so looking up a two-dimensional (x,y) offset would not give the right values. The rule is pitch = width + padding, with all three measured in bytes. On the host the padding is often 0, but the width is not 0 unless the array is empty. On the device there may be extra padding, which is why the pitch may not equal the declared width of the array. In general, then, pitch >= width. So even on the host side the source pitch must be at least the size of each row in bytes, which for testarray is 8*sizeof(int). Finally, your 2D array on the host has only 2 rows, not 4.
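To make the role of the two pitch arguments concrete, here is a minimal host-only sketch in plain C++ (no CUDA calls). The helper names copy2d and pitchedSum are made up for illustration. copy2d only imitates the row-by-row addressing of a pitched copy and rejects a pitch smaller than the row width, which is the condition described in the paragraph above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical host-only stand-in for a pitched 2D copy (not a CUDA API).
// Copies `height` rows of `widthBytes` bytes; consecutive rows start
// `spitch` bytes apart in the source and `dpitch` bytes apart in the destination.
// Returns false when either pitch is smaller than the row width.
bool copy2d(void* dst, std::size_t dpitch, const void* src, std::size_t spitch,
            std::size_t widthBytes, std::size_t height)
{
    if (dpitch < widthBytes || spitch < widthBytes) return false;
    for (std::size_t r = 0; r < height; ++r) {
        std::memcpy(static_cast<char*>(dst) + r * dpitch,
                    static_cast<const char*>(src) + r * spitch,
                    widthBytes);
    }
    return true;
}

// Sums a rows x cols block of ints stored with a row stride of `pitch` bytes,
// using the same address formula as in the answer: base + row * pitch.
int pitchedSum(const int* base, std::size_t pitch, int rows, int cols)
{
    int sum = 0;
    for (int r = 0; r < rows; ++r) {
        const int* row = reinterpret_cast<const int*>(
            reinterpret_cast<const char*>(base) + r * pitch);
        for (int c = 0; c < cols; ++c) sum += row[c];
    }
    return sum;
}
```

With a source declared as int[2][8], a call that passes 0 as the source pitch is rejected by this sketch, while a source pitch of 8 * sizeof(int) and a height of 2 copies both rows.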
As for your question about what happens with allocated pointers: if you allocate with malloc(), the pointer receives an address in host memory, so you can dereference it on the host but not on the device. A pointer allocated with cudaMalloc(), on the other hand, receives an address in device memory. If you dereference it on the host, it does not point to allocated host memory, and unpredictable results will ensue. It is fine, though, to pass this pointer to a kernel, since on the device it points to memory that the device can access. The CUDA runtime keeps these two memory spaces separate and provides memory-copy functions that copy between them, using the addresses held in these pointers as the source and/or destination depending on the direction (host-to-device or device-to-host). If you took the same int* and allocated it first with malloc() and then (after hopefully calling free() on it) with cudaMalloc(), it would first hold a host address and then a device address. You would have to keep track of which one it currently holds, to avoid dereferencing a device address in host code or a host address in device code.

Related

Copying structure containing 2d pointer to device

I have a question about copying a structure that contains a 2D pointer from the host to the device. My code is as follows:
struct mymatrix
{
matrix m;
int x;
};
size_t pitch;
mymatrix m_h[5];
for(int i=0; i<5;i++){
m_h[i].m = (float**) malloc(4 * sizeof(float*));
for (int idx = 0; idx < 4; ++idx)
{
m_h[i].m[idx] = (float*)malloc(4 * sizeof(float));
}
}
mymatrix *m_hh = (mymatrix*)malloc(5*sizeof(mymatrix));
memcpy(m_hh,m_h,5*sizeof(mymatrix));
for(int i=0 ; i<5 ;i++)
{
cudaMallocPitch((void**)&(m_hh[i].m),&pitch,4*sizeof(float),4);
cudaMemcpy2D(m_hh[i].m, pitch, m_h[i].m, 4*sizeof(float), 4*sizeof(float),4,cudaMemcpyHostToDevice);
}
mymatrix *m_d;
cudaMalloc((void**)&m_d,5*sizeof(mymatrix));
cudaMemcpy(m_d,m_hh,5*sizeof(mymatrix),cudaMemcpyHostToDevice);
distance_calculation_begins<<<1,16>>>(m_d,pitch);
Problem
With this code I am unable to access the 2D pointer elements of the structure on the device, although I can access x from the same structure. For example, if the kernel receives m_d as a pointer mymatrix* m and I initialize
m[0].m[0][0] = 5;
and print this value with
cuPrintf("The value is %f",m[0].m[0][0]);
on the device, I get no output. This means I am unable to use the 2D pointer. But if I access
m[0].x = 5;
then I am able to print it. I think my initializations are correct, but I am unable to figure out the problem. Help from anyone will be greatly appreciated.
In addition to the issues that @RobertCrovella noted in your code, also note:
You are only getting a shallow copy of your structure with the memcpy that copies m_h to m_hh.
You are assuming that the pitch is the same for every call to cudaMallocPitch() (you overwrite pitch on each call and only use the last value at the end). I think that might be a safe assumption for now, but it could change in the future.
You are using cudaMemcpy2D() with cudaMemcpyHostToDevice to copy to m_hh, which is on the host, not the device.
Using many small buffers and tables of pointers is not efficient in CUDA. The small allocations and deallocations can end up taking a lot of time. Also, using tables of pointers causes extra memory transactions, because the pointers must be retrieved from memory before they can be used as bases for indexing. So, if you consider a construct such as this:
a[10][20][30] = 3
The pointer at a[10] must first be retrieved from memory, causing your warp to be put on hold for a long time (up to around 600 cycles on Fermi). Then, the same thing happens for the second pointer, adding another 600 cycles. In addition, these requests are unlikely to be coalesced causing even more memory transactions.
As Robert mentioned, the solution is to flatten your memory structures. I've included an example of this, which you may be able to use as a basis for your program. As you can see, the code is overall much simpler. The part that does become a bit more complex is the index calculation. Also, this approach assumes that your matrices are all of the same size.
I have added error checking as well. If you had added error checking in your code, you would have found at least a couple of the bugs without any extra effort.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
typedef float* mymatrix;
const int n_matrixes(5);
const int w(4);
const int h(4);
// Wrap every CUDA runtime call so that a failure is reported with file and line.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
__global__ void test(mymatrix m_d, size_t pitch_floats)
{
    // Print the value at [2][3][4] (matrix 2, column 3, row offset 4).
    // All matrices are stacked vertically, so the flat row index is 2 * h + 4.
    printf("%f ", m_d[3 + (2 * h + 4) * pitch_floats]);
}
int main()
{
    // One flat, pinned host buffer holds all matrices back to back.
    mymatrix m_h;
    gpuErrchk(cudaMallocHost(&m_h, n_matrixes * w * sizeof(float) * h));
    // Set the value at [2][3][4].
    m_h[2 * (w * h) + 3 + 4 * w] = 5.0f;
    // Create a device copy of the matrix.
    mymatrix m_d;
    size_t pitch;
    gpuErrchk(cudaMallocPitch((void**)&m_d, &pitch, w * sizeof(float), n_matrixes * h));
    // The host rows are tightly packed, so the source pitch is the row width in bytes.
    gpuErrchk(cudaMemcpy2D(m_d, pitch, m_h, w * sizeof(float), w * sizeof(float), n_matrixes * h, cudaMemcpyHostToDevice));
    // The kernel indexes in floats, so pass the pitch converted from bytes to elements.
    test<<<1,1>>>(m_d, pitch / sizeof(float));
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    // Release the device and pinned host buffers.
    gpuErrchk(cudaFree(m_d));
    gpuErrchk(cudaFreeHost(m_h));
    return 0;
}
Your matrix m class/struct member appears to be some sort of double pointer based on how you are initializing it on the host:
m_h[i].m = (float**) malloc(4 * sizeof(float*));
Copying an array of structures with embedded pointers between host and device is somewhat complicated. Copying a data structure that is pointed to by a double pointer is also complicated.
For an array of structures with embedded pointers, refer to this posting.
For copying a 2D array (double pointer, i.e. **), refer to this posting. We don't use cudaMallocPitch/cudaMemcpy2D to accomplish this. (Note that cudaMemcpy2D takes single pointer * arguments, you are passing it double pointer ** arguments e.g. m_h[i].m)
Instead of the above approaches, it's recommended that you flatten your data so that it can all be referenced with single pointer referencing, with no embedded pointers.
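As a rough host-only illustration of what flattening means here, the sketch below stores several equally sized matrices in one contiguous buffer and computes each element's offset by hand. The names flatIndex and FlatMatrices are hypothetical and do not come from the question or from CUDA; on the device, the same index arithmetic would apply to a single pointer obtained from one allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: offset of element (row, col) of matrix `m` when
// matrices of size rows x cols are stored back to back in row-major order.
inline std::size_t flatIndex(std::size_t m, std::size_t row, std::size_t col,
                             std::size_t rows, std::size_t cols)
{
    return (m * rows + row) * cols + col;
}

// Hypothetical container: one contiguous buffer instead of a float** per matrix.
struct FlatMatrices {
    std::size_t count, rows, cols;
    std::vector<float> data;

    FlatMatrices(std::size_t n, std::size_t r, std::size_t c)
        : count(n), rows(r), cols(c), data(n * r * c, 0.0f) {}

    // Bounds-checked access to a single element through the flat buffer.
    float& at(std::size_t m, std::size_t row, std::size_t col)
    {
        assert(m < count && row < rows && col < cols);
        return data[flatIndex(m, row, col, rows, cols)];
    }
};
```

Because data is a single block with no embedded pointers, it could be copied to the device with one copy call, and the struct would only need to carry the dimensions alongside it.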

How to allocate Array of Pointers and preserve them for multiple kernel calls in cuda

I am trying to implement an algorithm in CUDA and I need to allocate an array of pointers that point into an array of structs. My struct is, let's say:
typedef struct {
float x, y;
} point;
I know that if I want to preserve the arrays across multiple kernel calls I have to control them from the host, is that right? The initialization of the pointers must be done from within the kernel. To be more specific, the array of structs P will contain Cartesian points in random order, while dev_S_x will be a version of the points in P sorted by x coordinate.
I have tried with:
__global__ void test( point *dev_P, point **dev_S_x) {
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
dev_P[tid].x = 3.141516;
dev_P[tid].y = 3.141516;
dev_S_x[tid] = &dev_P[tid];
...
}
and:
int main( void ) {
point *P, *dev_P, **S_x, *dev_S_x;
P = (point*) malloc (N * sizeof (point) );
S_x = (point**) malloc (N * sizeof (point*));
// allocate the memory on the GPU
cudaMalloc( (void**) &dev_P, N * sizeof(point) );
cudaMalloc( (void***) &dev_S_x, N * sizeof(point*));
// copy the array P to the GPU
cudaMemcpy( dev_P, P, N * sizeof(point), cudaMemcpyHostToDevice);
cudaMemcpy( dev_S_x,S_x,N * sizeof(point*), cudaMemcpyHostToDevice);
test <<<1, 1 >>>( dev_P, &dev_S_x);
...
return 0;
}
which leads to many
First-chance exception at 0x000007fefcc89e5d (KernelBase.dll) in Test_project_cuda.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0020f920..
Critical error detected c0000374
Am I doing something wrong in the cudaMalloc of the array of pointers, or is it something else? Is the usage of (void***) correct? I would like to use, for example, dev_S_x[tid]->x or dev_S_x[tid]->y from within the kernels, pointing to device memory addresses. Is that feasible?
Thanks in advance
dev_S_x should be declared as point ** and should be passed to the kernel as a value (i.e. test <<<1, 1 >>>(dev_P, dev_S_x);).
Putting that to one side, what you describe sounds like a natural fit for Thrust, which will give you a simpler memory management strategy and access to fast sort routines.
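For orientation only, here is a host-side sketch of the data layout the question describes: an unsorted array of points plus a separate array of pointers into it, ordered by x. It uses std::sort on the CPU, not Thrust and not device code, and the function name sortedByX is invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct point { float x, y; };

// Builds an array of pointers into P, ordered by ascending x coordinate.
// P itself is left in its original order.
std::vector<const point*> sortedByX(const std::vector<point>& P)
{
    std::vector<const point*> S_x;
    S_x.reserve(P.size());
    for (std::size_t i = 0; i < P.size(); ++i) S_x.push_back(&P[i]);
    std::sort(S_x.begin(), S_x.end(),
              [](const point* a, const point* b) { return a->x < b->x; });
    assert(S_x.size() == P.size());
    return S_x;
}
```

The pointers in S_x stay valid only as long as P is not reallocated, which mirrors the requirement in the question that both arrays live in the same memory space for the lifetime of the kernels that use them.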

memory allocated in assembly using malloc - want to convert it to a 3-D array in C++

I have an assembly segment of the program that does a huge malloc (typically of the order of 8Gb), populates it and does computations on it.
For debugging purposes I want to be able to convert this allocated and pre-filled memory as a 3-D array in C/C++. I specifically do not want to allocate another 8 GB because declaring unsigned char* debug_arr[crystal_size][crystal_size][crystal_size] and doing an element-by-element copy will result in a stack overflow.
I would ideally love to type cast the memory pointer to an 3D array pointer ... Is it possible ?
Objective is to verify the computation results done in Assembly segment.
My C/C++ knowledge is average. I mostly use 64-bit assembly, so could you please give me the C++ typecasting in some detail?
Env: Intel Core i7 2600K @ 4.4 GHz with 16 GB RAM, 64-bit assembly programming on 64-bit Windows 7, Visual Studio Express 2012
Thanks...
If you want to access a single unsigned char entry as if from a 3D array, you obviously need the relevant dimensions (call them nXDim, nYDim, nZDim for the sake of argument) and you need to know what dimension order has been assumed during writing.
If we assume that z changes less frequently than y and y less frequently than x then you can access your array via a function such as this:
unsigned char* GetEntry(int nX, int nY, int nZ)
{
return &pYourArray[(nZ * nXDim * nYDim) + (nY * nXDim) + nX];
}
First check which ordering is used in your memory. There are two types: row-major ordering and column-major ordering.
For row major ordering
Address = Base + ((depthindex*col_size+colindex) * row_size + rowindex) * Element_Size
For column major ordering
Address = Base + ((rowindex*col_size+colindex) * depth_size + depthindex) * Element_Size
Here is an example for you to expand on:
char array[10000]; // One dimensional array
char * mat[100]; // Matrix for 2D array
for ( int i = 0; i < 100; i++ )
mat[i] = array + i * 100;
Now, you have the matrix as a 100x100 element 2D array in the same memory as the array.
If you know the dimensions at compile time, then something like this should work:
void * crystal_cube = 0; // set by asm magic
// Pointer to 2044x2044 planes of bytes, so debug_cube[z][y][x] indexes the existing memory.
typedef unsigned char (*DEBUG_CUBE)[2044][2044];
DEBUG_CUBE debug_cube = (DEBUG_CUBE) crystal_cube;
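The two approaches from the answers above (explicit index arithmetic and a cast to a pointer-to-array type) can be compared on a tiny buffer. This is a sketch under assumed dimensions: kDimY, kDimX, CubeView, asCube and entry are illustrative names, and the sizes are deliberately small rather than the question's real cube size. It also assumes x varies fastest in memory, which has to be checked against how the assembly code writes the data.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical compile-time dimensions of the inner two axes.
const std::size_t kDimY = 3;
const std::size_t kDimX = 4;

// Pointer to planes of kDimY x kDimX bytes: view[z][y][x].
typedef unsigned char (*CubeView)[kDimY][kDimX];

// Reinterprets an existing flat buffer as a 3D view without copying it.
inline CubeView asCube(void* raw)
{
    assert(raw != 0);
    return static_cast<CubeView>(raw);
}

// The same element reached through explicit index arithmetic (x varies fastest).
inline unsigned char& entry(unsigned char* raw,
                            std::size_t x, std::size_t y, std::size_t z)
{
    return raw[(z * kDimY + y) * kDimX + x];
}
```

Both forms address the same byte, so asCube(buf)[z][y][x] and entry(buf, x, y, z) can be used interchangeably; neither allocates or copies anything.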

allocate two arrays calling cudaMalloc once

Memory allocation is one of the most time consuming operations in a GPU so I wanted to allocate 2 arrays by calling cudaMalloc once using the following code:
int numElements = 50000;
size_t size = numElements * sizeof(float);
//declarations-initializations
float *d_M = NULL;
err = cudaMalloc((void **)&d_M, 2*size);
//error checking
// Allocate the device input vector A
float *d_A = d_M;
// Allocate the device input vector B
float *d_B = d_M + size;
err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
//error checking
err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
//error checking
The original code is inside the samples folder of the cuda toolkit named vectorAdd.cu so you can assume h_A, h_B are properly initiated and the code works without the modification I made.
The result was that the second cudaMemcpy returned an error with message invalid argument.
It seems that the operation "d_M + size" does not return what someone would expect as device memory behaves differently but I don't know how.
Is it possible to make my approach (calling cudaMalloc once to allocate memory for two arrays) work? Any comments/answers on whether this is a good approach are also welcome.
UPDATE
As the answers of Robert and dreamcrash suggested, I had to add the number of elements (numElements) to the pointer d_M, not size, which is the number of bytes. Just for reference, there was no observable speedup.
You just have to replace
float *d_B = d_M + size;
with
float *d_B = d_M + numElements;
This is pointer arithmetic. If you have an array of floats R = [1.0,1.2,3.3,3.4], you can print its first element with printf("%f",*R);.
And the second element? You just write printf("%f\n",*(R + 1));, that is, R[1]. You do not write *(R + sizeof(float)), as you were doing. That would access the element at position R[4], since sizeof(float) = 4.
When you declare float *d_B = d_M + numElements; the compiler assumes that d_B points into contiguously allocated memory in which each element has the size of a float. Hence, you do not specify the distance in bytes but in elements, and the compiler does the math for you. This approach is more human-friendly, since it is more intuitive to express pointer arithmetic in terms of elements than in terms of bytes. It is also more portable: if the number of bytes of a given type changes with the underlying architecture, the compiler handles that for you, so the code will not break because it assumed a fixed byte size.
You said that "The result was that the second cudaMemcpy returned an error with message invalid argument":
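If you print the number corresponding to this error, it will print 11 (with the toolkit used here; the numeric value is not the same in every CUDA release), and if you check the CUDA API documentation you can verify that this error corresponds to: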
cudaErrorInvalidValue
This indicates that one or more of the parameters passed to the API
call is not within an acceptable range of values.
In your example, this means that float *d_B = d_M + size; goes out of range.
You have allocated space for 100000 floats. d_A covers elements 0 to 49999, but according to your code d_B starts at element numElements * sizeof(float) = 50000 * 4 = 200000. Since 200000 > 100000, you get invalid argument.
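The arithmetic above can be checked on the host with a small sketch. The helpers secondHalf and fitsInAllocation are hypothetical names introduced for this example; they only model the element-versus-byte distinction and make no CUDA calls.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: start of the second array inside one allocation that
// holds 2 * numElements floats. The offset is counted in elements, not bytes.
inline float* secondHalf(float* base, std::size_t numElements)
{
    assert(base != 0);
    return base + numElements;
}

// Hypothetical check: does an array of `count` elements that starts at element
// `offset` stay inside an allocation of `total` elements?
inline bool fitsInAllocation(std::size_t offset, std::size_t count, std::size_t total)
{
    return offset <= total && count <= total - offset;
}
```

With the numbers from the question, an offset of 50000 elements fits into the 100000-element allocation, whereas an offset of 50000 * sizeof(float) elements does not.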