allocate two arrays by calling cudaMalloc once - C++

Memory allocation is one of the most time-consuming operations on a GPU, so I wanted to allocate two arrays with a single cudaMalloc call, using the following code:
int numElements = 50000;
size_t size = numElements * sizeof(float);
//declarations-initializations
float *d_M = NULL;
err = cudaMalloc((void **)&d_M, 2*size);
//error checking
// Allocate the device input vector A
float *d_A = d_M;
// Allocate the device input vector B
float *d_B = d_M + size;
err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
//error checking
err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
//error checking
The original code is in the samples folder of the CUDA Toolkit, named vectorAdd.cu, so you can assume h_A and h_B are properly initialized and that the code works without the modification I made.
The result was that the second cudaMemcpy returned an error with the message invalid argument.
It seems that the expression d_M + size does not return what one would expect; device memory seems to behave differently, but I don't know how.
Is it possible to make my approach (calling cudaMalloc once to allocate memory for two arrays) work? Any comments/answers on whether this is a good approach are also welcome.
UPDATE
As the answers of Robert and dreamcrash suggested, I had to add the number of elements (numElements) to the pointer d_M, not the size, which is the number of bytes. For reference, there was no observable speedup.

You just have to replace
float *d_B = d_M + size;
with
float *d_B = d_M + numElements;
This is pointer arithmetic. If you have an array of floats, float R[] = {1.0f, 1.2f, 3.3f, 3.4f};, you can print its first element with printf("%f", *R);.
And the second element? You just do printf("%f\n", *(R + 1)); — that is, an offset of 1, not sizeof(float), as you were doing. If you compute R + sizeof(float), you land on the element at position R[4], since sizeof(float) == 4.
When you declare float *d_B = d_M + numElements;, the compiler knows that d_M points to a contiguous block of floats and that each element has the size of a float. Hence, you do not specify the distance in bytes but rather in elements; the compiler does the math for you. This approach is more human-friendly, since it is more intuitive to express pointer arithmetic in terms of elements than in terms of bytes. It is also more portable: if the byte size of a given type changes with the underlying architecture, the compiler handles that for you, so your code will not break because you assumed a fixed byte size.
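A tiny standalone illustration of the difference (plain C, with made-up values): both pointers below name the same element, one via the idiomatic element offset and one via an explicit byte offset.

#include <stdio.h>

int main(void) {
    float R[] = {1.0f, 1.2f, 3.3f, 3.4f};
    float *second = R + 1;                                       // element arithmetic
    float *second_bytes = (float *)((char *)R + sizeof(float));  // explicit byte arithmetic
    printf("%f %f\n", *second, *second_bytes);                   // both print 1.200000
    return 0;
}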
You said that "The result was that the second cudaMemcpy returned an error with message invalid argument":
If you print the number corresponding to this error, it will print 11, and if you check the CUDA API you will see that this error corresponds to:
cudaErrorInvalidValue
This indicates that one or more of the parameters passed to the API
call is not within an acceptable range of values.
In your example, this means that float *d_B = d_M + size; points outside the allocated range.
You allocated space for 100000 floats: d_A covers elements 0 through 49999, but according to your code d_B starts at element numElements * sizeof(float) = 50000 * 4 = 200000, and since 200000 > 100000 you get the invalid argument error.
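For completeness, here is a minimal sketch of the working pattern, assuming h_A and h_B are initialized float arrays of numElements elements, as in vectorAdd.cu:

int numElements = 50000;
size_t size = numElements * sizeof(float);

float *d_M = NULL;
cudaError_t err = cudaMalloc((void **)&d_M, 2 * size); // one allocation holds both arrays
// error checking as in the sample

float *d_A = d_M;               // first half of the block
float *d_B = d_M + numElements; // second half: offset in elements, not bytes

err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

// ... kernel launch and copy back ...
cudaFree(d_M); // a single free releases both arrays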

Related

correct way of copying and printing 2dim array on CUDA device [duplicate]

I just started CUDA programming and was trying to execute the code shown below. The idea is to copy a two-dimensional array to the device, calculate the sum of all elements, and retrieve the sum afterwards. (I know that this algorithm is not parallelized; in fact it is doing more work than necessary. It is just intended as practice for memcpy.)
#include <stdio.h>
#include <cuda.h>
#include <iostream>
#include <cutil_inline.h>

#define height 50
#define width 50

using namespace std;

// Device code
__global__ void kernel(float* devPtr, int pitch, int* sum)
{
    int tempsum = 0;
    for (int r = 0; r < height; ++r) {
        int* row = (int*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            int element = row[c];
            tempsum = tempsum + element;
        }
    }
    *sum = tempsum;
}

// Host code
int main()
{
    int testarray[2][8] = {{4,4,4,4,4,4,4,4},{4,4,4,4,4,4,4,4}};
    int* sum = 0;
    int* sumhost = 0;
    sumhost = (int*)malloc(sizeof(int));
    cout << *sumhost << endl;
    float* devPtr;
    size_t pitch;
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(int), height);
    cudaMemcpy2D(devPtr, pitch, testarray, 0, 8 * sizeof(int), 4, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&sum, sizeof(int));
    kernel<<<1, 4>>>(devPtr, pitch, sum);
    cutilCheckMsg("kernel launch failure");
    cudaMemcpy(sumhost, sum, sizeof(int), cudaMemcpyDeviceToHost);
    cout << *sumhost << endl;
    return 0;
}
This code compiles just fine (on the 4.0 SDK release candidate). However, as soon as I try to execute it, I get
0
cpexample.cu(43) : cutilCheckMsg() CUTIL CUDA error : kernel launch failure : invalid pitch argument.
Which is unfortunate, since I have no idea how to fix it ;-(. As far as I know, the pitch is an offset in memory that allows faster copying of data. However, such a pitch is only used in device memory, not in host memory, isn't it? Therefore the pitch of my host memory should be 0, shouldn't it?
Moreover, I would also like to ask two other questions:
If I declare a variable like int* sumhost (see above), where does this pointer point? At first to host memory, and after cudaMalloc to device memory?
cutilCheckMsg was very handy in this case. Are there similar functions for debugging I should know of?
In this line of your code:
cudaMemcpy2D(devPtr,pitch,testarray,0,8* sizeof(int),4,cudaMemcpyHostToDevice);
you're saying the source pitch value for testarray is equal to 0. But how can that be, when the formula for a pitched lookup is T* elem = (T*)((char*)base_address + row * pitch) + column? If we substituted 0 for pitch in that formula, we would not get the right addresses when looking up a 2-dimensional (x, y) offset.

One thing to consider is that the rule for the pitch value is pitch = width + padding. On the host the padding is often 0, but the width is not 0 unless there is nothing in your array. On the device side there may be extra padding, which is why the pitch may not equal the declared width of the array. So you can conclude that pitch >= width, with the difference being the padding. Even on the host side, then, the source pitch should be at least the size of each row in bytes; in the case of testarray, that is 8 * sizeof(int).

Finally, the height of your 2D array on the host is only 2 rows, not 4.
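Putting that together, a hedged sketch of what the copy could look like for this 2x8 host array (the kernel's float*/int mismatch and the 50x50 allocation are separate issues, left as in the question):

// source pitch = one full host row in bytes (no padding on the host);
// height = the 2 rows actually present in testarray
cudaMemcpy2D(devPtr, pitch,              // destination and its device pitch
             testarray, 8 * sizeof(int), // source and its pitch
             8 * sizeof(int),            // width of each row to copy, in bytes
             2,                          // number of rows
             cudaMemcpyHostToDevice);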
As for your question about what happens with allocated pointers: if you allocate a pointer with malloc(), the pointer is given an address value that resides in host memory, so you can dereference it on the host side but not on the device side. On the other hand, a pointer allocated with cudaMalloc() is given an address of memory residing on the device. If you dereference it on the host, it is not pointing to allocated host memory, and unpredictable results will ensue. It is fine, though, to pass this pointer address to a kernel, since when it is dereferenced on the device side it points to memory locally accessible to the device.

Overall, the CUDA runtime keeps these two memory spaces separate and provides memory copy functions that copy back and forth between device and host, using the address values from these pointers as the source and/or destination depending on the desired direction (host-to-device or device-to-host). If you took the same int*, first allocated it with malloc(), and then (after hopefully calling free() on it) with cudaMalloc(), your pointer would first hold an address in host memory and then one in device memory. You would have to keep track of its state in order to avoid unpredictable results from dereferencing it in the wrong place, depending on whether the dereference happens in host code or device code.
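A minimal sketch of that separation (hypothetical names, error checking omitted):

int *h_p = (int *)malloc(sizeof(int));   // host address: dereference on the host only
int *d_p = NULL;
cudaMalloc((void **)&d_p, sizeof(int));  // device address: dereference in kernels only

*h_p = 42;                               // fine: host pointer used on the host
// *d_p = 42;                            // undefined here: d_p points to device memory

cudaMemcpy(d_p, h_p, sizeof(int), cudaMemcpyHostToDevice); // the runtime bridges the two

free(h_p);
cudaFree(d_p);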

CUDA First-chance Exception Stack overflow Error

CUDA/C++ noob here.
The error I receive on attempting to debug my CUDA project is:
First-chance exception at 0x000000013F889467 in simple6.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000000223000).
The program '[2668] simple6.exe' has exited with code 0 (0x0).
From research on the web, it seems that I have some large variables that are too large for the "stack" and need to be moved to the "heap".
Can someone please provide me the appropriate code modifications?
My code is below. The point of this kernel is to use h_S and h_TM to create a bunch of values and write these values into h_F at the very end. This is why h_F is never copied into the GPU.
int main()
{
    int blockSize = 1024;
    int gridSize = 1;
    const int reps = 1024;
    const int iterations = 18000;

    float h_F[reps * iterations] = {0};
    double h_S[reps] = {0}; // not actually zeros in my code, this just simplifies things
    float h_TM[2592] = {0}; // not actually zeros in my code, this just simplifies things

    // Device vectors
    float *d_F;
    double *d_S;
    float *d_TM;

    // Select GPU
    cudaSetDevice(0);

    // Allocate memory for each vector on the GPU
    cudaMalloc((void**)&d_F, iterations * reps * sizeof(float));
    cudaMalloc((void**)&d_S, reps * sizeof(double));
    cudaMalloc((void**)&d_TM, 2592 * sizeof(float));

    // Copy host vectors to the device
    cudaMemcpy(d_S, h_S, reps * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_TM, h_TM, 2592 * sizeof(float), cudaMemcpyHostToDevice);

    // Execute the kernel
    myKern<<<gridSize, blockSize>>>(d_TM, d_F, d_S, reps);
    cudaDeviceSynchronize();

    // Copy the result array back to the host
    cudaMemcpy(h_F, d_F, iterations * reps * sizeof(float), cudaMemcpyDeviceToHost);

    // Release device memory
    cudaFree(d_F);
    cudaFree(d_TM);
    cudaFree(d_S);
    cudaDeviceReset();
    return 0;
}
Also, related, but would making these huge input arrays "shared" variables solve my problem?
Many thanks.
So I read through your code, and it seems like only one of those three arrays is actually going to cause the stack overflow error (assuming reps doesn't get too big). The array causing the problem is h_F. All you have to do is declare h_F so that it is placed on the heap instead of the stack, as you said.
This is literally a one line change.
Simply declare h_F like this:
float *h_F = new float[(reps * iterations)];
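Remember to release it with delete[] h_F; when you are done. If you prefer automatic cleanup, a std::vector is an alternative — a sketch, assuming the rest of main() stays the same:

#include <vector>

std::vector<float> h_F(reps * iterations, 0.0f); // zero-initialized, lives on the heap
// ... kernel launch as before, then copy back into the vector's storage:
cudaMemcpy(h_F.data(), d_F, iterations * reps * sizeof(float), cudaMemcpyDeviceToHost);
// no delete[] needed: the vector frees its memory automatically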
Good luck!

How to allocate Array of Pointers and preserve them for multiple kernel calls in cuda

I am trying to implement an algorithm in CUDA, and I need to allocate an array of pointers that point to an array of structs. My struct is, let's say:
typedef struct {
float x, y;
} point;
I know that if I want to preserve the arrays across multiple kernel calls I have to manage them from the host, is that right? The initialization of the pointers must be done from within the kernel. To be more specific, the array of structs P will contain the Cartesian points in random order, while dev_S_x will be a version sorted by the x coordinate of the points in P.
I have tried with:
__global__ void test( point *dev_P, point **dev_S_x) {
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    dev_P[tid].x = 3.141516;
    dev_P[tid].y = 3.141516;
    dev_S_x[tid] = &dev_P[tid];
    ...
}
and:
int main( void ) {
    point *P, *dev_P, **S_x, *dev_S_x;
    P = (point*) malloc (N * sizeof (point));
    S_x = (point**) malloc (N * sizeof (point*));
    // allocate the memory on the GPU
    cudaMalloc( (void**) &dev_P, N * sizeof(point) );
    cudaMalloc( (void***) &dev_S_x, N * sizeof(point*));
    // copy the array P to the GPU
    cudaMemcpy( dev_P, P, N * sizeof(point), cudaMemcpyHostToDevice);
    cudaMemcpy( dev_S_x, S_x, N * sizeof(point*), cudaMemcpyHostToDevice);
    test <<<1, 1 >>>( dev_P, &dev_S_x);
    ...
    return 0;
}
which leads to many exceptions of the form
First-chance exception at 0x000007fefcc89e5d (KernelBase.dll) in Test_project_cuda.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0020f920..
Critical error detected c0000374
Am I doing something wrong in the cudaMalloc of the array of pointers, or is it something else? Is the usage of (void***) correct? I would like to use, for example, dev_S_x[tid]->x or dev_S_x[tid]->y from within kernels, pointing to device memory addresses. Is that feasible?
Thanks in advance
dev_S_x should be declared as point ** and should be passed to the kernel as a value (i.e. test <<<1, 1 >>>(dev_P, dev_S_x);).
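For reference, a minimal sketch of the corrected host side (assuming N is defined and the kernel is launched with one thread per element):

point *dev_P = NULL;
point **dev_S_x = NULL; // note: point **, the pointer array itself lives in device memory

cudaMalloc((void **)&dev_P, N * sizeof(point));
cudaMalloc((void **)&dev_S_x, N * sizeof(point *)); // cudaMalloc always takes void **

// no host-to-device copy of S_x is needed: the kernel fills the array in
test<<<1, N>>>(dev_P, dev_S_x); // pass the device pointer by value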
Putting that to one side, what you describe sounds like a natural fit for Thrust, which will give you a simpler memory management strategy and access to fast sort routines.

memory allocated in assembly using malloc - want to convert it to a 3-D array in C++

I have an assembly segment of the program that does a huge malloc (typically on the order of 8 GB), populates it, and does computations on it.
For debugging purposes I want to be able to view this allocated and pre-filled memory as a 3-D array in C/C++. I specifically do not want to allocate another 8 GB, because declaring unsigned char debug_arr[crystal_size][crystal_size][crystal_size] and doing an element-by-element copy would result in a stack overflow.
I would ideally love to typecast the memory pointer to a 3-D array pointer... Is that possible?
The objective is to verify the computation results done in the assembly segment.
My C/C++ knowledge is average. I mostly use 64-bit assembly, so please spell out the C++ typecasting in some detail.
Env: Intel Core i7 2600K @ 4.4 GHz with 16 GB RAM, 64-bit assembly programming on 64-bit Windows 7, Visual Studio Express 2012
Thanks...
If you want to access a single unsigned char entry as if from a 3D array, you obviously need the relevant dimensions (call them nXDim, nYDim, nZDim for the sake of argument), and you need to know what dimension order was assumed when the memory was written.
If we assume that z changes less frequently than y, and y less frequently than x, then you can access your array via a function such as this:
unsigned char* GetEntry(int nX, int nY, int nZ)
{
return &pYourArray[(nZ * nXDim * nYDim) + (nY * nXDim) + nX];
}
First, check what ordering is used in your memory: there are two types, row-major ordering and column-major ordering.
For row-major ordering:
Address = Base + ((depthindex * col_size + colindex) * row_size + rowindex) * Element_Size
For column-major ordering:
Address = Base + ((rowindex * col_size + colindex) * depth_size + depthindex) * Element_Size
Here is an example for you to expand on:
char array[10000]; // One-dimensional array
char * mat[100];   // Matrix for 2D array
for ( int i = 0; i < 100; i++ )
    mat[i] = array + i * 100;
Now, you have the matrix as a 100x100 element 2D array in the same memory as the array.
If you know the dimensions at compile time, then something like this:
void * crystal_cube = 0; // set by asm magic
typedef unsigned char (*DEBUG_CUBE)[2044][2044]; // pointer to 2044x2044 planes of bytes
DEBUG_CUBE debug_cube = (DEBUG_CUBE) crystal_cube;
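Hypothetical usage, assuming the assembly side really filled 2044*2044*2044 bytes with z varying slowest and x fastest (nZ, nY, nX are made-up indices):

unsigned char v = debug_cube[nZ][nY][nX]; // plain 3-D indexing, no copy involved
printf("%u\n", (unsigned)v);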
