CUDA First-chance Exception Stack overflow Error - c++

CUDA/C++ noob here.
The error I receive on attempting to debug my CUDA project is:
First-chance exception at 0x000000013F889467 in simple6.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000000223000).
The program '[2668] simple6.exe' has exited with code 0 (0x0).
From research on the web, it seems that I have some large variables that are too large for the "stack" and need to be moved to the "heap".
Can someone please provide me the appropriate code modifications?
My code is below. The point of this kernel is to use h_S and h_TM to create a bunch of values and write these values into h_F at the very end. This is why h_F is never copied into the GPU.
int main()
{
int blockSize= 1024;
int gridSize = 1;
const int reps = 1024;
const int iterations = 18000;
int h_F [reps * iterations] = {0};
int h_S [reps] = {0}; // not actually zeros in my code this just simplifies things
int h_TM [2592] = {0} // not actually zeros in my code this just simplifies things
// Device input vectors
float *d_F;
double *d_S;
float *d_TM;
//Select GPU
cudaSetDevice(0);
// Allocate memory for each vector on GPU
cudaMalloc((void**)&d_F, iterations * reps * sizeof(float));
cudaMalloc((void**)&d_S, reps * sizeof(double));
cudaMalloc((void**)&d_TM, 2592 * sizeof(float));
// Copy host vectors to device
cudaMemcpy( d_S, h_S, reps * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy( d_TM, h_TM, 2592 * sizeof(float), cudaMemcpyHostToDevice);
// Execute the kernel
myKern<<<gridSize, blockSize>>>(d_TM, d_F, d_S, reps);
cudaDeviceSynchronize();
// Copy array back to host
cudaMemcpy( h_F, d_F, iterations * reps * sizeof(float), cudaMemcpyDeviceToHost );
// Release device memory
cudaFree(d_F);
cudaFree(d_TM);
cudaFree(d_S);
cudaDeviceReset();
return 0;
Also, related, but would making these huge input arrays "shared" variables solve my problem?
Many thanks.

So I read through your code and it seems like only one of those 3 arrays are actually going to be causing the stack overflow error. This is assuming your reps doesn't get too big. This array causing the problem is h_F. All you have to do is declare h_F so that it gets placed on the heap instead of the stack, as you said.
This is literally a one line change.
Simply declare h_F like this:
float *h_F = new float[(reps * iterations)];
Good luck!

Related

Accessing memory allocated on CUDA [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 10 months ago.
Improve this question
I don't really have any experience with CUDA. I have C++ script that looks like the following
for (int i = 0; i < n; ++i) {
// out_data here is a pointer to some chunk of memory on a CPU
out_data[i] = manipulate_out_data_val(out_data[i]);
}
This is currently set up for CPUs. I would like to adapt this to work with GPU allocated arrays, i.e., if out_data was allocated on GPU, how can do I write the above loop?
I tried porting it over as is with a GPU-allocated array, and the program seg-faults.
I'm not sure if this is relevant, but manipulate_out_data_val applies a constant scaling factor to the input value and then adds a constant to the resulting scaled value.
So firstly, I will convert your function into a CUDA kernel which looks something like this.
__global__ void manipulate_out_data_val(int *array)
{
// Assuming `20` is just a scaling factor.
array[threadIdx.x] *= 20;
}
Please note that for loops will not be used anymore because of the threadIdx parameter that is provided by CUDA. The thread's index replaces the i from your for loop. Please refer to this document to learn more about the CUDA's threading model.
Lets assume the array can store up to 100 integers.
int n = 100;
int bytes = n * sizeof(int);
Initialise an array on the CPU first.
int *arr_cpu;
arr_cpu = (int *)malloc(bytes);
for(int i = 0;i < n;i++) {
arr_cpu[i] = i;
}
Allocate some memory on the GPU
int *arr_gpu;
cudaMalloc((void **)&arr_gpu, n*sizeof(int));
Now, you can copy your CPU array to this allocated GPU memory using the cudaMemcpu function. Note that Host indicates CPU and Device indicates GPU as stated here
cudaMemcpy(arr_gpu, arr_cpu, n * sizeof(int), cudaMemcpyHostToDevice);
Finally you can run your kernel. Note the number 1 in kernel syntax is number of blocks and n is number of threads per block.
manipulate_out_data_val<<<1, n>>>(arr_gpu);
Wait until the kernel is finished running
cudaDeviceSynchronize();
Finally, you can move the array from GPU back to CPU
cudaMemcpy(arr_cpu, arr_gpu, n * sizeof(int), cudaMemcpyDeviceToHost);
Please find the whole code here:
#include <iostream>
#include <cuda.h>
using namespace std;
__global__ void manipulate_out_data_val(int *array)
{
// Can add your constant scaling logic here
array[threadIdx.x] *= 20;
}
int main(int argc,char **argv)
{
int n = 100;
int bytes = n * sizeof(int);
int *arr_cpu;
arr_cpu = (int *)malloc(bytes);
for(int i=0;i<n;i++)
arr_cpu[i]=i;
int *arr_gpu;
cudaMalloc((void **)&arr_gpu, n*sizeof(int));
printf("Copying to device..\n");
cudaMemcpy(arr_gpu, arr_cpu, n * sizeof(int), cudaMemcpyHostToDevice);
manipulate_out_data_val<<<1, n>>>(arr_gpu);
cudaDeviceSynchronize();
cudaMemcpy(arr_cpu, arr_gpu, n * sizeof(int), cudaMemcpyDeviceToHost);
for(int i=0;i<n;i++)
printf("%d,", arr_cpu[i]);
cudaFree(arr_gpu);
return 0;
}
Build and run using:
# program.cu is the file containing the code
nvcc program.cu -o program
# Run
./program
The above code has been tested on CUDA 11.4.

CUDA cudaMemCpy doesn't appear to copy despite CudaSuccess

I'm just starting with CUDA and this is my very first project. I've done a search for this issue and while I've noticed other people have had similar problems, none of the suggestions seemed relevant to my specific issue or have helped in my case.
As an exercise, I'm trying to write an n-body simulation using CUDA. At this stage I'm not interested whether my specific implementation is efficient or not, I'm just looking for something that works and I can refine it later. I'll also need to update the code later, once it's working, to work on my SLI configuration.
Here's a brief outline of the process:
Create X and Y position, velocity, acceleration vectors.
Create same vectors on GPU and copy values across
In a loop: (i) calculate acceleration for the iteration, (ii) apply acceleration to velocities and positions, and (iii) copy positions back to host for display.
(Display not implemented yet. I'll do this later)
Don't worry about the acceleration calculation function for now, here is the update function:
__global__ void apply_acc(double* pos_x, double* pos_y, double* vel_x, double* vel_y, double* acc_x, double* acc_y, int N)
{
int i = threadIdx.x;
if (i < N);
{
vel_x[i] += acc_x[i];
vel_y[i] += acc_y[i];
pos_x[i] += vel_x[i];
pos_y[i] += vel_y[i];
}
}
And here's some of the code in the main method:
cudaError t;
t = cudaMalloc(&d_pos_x, N * sizeof(double));
t = cudaMalloc(&d_pos_y, N * sizeof(double));
t = cudaMalloc(&d_vel_x, N * sizeof(double));
t = cudaMalloc(&d_vel_y, N * sizeof(double));
t = cudaMalloc(&d_acc_x, N * sizeof(double));
t = cudaMalloc(&d_acc_y, N * sizeof(double));
t = cudaMemcpy(d_pos_x, pos_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_pos_y, pos_y, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_vel_x, vel_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_vel_y, vel_y, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_acc_x, acc_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_acc_y, acc_y, N * sizeof(double), cudaMemcpyHostToDevice);
while (true)
{
calc_acc<<<1, N>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);
apply_acc<<<1, N>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);
t = cudaMemcpy(pos_x, d_pos_x, N * sizeof(double), cudaMemcpyDeviceToHost);
t = cudaMemcpy(pos_y, d_pos_y, N * sizeof(double), cudaMemcpyDeviceToHost);
std::cout << pos_x[0] << std::endl;
}
Every loop, cout writes the same value, whatever random value it was set to when the position arrays were original created. If I change the code in apply_acc to something like:
__global__ void apply_acc(double* pos_x, double* pos_y, double* vel_x, double* vel_y, double* acc_x, double* acc_y, int N)
{
int i = threadIdx.x;
if (i < N);
{
pos_x[i] += 1.0;
pos_y[i] += 1.0;
}
}
then it still gives the same value, so either apply_acc isn't being called or the cudaMemcpy isn't copying the data back.
All the cudaMalloc and cudaMemcpy calls return cudaScuccess.
Here's a PasteBin link to the complete code. It should be fairly simple to follow as there's a lot of repetition for the various arrays.
Like I said, I've never written CUDA code before, and I wrote this based on the #2 CUDA example video from NVidia where the guy writes the parallel array addition code. I'm not sure if it makes any difference, but I'm using 2x GTX970's with the latest NVidia drivers and CUDA 7.0 RC, and I chose not to install the bundled drivers when installing CUDA as they were older than what I had.
This won't work:
const int N = 100000;
...
calc_acc<<<1, N>>>(...);
apply_acc<<<1, N>>>(...);
The second parameter of a kernel launch config (<<<...>>>) is the threads per block parameter. It is limited to either 512 or 1024 depending on how you are compiling. These kernels will not launch, and the type of error this produces needs to be caught by using correct CUDA error checking. Simply looking at the return values of subsequent CUDA API functions will not indicate the presence of this type of error (which is why you are seeing cudaSuccess subsequently).
Regarding the concept itself, I suggest you learn more about CUDA thread and block hierarchy. To launch a large number of threads, you need to use both parameters (i.e. niether of the first two parameters should be 1) of the kernel launch config. This is usually advisable from a performance perspective as well.

Copying structure containing 2d pointer to device

I have a question-related to copying structure containing 2D pointer to the device from the host, my code is as follow
struct mymatrix
{
matrix m;
int x;
};
size_t pitch;
mymatrix m_h[5];
for(int i=0; i<5;i++){
m_h[i].m = (float**) malloc(4 * sizeof(float*));
for (int idx = 0; idx < 4; ++idx)
{
m_h[i].m[idx] = (float*)malloc(4 * sizeof(float));
}
}
mymatrix *m_hh = (mymatrix*)malloc(5*sizeof(mymatrix));
memcpy(m_hh,m_h,5*sizeof(mymatrix));
for(int i=0 ; i<5 ;i++)
{
cudaMallocPitch((void**)&(m_hh[i].m),&pitch,4*sizeof(float),4);
cudaMemcpy2D(m_hh[i].m, pitch, m_h[i].m, 4*sizeof(float), 4*sizeof(float),4,cudaMemcpyHostToDevice);
}
mymatrix *m_d;
cudaMalloc((void**)&m_d,5*sizeof(mymatrix));
cudaMemcpy(m_d,m_hh,5*sizeof(mymatrix),cudaMemcpyHostToDevice);
distance_calculation_begins<<<1,16>>>(m_d,pitch);
Problem
With this code I am unable to access 2D pointer elements of the structure, but I can access x from that structure in device. e.g. such as I have receive m_d with pointer mymatrix* m if I initialize
m[0].m[0][0] = 5;
and printing this value such as
cuPrintf("The value is %f",m[0].m[0][0]);
in the device, I get no output. Means I am unable to use 2D pointer, but if I try to access
m[0].x = 5;
then I am able to print this. I think my initializations are correct, but I am unable to figure out the problem. Help from anyone will be greatly appreciated.
In addition to the issues that #RobertCrovella noted on your code, also note:
You are only getting a shallow copy of your structure with the memcpy that copies m_h to m_hh.
You are assuming that pitch is the same in all calls to cudaMemcpy2D() (you overwrite the pitch and use only the latest copy at the end). I think that might be safe assumption for now but it could change in the future.
You are using cudaMemcpyHostToDevice() with cudaMemcpyHostToDevice to copy to m_hh, which is on the host, not the device.
Using many small buffers and tables of pointers is not efficient in CUDA. The small allocations and deallocations can end up taking a lot of time. Also, using tables of pointers cause extra memory transactions because the pointers must be retrieved from memory before they can be used as bases for indexing. So, if you consider a construct such as this:
a[10][20][30] = 3
The pointer at a[10] must first be retrieved from memory, causing your warp to be put on hold for a long time (up to around 600 cycles on Fermi). Then, the same thing happens for the second pointer, adding another 600 cycles. In addition, these requests are unlikely to be coalesced causing even more memory transactions.
As Robert mentioned, the solution is to flatten your memory structures. I've included an example for this, which you may be able to use as a basis for your program. As you can see, the code is overall much simpler. The part that does become a bit more complex is the index calculations. Also, this approach assumes that your matrixes are all of the same size.
I have added error checking as well. If you had added error checking in your code, you would have found at least a couple of the bugs without any extra effort.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
typedef float* mymatrix;
const int n_matrixes(5);
const int w(4);
const int h(4);
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void test(mymatrix m_d, size_t pitch_floats)
{
// Print the value at [2][3][4].
printf("%f ", m_d[3 + (2 * h + 4) * pitch_floats]);
}
int main()
{
mymatrix m_h;
gpuErrchk(cudaMallocHost(&m_h, n_matrixes * w * sizeof(float) * h));
// Set the value at [2][3][4].
m_h[2 * (w * h) + 3 + 4 * w] = 5.0f;
// Create a device copy of the matrix.
mymatrix m_d;
size_t pitch;
gpuErrchk(cudaMallocPitch((void**)&m_d, &pitch, w * sizeof(float), n_matrixes * h));
gpuErrchk(cudaMemcpy2D(m_d, pitch, m_h, w * sizeof(float), w * sizeof(float), n_matrixes * h, cudaMemcpyHostToDevice));
test<<<1,1>>>(m_d, pitch / sizeof(float));
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
}
Your matrix m class/struct member appears to be some sort of double pointer based on how you are initializing it on the host:
m_h[i].m = (float**) malloc(4 * sizeof(float*));
Copying an array of structures with embedded pointers between host and device is somewhat compilicated. Copying a data structure that is pointed to by a double pointer is also complicated.
For an array of structures with embedded pointers, refer to this posting.
For copying a 2D array (double pointer, i.e. **), refer to this posting. We don't use cudaMallocPitch/cudaMemcpy2D to accomplish this. (Note that cudaMemcpy2D takes single pointer * arguments, you are passing it double pointer ** arguments e.g. m_h[i].m)
Instead of the above approaches, it's recommended that you flatten your data so that it can all be referenced with single pointer referencing, with no embedded pointers.

How to allocate Array of Pointers and preserve them for multiple kernel calls in cuda

I am trying to implement an algorithm in cuda and I need to allocate an Array of Pointers that point to an Array of Structs. My struct is, lets say:
typedef struct {
float x, y;
} point;
I know that If I want to preserve the arrays for multiple kernel calls I have to control them from the host, is that right? The initialization of the pointers must be done from within the kernel. To be more specific, the Array of Struct P will contain random order of cartesian points while the dev_S_x will be a sorted version as to x coordinate of the points in P.
I have tried with:
__global__ void test( point *dev_P, point **dev_S_x) {
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
dev_P[tid].x = 3.141516;
dev_P[tid].y = 3.141516;
dev_S_x[tid] = &dev_P[tid];
...
}
and:
int main( void ) {
point *P, *dev_P, **S_x, *dev_S_x;
P = (point*) malloc (N * sizeof (point) );
S_x = (point**) malloc (N * sizeof (point*));
// allocate the memory on the GPU
cudaMalloc( (void**) &dev_P, N * sizeof(point) );
cudaMalloc( (void***) &dev_S_x, N * sizeof(point*));
// copy the array P to the GPU
cudaMemcpy( dev_P, P, N * sizeof(point), cudaMemcpyHostToDevice);
cudaMemcpy( dev_S_x,S_x,N * sizeof(point*), cudaMemcpyHostToDevice);
test <<<1, 1 >>>( dev_P, &dev_S_x);
...
return 0;
}
which leads to many
First-chance exception at 0x000007fefcc89e5d (KernelBase.dll) in Test_project_cuda.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0020f920..
Critical error detected c0000374
Am I doing something wrong in the cudamalloc of the array of pointers or is it something else? Is the usage of (void***) correct? I would like to use for example dev_S_x[tid]->x or dev_S_x[tid]->y from within the kernels pointing to device memory addresses. Is that feasible?
Thanks in advance
dev_S_x should be declared as point ** and should be passed to the kernel as a value (i.e. test <<<1, 1 >>>(dev_P, dev_S_x);).
Putting that to one side, what you describe sounds like a natural fit for Thrust, which will give you a simpler memory management strategy and access to fast sort routines.

allocate two arrays calling cudaMalloc once

Memory allocation is one of the most time consuming operations in a GPU so I wanted to allocate 2 arrays by calling cudaMalloc once using the following code:
int numElements = 50000;
size_t size = numElements * sizeof(float);
//declarations-initializations
float *d_M = NULL;
err = cudaMalloc((void **)&d_M, 2*size);
//error checking
// Allocate the device input vector A
float *d_A = d_M;
// Allocate the device input vector B
float *d_B = d_M + size;
err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
//error checking
err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
//error checking
The original code is inside the samples folder of the cuda toolkit named vectorAdd.cu so you can assume h_A, h_B are properly initiated and the code works without the modification I made.
The result was that the second cudaMemcpy returned an error with message invalid argument.
It seems that the operation "d_M + size" does not return what someone would expect as device memory behaves differently but I don't know how.
Is it possible to make my approach (calling cudaMalloc once to allocate memory for two arrays) work? Any comments/answers on whether this is a good approach are also welcome.
UPDATE
As the answers of Robert and dreamcrash suggested I had to add number of elements (numElements) to the pointer d_M not the size which is the number of bytes. Just for reference there was no observable speedup.
You just have to replace
float *d_B = d_M + size;
with
float *d_B = d_M + numElements;
This is pointer arithmetic, if you have an array of floats R = [1.0,1.2,3.3,3.4] you can print its first position by doing printf("%f",*R);.
And the second position? You just do printf("%f\n",*(++R)); thus r[0] + 1. You do not do r[0] + sizeof(float), like you were doing. When you do r[0] + sizeof(float) you will access the element in the position r[4] since size(float) = 4.
When you declare float *d_B = d_M + numElements; the compiler assumes that d_b will be continuously allocated in memory, and each element will have a size of a float. Hence, you do not need to specify the distance in terms of bytes but rather in terms of elements, the compiler will do the math for you. This approach is more human-friendly since it is more intuitive to express the pointer arithmetic in terms of elements than in terms of bytes. Moreover, it is also more portable, since if the number of bytes of a given type changes based on the underneath architecture, the compiler will handle that for you. Consequently, one's code will not break because one assumed a fixed byte size.
You said that "The result was that the second cudaMemcpy returned an error with message invalid argument":
If you print the number corresponding to this error, it will print 11 and if you check the CUDA API you verify that this error corresponds to :
cudaErrorInvalidValue
This indicates that one or more of the parameters passed to the API
call is not within an acceptable range of values.
In your example means that float *d_B = d_M + size; is getting out of the range.
You have allocate space for 100000 floats, d_a will start from 0 to 50000, but according to your code d_b will start from numElements * sizeof(float); 50000 * 4 = 200000, since 200000 > 100000 you are getting invalid argument.