CUDA "Unknown error" for unknown reasons [duplicate] - c++

This question already has an answer here:
Copy an object to device?
(1 answer)
Closed 9 years ago.
In my current project, a call to cudaGetLastError() is returning unknown error and I don't know why. The code compiles just fine, but it is not behaving how I would like it to.
Below is a brief, not compilable example of what the relevant code consists of:
CU_Main.cu
Below is the CUDA kernel:
//My CUDA kernel
__global__ void CU_KernelTest(Kernel* matrix){
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
if(idx == 0 && idy == 0){
printf("ID is: %d\n", idx);
matrix->set(1,1, 16.0f);
}
}
Here is the host code:
//A host function which is called when a button is clicked
int HOST_OnbuttonClick(){
Kernel* matrix = new Kernel(3,3,2);
Kernel* device_matrix;
cudaMalloc(&device_matrix, sizeof(Kernel));
cudaMemcpy(device_matrix, matrix, sizeof(Kernel), cudaMemcpyHostToDevice);
CU_KernelTest<<<256, 256>>>(device_matrix);
cudaDeviceSynchronize();
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
printf("Error: %s\n", cudaGetErrorString(err));
}
cudaFree(device_matrix);
return 0;
}
When matrix->set(1,1, 16.0f); is included in the CUDA kernel, (err != cudaSuccess) evaluates to true and the program prints UNKNOWN ERROR, whereas if I comment set out, I get no error.
The other struct relevant to this is my own helper for a convolution kernel design I'm going for, naturally called Kernel.
Kernel.cuh
struct Kernel {
private :
float* kernel;
int rows;
int columns;
public :
__device__ __host__
Kernel(int _rows, int _columns, float _default) {
rows = _rows;
columns = _columns;
kernel = new float[rows * columns];
for(int r = 0; r < rows; r++){
for(int c = 0; c < columns; c++){
kernel[r * rows + c] = _default;
}
}
}
__device__ __host__
void set(int row, int col, float value){
kernel[row * rows + col] = value;
}
}
The goal of this design is to be able to set all values for the kernel on the host, send it to the CUDA kernel, set values there and then retrieve the updated object back at the host.
So there are really two questions: why do I get an unknown error, and is the code correct in the sense that it should work?
Let me know if more information is needed.
Here are the results of the memory checker:
Nsight Debug
================================================================================
CUDA Memory Checker detected 1 threads caused an access violation:
Launch Parameters
CUcontext = 071c7340
CUstream = 08f3e3b8
CUmodule = 08fa97a8
CUfunction = 08fdbbe8
FunctionName = _Z13CU_KernelTestP6Kernel
gridDim = {1,1,1}
blockDim = {256,1,1}
sharedSize = 128
Parameters:
matrix = 0x06b60000 {kernel = 0x07a31718 ???, rows = 3, columns = 3}
Parameters (raw):
0x06b60000
GPU State:
Address Size Type Mem Block Thread blockIdx threadIdx PC Source
-----------------------------------------------------------------------------------------------
07a31728 4 adr st g 0 0 {0,0,0} {0,0,0} 000260 c:\users
Summary of access violations:
c:\users....kernel.cuh(26): error MemoryChecker: #misaligned=0 #invalidAddress=2

Your Kernel class contains a pointer. When you copy the class to the device, you have a host pointer on the device. Dereferencing that on the device gives you this invalid address access violation.
This seems to be a regular cause of confusion. Robert Crovella explained it just yesterday.
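The difference between copying the struct and copying what it points to can be sketched on the host alone. The block below is a plain C++ analogue, not CUDA code: `KernelView`, `shallow_copy`, and `deep_copy` are hypothetical names, and the `std::vector` stands in for a buffer that would come from `cudaMalloc` in a real program.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the question's Kernel struct.
struct KernelView {
    float* data;
    int rows;
    int columns;
};

// Shallow copy: copies the bytes of the struct, including the pointer value.
// This mirrors what cudaMemcpy(dst, src, sizeof(Kernel), ...) does.
KernelView shallow_copy(const KernelView& src) {
    KernelView dst;
    std::memcpy(&dst, &src, sizeof(KernelView));
    return dst;
}

// Deep copy: copy the elements into separate storage, then point the copy at it.
// With CUDA, `storage` would be a cudaMalloc'd buffer filled by a second cudaMemcpy,
// and the patched struct would be copied to the device afterwards.
KernelView deep_copy(const KernelView& src, std::vector<float>& storage) {
    storage.assign(src.data, src.data + src.rows * src.columns);
    KernelView dst = src;
    dst.data = storage.data();
    return dst;
}
```

After the shallow copy both structs share one buffer, which on a GPU means the device copy holds a host address. After the deep copy the new struct owns separate elements, which is the situation the kernel needs.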

Related

Does a CUDA kernel set the stride for a matrix automatically even if the stride is not initialized in host code? [closed]

I am studying CUDA C, and the source I am using works through the CUDA sample programs, specifically the matrix multiply example.
I am following the code line by line and trying to predict the next step, to be sure I understand it.
While doing this I found the declaration of the Matrix struct, which has a data member stride.
The whole code has not a single line initializing this stride data member.
I used Nsight to debug the device code and the regular Visual Studio debugger to debug the host code, and there was a surprise:
the host code really does not initialize this data member at any point before the program ends successfully,
but Nsight shows that stride is already initialized even before the first line of the kernel.
When I looked at the Autos window of the VS debugger at the call to the kernel, I noticed that the kernel's function name line shows a __cuda_0 matrix with the same structure as the program's Matrix struct, but with stride initialized.
So I do not know when stride was initialized in the device code, or what initialized it.
Thanks a lot.
This is the struct for the matrix:
typedef struct
{ int width;
int height;
float* elements;
int stride;
} Matrix;
This is the main code, which initializes the matrices without stride:
int main(int argc, char* argv[])
{
Matrix A, B, C;
int a1, a2, b1, b2;
a1 = atoi(argv[1]); /* Height of A */
a2 = atoi(argv[2]); /* Width of A */
b1 = a2; /* Height of B */
b2 = atoi(argv[3]); /* Width of B */
A.height = a1;
A.width = a2;
A.elements = (float*)malloc(A.width * A.height * sizeof(float));
B.height = b1;
B.width = b2;
B.elements = (float*)malloc(B.width * B.height * sizeof(float));
C.height = A.height;
C.width = B.width;
C.elements = (float*)malloc(C.width * C.height * sizeof(float));
for(int i = 0; i < A.height; i++)
for(int j = 0; j < A.width; j++)
A.elements[i*A.width + j] = (rand() % 3);//arc4random
for(int i = 0; i < B.height; i++)
for(int j = 0; j < B.width; j++)
B.elements[i*B.width + j] = (rand() % 2);//arc4random
MatMul(A, B, C);
The whole code is in the CUDA C Programming Guide, chapter 3-2-3.
OK, I have got -4 so far, and maybe the purpose of the question is not clear.
In the MatMul host function there are lines that declare and initialize the device copies of the matrices, and it uses A.width to initialize d_A.stride:
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
// Load A and B to device memory
Matrix d_A;
d_A.width = d_A.stride = A.width;
d_A.height = A.height;
size_t size = A.width * A.height * sizeof(float);
cudaMalloc(&d_A.elements, size);
cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
But when you get to:
// Invoke kernel
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
it invokes MatMulKernel, and in this device code, which depends only on device memory, you find these lines:
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
which takes Matrix A as an argument. Here I see the reason for my confusion:
MatMulKernel uses the name A to refer to the d_A matrix passed to it.
So later on, at these lines:
// Get sub-matrix Asub of A
Matrix Asub = GetSubMatrix(A, blockRow, m);
it calls another device function, GetSubMatrix, passing it A, which is really d_A. The GetSubMatrix code then uses A.stride, which is really d_A.stride:
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
Matrix Asub;
Asub.width = BLOCK_SIZE;
Asub.height = BLOCK_SIZE;
Asub.stride = A.stride; // <-- the line in question
Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
+ BLOCK_SIZE * col];
return Asub;
}
So the host code really does not initialize A.stride for the host struct, and there is no hidden mechanism in CUDA that deduces stride for matrix-like structures.
The use of the name A for two different matrices, one in host code and one in device code, led to my confusion.
Problem solved.
The stride member of struct Matrix is not initialized in the host matrix A, but it is initialized in the device copy d_A, and d_A is passed to the kernel and then to GetSubMatrix through a parameter named A, whose stride is therefore defined.
So there are two matrices named A: the one on the host, whose stride is undefined, and the one on the device, whose stride is defined. That is where my misunderstanding came from.
If the parameter of GetSubMatrix were named anything other than A, there would have been no confusion about the stride member.
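To see why a sub-matrix has to carry its parent's stride rather than its own width, here is a small host-only C++ sketch modelled on GetSubMatrix. `MatrixView`, `sub_matrix`, `get`, and the tile size `BLOCK` are illustrative names chosen for this example, not part of the Programming Guide code.

```cpp
#include <cassert>
#include <vector>

constexpr int BLOCK = 2;  // illustrative tile size, stands in for BLOCK_SIZE

// Same fields as the Matrix struct in the question.
struct MatrixView {
    int width;
    int height;
    int stride;       // distance, in elements, between the starts of two rows
    float* elements;
};

// Returns the BLOCK x BLOCK tile at tile coordinates (row, col).
// The tile shares the parent's storage, so it must keep the parent's stride.
MatrixView sub_matrix(MatrixView a, int row, int col) {
    MatrixView sub;
    sub.width = BLOCK;
    sub.height = BLOCK;
    sub.stride = a.stride;
    sub.elements = &a.elements[a.stride * BLOCK * row + BLOCK * col];
    return sub;
}

// Element access goes through stride, not width.
float get(MatrixView m, int r, int c) {
    return m.elements[r * m.stride + c];
}
```

For a full matrix, stride equals width, which is why MatMul sets both from A.width. For a tile, width is BLOCK while stride remains the width of the full matrix.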

Why do I get a seg fault when I try to input a value in OpenCV?

So I have this piece of code:
if(channels == 3)
type = CV_32FC3;
else
type = CV_32FC1;
cv::Mat M(rows,cols,type);
std::cout<<"Cols:"<<cols<<" ColsMat:"<<M.cols<<std::endl;
float * source_data = (float*) M.data;
// copying the data into the corresponding pixel
for (int r = 0; r < rows; r++)
{
float* source_row = source_data + (r * rows * channels);
for (int c = 0; c < cols ; c++)
{
float* source_pixel = source_row + (c * channels);
for (int ch = 0; ch < channels; ch++)
{
std::cout<<"Row:"<<r<<" Col:"<<c<<" Channel:"<<ch<<std::endl;
std::cout<<"Type check: "<<typeid(T_M(0,r,c,ch)).name()<<std::endl;
float* source_value = source_pixel + ch;
*source_value = T_M(0, r, c, ch);
}
}
}
T_M is an Eigen::Tensor
I first thought that I got the error from T_M but it isn't the case.
I tried accessing *source_value and I am mostly sure that is the source of the error.
Funny thing is that I don't get the error in the end or the beginning. I get the seg fault around the middle.
For example, with rows: 915, cols: 793, and channels:1
I get the error at Row:829 Col:729 Channel:0.
I can't figure out the source of this error.
You compute your row pointer wrong; it should use cols instead of rows:
float* source_row = source_data + (r * cols * channels);
In general, you must be very careful when you use a flat representation of a matrix; it's really error-prone.
The answer from Jean-François Fabre will work if the matrix is continuous. If you can't be sure about that (e.g. if the matrix is provided by someone else, if you use submatrices, etc.), you should use the step member (the row stride in bytes) to compute the row pointer:
float* source_row = (float*)(M.data + r*M.step);
This automatically accounts for the number of channels, padding, etc.
Even simpler is to use the row pointer function directly:
float* source_row = (float*)(M.ptr(r));
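The same row-pointer arithmetic can be sketched without OpenCV. In the block below, `row_ptr` and `faulty_row_offset` are hypothetical helpers written for this illustration; `step_bytes` plays the role of `M.step`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pointer to the start of row r, given the row stride in bytes.
// The arithmetic is done on a byte pointer, as with M.data + r * M.step.
float* row_ptr(unsigned char* data, std::size_t step_bytes, int r) {
    return reinterpret_cast<float*>(data + static_cast<std::size_t>(r) * step_bytes);
}

// Element offset produced by the question's formula, r * rows * channels.
std::size_t faulty_row_offset(int r, int rows, int channels) {
    return static_cast<std::size_t>(r) * rows * channels;
}
```

With rows = 915, cols = 793, and channels = 1, the buffer holds 915 * 793 = 725595 floats, and the faulty offset reaches that value at row 793. Every write from that row on is out of bounds. The crash showing up only around row 829 is consistent with out-of-bounds writes faulting once they reach memory the process does not own, though that part is an inference rather than something the thread confirms.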

Why is my CUDA kernel returning old values?

Kind of almost at the point of ripping my hair out over this issue.
I have a CUDA kernel that does some math on data stored in a 3D array. While testing this, I used to assign some values (non-zero) to the array and observe results. I commented out those lines since, but the result is still the same. It is as if it is completely ignoring the fact that I'm doing a memset to 0.
The code works correctly when I step through it in Debug... But not in Release! My guess is I have a memory leak from this matrix.
I allocate this array as:
cudaExtent m_extent = make_cudaExtent(sizeof(float)*matdim.x, matdim.y, matdim.z); // width, height, depth
cudaPitchedPtr m_device;
cudaMalloc3D(&m_device, m_extent);
cudaMemset3D(m_device, 0, m_extent);
I call the kernel in a loop like this:
for (int iter = 0; iter < gpu_iterations; iter++)
{
PF_iteration_kernel<<<grids,threads>>>(m_device, m_extent, matdim);
cudaDeviceSynchronize();
}
After which I release the m_device pitched pointer:
cudaFree(m_device.ptr);
matdim is just matrix dimensions held by a dim3.
Within the kernel I do the following (well, I commented everything functional out...):
__global__ void PF_iteration_kernel(cudaPitchedPtr mPtr, cudaExtent mExt, dim3 matrix_dimensions)
{
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
// Find location within the pitched memory
char *m = (char*)mPtr.ptr;
int sof = sizeof(float);
size_t pitch = mPtr.pitch;
size_t slice_pitch = pitch*mExt.height;
char* m_addroff = m + y * pitch + x * sof;
printf("m(%d,%d) is %f \n", x, y, *m_addroff); // display the slice
*m_addroff = 0; // WILL THIS RESET IT?!
__syncthreads();
}
That should be just showing 0s, but it displays my old values (25, 26, 27, 28, etc).
I have cleaned and re-cleaned and re-built everything several times. I have relaunched the IDE.
My IDE is Visual Studio 2010 With NSight 4.6 (CUDA 7.0).
I am on Windows 7 x64
Consider this
char* m_addroff = m + y * pitch + x * sof;
printf("m(%d,%d) is %f \n", x, y, *m_addroff);
The compiler sees a char argument, promotes it to int, and passes that to printf, not the float promoted to double that the %f format requires.
The compiler does not convert arguments to fit the format specification, although some compilers examine the format string and warn about mismatches.
I suggest you cast the pointer before dereferencing it. I risk guessing and failing, but something like this:
printf("m(%d,%d) is %f \n", x, y, *(float*)m_addroff);
Here is a simple example.
#include <stdio.h>
int main()
{
char car [4] = {0};
char *cptr = car;
printf ("Hello %f\n", *(float*)cptr);
return 0;
}
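The cast above reads a float through a char pointer. A variant that copies the bytes with memcpy does the same job without relying on pointer casts. `load_float` and `store_float` are hypothetical helper names for this sketch; in the kernel, a `float*` derived from the row address would play the same role.

```cpp
#include <cassert>
#include <cstring>

// Read a float stored at a byte address.
float load_float(const char* p) {
    float v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Write all sizeof(float) bytes of a float to a byte address.
void store_float(char* p, float v) {
    std::memcpy(p, &v, sizeof v);
}
```

The write side matters as well: an assignment such as `*m_addroff = 0;` through a `char*` stores a single byte, so it does not clear a 4-byte float. Storing `0.0f` through a float-typed access, as `store_float` does here, resets the whole value.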

Wrong results with CUDA threads writing on private locations in global memory

EDIT 3:
I need each thread to write and read a private location in global memory. Below I post a working code showing my problem. In the following, I'll list the main variables and structures involved.
Variables:
srcArr_h (host) --> srcArr_d (device) : array of random floats in the range [0, COLORLEVELS] with dimensions given by ARRDIM
auxD (device) : array of dimension ARRDIM * ARRDIM holding the final result in device
auxH (host) : array of dimension ARRDIM * ARRDIM holding the final result in host
c_glob_d (device) : array that reserves a private location of COLORLEVELS floats for each thread, with size given by num_threads * COLORLEVELS
idx (device) : identification number of current thread
My problem: in the kernel, I update c_glob[idx] for each value ic (ic ∈ [0, COLORLEVELS]), i.e. c_glob[idx][ic]. I use c_glob[idx][COLORLEVELS] to compute the final result g0 stored in auxD. My problem is that my final results are wrong. Results copied to auxH show that I get numbers at least one order of magnitude bigger than expected, or even weird numbers suggesting that my operation is likely overflowing.
Help: what am I doing wrong? How can I make each thread write and read its own private location in global memory? Right now I'm debugging with ARRDIM = 512, but my goal is to make it work for ARRDIM ~ 10^4 (thus creating a c_glob array for 10^4*10^4 threads). I guess I will have issues with the total number of threads allowed per run, so I was wondering if you could suggest any other solution to my problem.
Thank you.
#include <string>
#include <stdint.h>
#include <iostream>
#include <stdio.h>
#include "cuPrintf.cu"
using namespace std;
#define ARRDIM 512
#define COLORLEVELS 4
__global__ void gpuKernel
(
float *sa, float *aux,
size_t memPitchAux, int w,
float *c_glob
)
{
float sc_loc[COLORLEVELS];
float g0=0.0f;
int tidx = blockIdx.x * blockDim.x + threadIdx.x;
int tidy = blockIdx.y * blockDim.y + threadIdx.y;
int idx = tidy * memPitchAux/4 + tidx;
for(int ic=0; ic<COLORLEVELS; ic++)
{
sc_loc[ic] = ((float)(ic*ic));
}
for(int is=0; is<COLORLEVELS; is++)
{
int ic = fabs(sa[tidy*w +tidx]);
c_glob[tidy * COLORLEVELS + tidx + ic] += 1.0f;
}
for(int ic=0; ic<COLORLEVELS; ic++)
{
g0 += c_glob[tidy * COLORLEVELS + tidx + ic]*sc_loc[ic];
}
aux[idx] = g0;
}
int main(int argc, char* argv[])
{
/*
* array src host and device
*/
int heightSrc = ARRDIM;
int widthSrc = ARRDIM;
cudaSetDevice(0);
float *srcArr_h, *srcArr_d;
size_t nBytesSrcArr = sizeof(float)*heightSrc * widthSrc;
srcArr_h = (float *)malloc(nBytesSrcArr); // Allocate array on host
cudaMalloc((void **) &srcArr_d, nBytesSrcArr); // Allocate array on device
cudaMemset((void*)srcArr_d,0,nBytesSrcArr); // set to zero
int totArrElm = heightSrc*widthSrc;
for(int ic=0; ic<totArrElm; ic++)
{
srcArr_h[ic] = (float)(rand() % COLORLEVELS);
}
cudaMemcpy( srcArr_d, srcArr_h,nBytesSrcArr,cudaMemcpyHostToDevice);
/*
* auxiliary buffer auxD to save final results
*/
float *auxD;
size_t auxDPitch;
cudaMallocPitch((void**)&auxD,&auxDPitch,widthSrc*sizeof(float),heightSrc);
cudaMemset2D(auxD, auxDPitch, 0, widthSrc*sizeof(float), heightSrc);
/*
* auxiliary buffer auxH allocation + initialization on host
*/
size_t auxHPitch;
auxHPitch = widthSrc*sizeof(float);
float *auxH = (float *) malloc(heightSrc*auxHPitch);
/*
* kernel launch specs
*/
int thpb_x = 16;
int thpb_y = 16;
int blpg_x = (int) widthSrc/thpb_x;
int blpg_y = (int) heightSrc/thpb_y;
int num_threads = blpg_x * thpb_x + blpg_y * thpb_y;
/*
* c_glob: array that reserves a private location of COLORLEVELS floats for each thread
*/
int cglob_w = COLORLEVELS;
int cglob_h = num_threads;
float *c_glob_d;
size_t c_globDPitch;
cudaMallocPitch((void**)&c_glob_d,&c_globDPitch,cglob_w*sizeof(float),cglob_h);
cudaMemset2D(c_glob_d, c_globDPitch, 0, cglob_w*sizeof(float), cglob_h);
/*
* kernel launch
*/
dim3 dimBlock(thpb_x,thpb_y, 1);
dim3 dimGrid(blpg_x,blpg_y,1);
gpuKernel<<<dimGrid,dimBlock>>>(srcArr_d,auxD, auxDPitch, widthSrc, c_glob_d);
cudaThreadSynchronize();
cudaMemcpy2D(auxH,auxHPitch,
auxD,auxDPitch,
auxHPitch, heightSrc,
cudaMemcpyDeviceToHost);
cudaThreadSynchronize();
float min = auxH[0];
float max = auxH[0];
float f;
string str;
for(int i=0; i<widthSrc*heightSrc; i++)
{
if(min > auxH[i])
min = auxH[i];
if(max < auxH[i])
max = auxH[i];
}
cudaFree(srcArr_d);
cudaFree(auxD);
cudaFree(c_glob_d);
}
You have shown neither the whole code nor a reduced version that reproduces your problem, so it has not been possible to run tests and verify the possible solution below.
I think you have spotted the source of the problem: multiple threads are trying to write to the same memory locations in parallel. This situation leads to race conditions. For an example, see the fourth slide of the presentation "CUDA C: race conditions, atomics, locks, mutex, and warps".
Race conditions have a brute-force solution: atomic functions. They are described in Section B.12 of the CUDA C Programming Guide. So you can try to fix your problem by changing the line
c[ic] += 1.0f;
to
atomicAdd(&c[ic], 1.0f);
You will pay for this fix with performance: atomic operations serialize the code to avoid race conditions.
I call atomic functions a brute-force solution because, by properly rethinking the implementation, you may find a way to avoid them. But that is not possible to say right now, given how few details you provided.
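If the intent is one private block of COLORLEVELS floats per thread, the index arithmetic decides whether threads overlap. The host-only C++ sketch below compares the index used in the posted kernel with one possible private layout. `question_slot` and `private_slot` are hypothetical names, and the layout ignores the pitch returned by cudaMallocPitch, so it is an assumption rather than a drop-in fix.

```cpp
#include <cassert>
#include <cstddef>

// Index as written in the posted kernel: tidy * COLORLEVELS + tidx + ic.
std::size_t question_slot(int tidx, int tidy, int levels, int ic) {
    return static_cast<std::size_t>(tidy) * levels + tidx + ic;
}

// One possible private layout: a linear thread id times the block size, plus ic.
std::size_t private_slot(int tidx, int tidy, int width, int levels, int ic) {
    std::size_t thread_id = static_cast<std::size_t>(tidy) * width + tidx;
    return thread_id * levels + ic;
}
```

With COLORLEVELS = 4, the posted formula maps thread (4, 0) and thread (0, 1) to the same element, so different threads accumulate into shared locations. A private layout needs width * height * COLORLEVELS floats in total, whereas the posted num_threads adds the two grid extents instead of multiplying them. For ARRDIM = 10^4 that is 10^8 threads times 4 floats times 4 bytes, about 1.6 GB, which is worth checking against the memory available on the device.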

Access vector of pointers to other vectors on a GPU

So this is a follow-up to a question I had. At the moment, in a CPU version of some code, I have many things that look like the following:
for(int i =0;i<N;i++){
dgemm(A[i], B[i],C[i], Size[i][0], Size[i][1], Size[i][2], Size[i][3], 'N','T');
}
where A[i] will be a 2D matrix of some size.
I would like to be able to do this on a GPU using CULA (I'm not just doing multiplies, so I need the linear algebra operations in CULA), so for example:
for(int i =0;i<N;i++){
status = culaDeviceDgemm('T', 'N', Size[i][0], Size[i][0], Size[i][0], alpha, GlobalMat_d[i], Size[i][0], NG_d[i], Size[i][0], beta, GG_d[i], Size[i][0]);
}
But I would like to store my B's on the GPU in advance, at the start of the program, as they don't change, so I need a vector that contains pointers to the set of vectors that make up my B's.
I currently have the following code, which compiles:
double **GlobalFVecs_d;
double **GlobalFPVecs_d;
extern "C" void copyFNFVecs_(double **FNFVecs, int numpulsars, int numcoeff){
cudaError_t err;
GlobalFPVecs_d = (double **)malloc(numpulsars * sizeof(double*));
err = cudaMalloc( (void ***)&GlobalFVecs_d, numpulsars*sizeof(double*) );
checkCudaError(err);
for(int i =0; i < numpulsars;i++){
err = cudaMalloc( (void **) &(GlobalFPVecs_d[i]), numcoeff*numcoeff*sizeof(double) );
checkCudaError(err);
err = cudaMemcpy( GlobalFPVecs_d[i], FNFVecs[i], sizeof(double)*numcoeff*numcoeff, cudaMemcpyHostToDevice );
checkCudaError(err);
}
err = cudaMemcpy( GlobalFVecs_d, GlobalFPVecs_d, sizeof(double*)*numpulsars, cudaMemcpyHostToDevice );
checkCudaError(err);
}
But if I now try to access it with:
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid;//((G + dimBlock.x - 1) / dimBlock.x,(N + dimBlock.y - 1) / dimBlock.y);
dimGrid.x=(numcoeff + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (numcoeff + dimBlock.y - 1)/dimBlock.y;
for(int i =0; i < numpulsars; i++){
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
}
it seg faults here. Is this not how to get at the data?
The kernel function that I'm calling is just:
__global__ void CopyPPFNF(double *FNF_d, double *PPFNF_d, int numpulsars, int numcoeff, int thispulsar) {
// Each thread computes one element of C
// by accumulating results into Cvalue
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int subrow=row-thispulsar*numcoeff;
int subcol=col-thispulsar*numcoeff; // column offset uses col (the original line used row)
__syncthreads();
if(row >= (thispulsar+1)*numcoeff || col >= (thispulsar+1)*numcoeff) return;
if(row < thispulsar*numcoeff || col < thispulsar*numcoeff) return;
FNF_d[row * numpulsars*numcoeff + col] += PPFNF_d[subrow*numcoeff+subcol];
}
What am I not doing right? Note that eventually I would also like to do as in the first example, calling CULA functions on each GlobalFVecs_d[i], but for now not even this works.
Do you think this is the best way to go about doing this? If it were possible to just pass CULA functions a slice of a large contiguous vector I could do that too, but I don't know if it supports that.
Cheers
Lindley
Change this:
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
to this:
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFPVecs_d[i], numpulsars, numcoeff, i);
and I believe it will work.
Your methodology of handling pointers is mostly correct. However, when you put GlobalFVecs_d[i] in the parameter list, you are forcing the kernel setup code (running on the host) to take GlobalFVecs_d (a device pointer, created with cudaMalloc), add an appropriately scaled i to the pointer value, and then dereference the resultant pointer to retrieve the value to pass as a parameter to the kernel. But we are not allowed to dereference device pointers in host code.
However, because your methodology was mostly correct, you have a convenient parallel array of the same pointers that resides on the host. This array (GlobalFPVecs_d) is something that we are allowed to dereference into, in host code, to retrieve the resultant device pointer, to pass to the kernel.
It's an interesting bug because normally kernels do not seg fault (although they may throw an error), so a seg fault on a kernel invocation line is unusual. But in this case, the seg fault is occurring in the kernel setup code, not the kernel itself.
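On the side question about slices of one large contiguous vector: computing an address inside an allocation is ordinary pointer arithmetic and does not dereference anything, so it is allowed on a device pointer in host code. The sketch below shows the arithmetic in plain C++. `matrix_slice` is a hypothetical helper, and whether a given CULA routine accepts such an offset pointer is an assumption to check against its documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Address of matrix i inside one contiguous allocation of n x n matrices.
// Only pointer arithmetic is performed; nothing is dereferenced, so the same
// expression is valid on the host when `base` is a cudaMalloc'd pointer.
double* matrix_slice(double* base, std::size_t i, std::size_t n) {
    return base + i * n * n;
}
```

Under that assumption, a single cudaMalloc of numpulsars * numcoeff * numcoeff doubles could replace the per-matrix allocations, and `matrix_slice(base, i, numcoeff)` would be the value passed to the kernel for matrix i.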