out of stack space error ( stack overflow error) - c++

I'm trying to calculate a matrice multiplication of size N (square matrix) but I'm getting a stack overflow error(I'm new to Cuda ):
if I test the code for N < 300 everything is fine, but if I test it with N> 300 it does not work, and a stack overflow error was displayed but there is enough memory.in my graphics card GF 820M .
if N = 300 then 300 * 300 * 4(size of float) = 360000 byte : necessary space in the device to allocate for an array of type float.and here it must allocate for 3 Table to do multiplication .therefore 360000 * 3 = 1080000 bytes and if I control the CudaMalloc nothing is displayed.
I inform you that my main goal is to test for N large enough.How do I solve that? thank you in advance for any help you might be able to provide.
#include <stdio.h>
#include<device_launch_parameters.h>
#include<cuda.h>
#include<time.h>
#include<cuda_runtime.h>
#include <math.h>
__global__ void MatrixMul( float *Md , float *Nd , float *Pd , const int WIDTH )
{ // calculate thread id
unsigned int row = blockIdx.y*blockDim.y+threadIdx.y;
unsigned int col = blockIdx.x*blockDim.x+threadIdx.x;
for (int k = 0 ; k<WIDTH ; k++ )
{ Pd[row*WIDTH + col]+= Md[row * WIDTH + k ] * Nd[ k * WIDTH + col] ; }}
int main ()
{ const int i=64 ;
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
const int WIDTH =300;
cudaError_t cudaStatus;
float array1_h[WIDTH][WIDTH] ,array2_h[WIDTH][WIDTH] ,M_result_array_h[WIDTH][WIDTH];
float *array1_d , *array2_d ,*M_result_array_d ; // device array
// Allocate GPU buffers for 2 vectors (two input, one output)
cudaStatus = cudaMalloc((void **) &array1_d , WIDTH*WIDTH*sizeof (float));
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!"); }
cudaStatus = cudaMalloc((void **) &array2_d , WIDTH*WIDTH*sizeof (float));
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!"); }
for ( int i = 0 ; i<WIDTH ; i++ ) {
for (int j = 0 ; j<WIDTH ; j++ )
{ array1_h[i][j] = 1 ; array2_h[i][j] = 2 ; }}
//copy host array to device array; cudaMemcpy ( dest , source , WIDTH , direction )
cudaMemcpy ( array1_d , array1_h , WIDTH*WIDTH*sizeof (float) , cudaMemcpyHostToDevice ) ;
cudaMemcpy ( array2_d , array2_h , WIDTH*WIDTH*sizeof (float) , cudaMemcpyHostToDevice ) ;
//allocating memory for resultent device array
cudaStatus = cudaMalloc((void **) &M_result_array_d , WIDTH*WIDTH*sizeof (float) ) ;
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!"); }
//calling kernal
dim3 dimBlock( i,i, 1 ) ;
dim3 dimGrid ( ((WIDTH-1)/i) +1 , ((WIDTH-1)/i)+1 ,1 ) ;
cudaEventRecord(start, 0);
MatrixMul <<<dimGrid,dimBlock>>> ( array1_d , array2_d ,M_result_array_d , WIDTH) ;
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
printf ("taille du probleme:%d Time for the kernel: %f \n",WIDTH,time);
//copy back result_array_d to result_array_h
cudaMemcpy(M_result_array_h , M_result_array_d , WIDTH*WIDTH*sizeof(float) , cudaMemcpyDeviceToHost) ;
//printf the result array
for (int i = 0 ; i<WIDTH ; i++ )
{ for (int j = 0 ; j < WIDTH ; j++ )
{ printf ("%f ",M_result_array_h[i][j] ) ; }
printf ("\n") ; }
cudaFree(array1_d);
cudaFree(array2_d);
cudaFree(M_result_array_h);
system("pause") ; }

The stack overflow problem is not CUDA related. These allocations:
float array1_h[WIDTH][WIDTH] ,array2_h[WIDTH][WIDTH] ,M_result_array_h[WIDTH][WIDTH];
are created by the compiler on the stack. The stack space is limited. (This is the host code, so the stack here has nothing to do with the GPU.)
One possible approach to fix this is to create dynamic allocations for these variables, which will be made on the heap, which doesn't have the same limits as the stack.
So one possible fix is to replace this:
float array1_h[WIDTH][WIDTH] ,array2_h[WIDTH][WIDTH] ,M_result_array_h[WIDTH][WIDTH];
with this:
typedef float ar_type[WIDTH];
ar_type *array1_h, *array2_h, *M_result_array_h;
array1_h = (ar_type *)malloc(WIDTH*WIDTH*sizeof(float));
array2_h = (ar_type *)malloc(WIDTH*WIDTH*sizeof(float));
M_result_array_h = (ar_type *)malloc(WIDTH*WIDTH*sizeof(float));
Also note that this:
const int i=64 ;
...
dim3 dimBlock( i,i, 1 ) ;
is not valid. You are requesting a 64x64 threadblock (4096 threads total) and this is not legal for any CUDA GPU. You can fix this particular issue by changing i to 32.
After fixing that, it seems that your kernel has no thread-check to prevent out-of-bounds threads from executing and generating out-of-bounds accesses. You can fix that by adding this thread-check immediately before the for-loop in your kernel:
if ((row < WIDTH) && (col < WIDTH))
Finally, this line has a typo:
cudaFree(M_result_array_h);
I think you meant:
cudaFree(M_result_array_d);
You can discover these other errors (2-4) if you add proper cuda error checking to your code, and/or run your code with cuda-memcheck.

Use rtContextGetStackSize/rtContextSetStackSize to find out how large your stack is and set it larger if needed.
Keep in mind that the memory on your graphic card is shared with other graphical processes and you can't use all of it.
Furthermore you can partition your matrix and compute a Partitioned Matrix Multiplication with a block-by-block algorithm, instead of the entire Matrix.

Related

2D tiled convolution taking more time than untiled version

Writing a code that perform a 2D convolution on a float matrix, in both tiled and untiled version. I'm assuming the width of the tile as
BLOCK_SIZE - MASK_WIDTH + 1
, using halo cells.
But for a 1024 matrix and masks varing from 3 to 9 I get the untiled version performing better:
untiled version
vs
tiled
Both matrix and mask are defined in a constant manner, equal for tiled and untiled. No random values/sizes used.
I guess I'm doing some wrong assumption about the tile size, but even after doing some research the implementation seems quite legit.
#define MATRIX_SIZE 1024
#define BLOCK_WIDTH 32
Here's the kernel code for the tiled version
__global__ void convolution_2D_tiled(float* in, const float* __restrict__ mask, float* out, size_t mask_width, size_t w, size_t h) {
float outputPixel = 0; //minimize write to global memory: stored in register
int tx = threadIdx.x;
int ty = threadIdx.y;
int tile_width = BLOCK_WIDTH - mask_width + 1; //since BLOCK_WIDTH = TILE_WIDTH + MASK_WIDTH - 1
int col = blockIdx.x * tile_width + tx;
int row = blockIdx.y * tile_width + ty;
//picking the starting indexes of input matrix inside the mask
//(TOP-LEFT of the mask)
int inputRow = row - (mask_width / 2);
int inputCol = col - (mask_width / 2);
__shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH];
// Load tile elements
if (inputRow >= 0 && inputRow < h && inputCol >= 0 && inputCol < w)
tile[ty][tx] = in[inputRow * w + inputCol];
else
tile[ty][tx] = 0.0;
// Wait until all tile elements are loaded
__syncthreads();
//some thread won't write any outputs, only need to calculate tile_width elements
if (col < w && row < h && ty < tile_width && tx < tile_width) {
//get the neighbour in the mask
for (int i = 0; i < mask_width; ++i) {
for (int j = 0; j < mask_width; ++j) { //(Mask_Width^2) access for each thread in block -> for each block (Mask_Width^2) * (Block_width^2)
outputPixel += tile[i + ty][j + tx] * mask[i * mask_width + j];
}
}
out[(row * w) + col] = (float)(outputPixel);
}
}
The main with the matrix generation and sizes assumptions:
void errorCheck(unsigned int line){
cudaError_t cudaError = cudaGetLastError();
// if error code wasn't a code describing success
if (cudaError != cudaSuccess)
{
// output that there has been a CUDA error in the line of the CUDA function call
// and exit the program
printf("CUDA error in line %u in file %s: %s\n", line - 1, __FILE__, cudaGetErrorString(cudaError));
exit(EXIT_FAILURE);
}}
int main(int argc, char const* argv[]){
for (size_t mask_width = 3; mask_width <= 9; mask_width += 2) {
printf("Testing with mask size = %d\n\n", mask_width);
float* a;
float* b;
float* c;
cudaMallocManaged((void **) &a, sizeof(float)*MATRIX_SIZE*MATRIX_SIZE);
cudaMallocManaged((void **) &b, sizeof(int)*mask_width*mask_width);
cudaMallocManaged((void **) &c, sizeof(int)*MATRIX_SIZE*MATRIX_SIZE);
// initialize matrix A
for (int i = 0; i < MATRIX_SIZE; ++i) {
for (int j = 0; j < MATRIX_SIZE; ++j) {
a[i * MATRIX_SIZE + j] = (float)(1 +(3 * j % 20));
}
}
// initialize matrix B
for (int i = 0; i < mask_width; ++i) {
for (int j = 0; j < mask_width; ++j) {
b[i * mask_width + j] = (float)(1 + (((2 * i) + j) % mask_width));
}
}
float naive_gpu_elapsed_time_ms;
// some events to count the execution time
//clock_t st, end;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
int tile_width = BLOCK_WIDTH - mask_width + 1;
dim3 dimGrid(MATRIX_SIZE / tile_width, MATRIX_SIZE / tile_width);
dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH);
errorCheck(__LINE__);
cudaEventRecord(start, 0);
convolution_2D_tiled <<<dimGrid, dimBlock >>> (a, b, c, mask_width, MATRIX_SIZE, MATRIX_SIZE);
errorCheck(__LINE__);
cudaThreadSynchronize();
//time counting terminate
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
//compute time elapsed on GPU computing
cudaEventElapsedTime(&naive_gpu_elapsed_time_ms, start, stop);
printf("Time elapsed on naive GPU convolution 2d tiled ( %d ) block %f ms.\n\n", BLOCK_WIDTH, naive_gpu_elapsed_time_ms);
//free memory
cudaFree(a);
cudaFree(b);
cudaFree(c);
printf("________________________________________________________________________\n\n");
}
return 0;
}
I'm using google colab with Tesla T4 GPU, and no CUDA error is thrown.
Also tried to use bigger masks (11, 15 ..) but no changes in comparison between tiled and untiled.
You are making inefficient usage of managed memory as discussed here and here.
Nearly all of your ~2ms of execution time is used in inefficient demand-paged copying of data from host to device. As a result, your ability to resolve the difference in performance in the two cases due to the device code changes is almost completely obscured.
If you add these 3 lines of code immediately before float naive_gpu_elapsed_time_ms;, you will observe that your reported execution times decrease dramatically, and you should be able to better judge the performance difference between the shared memory tiled version and the non-tiled version:
cudaMemPrefetchAsync(a, sizeof(float)*MATRIX_SIZE*MATRIX_SIZE, 0);
cudaMemPrefetchAsync(b, sizeof(int)*mask_width*mask_width, 0);
cudaMemPrefetchAsync(c, sizeof(int)*MATRIX_SIZE*MATRIX_SIZE, 0);
You haven't shown your non-tiled code, so I can't demonstrate that for you. Here's an example profiling output using a non-tiled convolution code that I wrote, comparing to your tiled kernel, and including the cudaMemPrefetchAsync() statements:
$ nvprof ./t2140
Testing with mask size = 3
==13236== NVPROF is profiling process 13236, command: ./t2140
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 0.032832 ms.
________________________________________________________________________
Testing with mask size = 5
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 0.061120 ms.
________________________________________________________________________
Testing with mask size = 7
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 0.086080 ms.
________________________________________________________________________
Testing with mask size = 9
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 0.118688 ms.
________________________________________________________________________
==13236== Profiling application: ./t2140
==13236== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 52.59% 311.69us 4 77.922us 41.089us 119.08us convolution_2D(float*, float const *, float*, unsigned long, unsigned long, unsigned long)
47.41% 280.97us 4 70.241us 28.449us 114.28us convolution_2D_tiled(float*, float const *, float*, unsigned long, unsigned long, unsigned long)
API calls: 96.10% 365.32ms 12 30.443ms 12.906us 365.10ms cudaMallocManaged
1.32% 5.0301ms 4 1.2575ms 586.91us 3.2433ms cuDeviceTotalMem
0.66% 2.4917ms 404 6.1670us 320ns 268.82us cuDeviceGetAttribute
0.56% 2.1277ms 12 177.31us 8.3020us 578.90us cudaMemPrefetchAsync
0.50% 1.9035ms 4 475.88us 295.08us 549.01us cudaDeviceSynchronize
0.49% 1.8594ms 12 154.95us 75.533us 328.85us cudaFree
0.14% 526.53us 4 131.63us 42.014us 220.14us cudaEventSynchronize
0.11% 399.28us 4 99.820us 61.310us 210.74us cuDeviceGetName
0.09% 351.52us 8 43.940us 11.426us 116.52us cudaLaunchKernel
0.01% 45.911us 8 5.7380us 4.1870us 10.243us cudaEventRecord
0.01% 25.946us 8 3.2430us 935ns 10.182us cudaEventCreate
0.01% 21.643us 4 5.4100us 3.1450us 8.6700us cuDeviceGetPCIBusId
0.00% 10.304us 8 1.2880us 430ns 5.0980us cuDeviceGet
0.00% 9.6790us 4 2.4190us 1.9560us 3.7180us cudaEventElapsedTime
0.00% 3.3390us 3 1.1130us 617ns 1.6520us cuDeviceGetCount
0.00% 3.2480us 4 812ns 700ns 1.0470us cuDeviceGetUuid
0.00% 3.1420us 8 392ns 229ns 1.2110us cudaGetLastError
==13236== Unified Memory profiling result:
Device "Tesla V100-PCIE-32GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
12 1.3346MB 4.0000KB 2.0000MB 16.01563MB 1.405760ms Host To Device
Total CPU Page faults: 52
$
You can see that in each case, the tiled/shared memory kernel is faster.

How to fix my block and grid layout to handle large data?

I'm trying to find the closest pair naive algorithm which has 3D coord.
Input is two files which contains 3 floats in one line.
I handled inputs with float3* type variable.
float3* teamA;
float3* teamB;
float3* results;
handleFileInput(argv[1], argv[2], teamA, teamB, numPoints);
results = new float3[numPoints[0]];
after this, I allocated and copied host data to device like this
#define CHECKERROR(val) { if (val != cudaSuccess) {fprintf(stderr, "Error %s at line %d in file %s\n", cudaGetErrorString(val), __LINE__, __FILE__); exit(1);} }
CHECKERROR(cudaMalloc(&d_tA, sizeof(float3) * numPoints[0]));
CHECKERROR(cudaMemset(d_tA, 0, sizeof(float3) * numPoints[0]));
CHECKERROR(cudaMalloc(&d_tB, sizeof(float3) * numPoints[1]));
CHECKERROR(cudaMemset(d_tB, 0, sizeof(float3) * numPoints[1]));
CHECKERROR(cudaMalloc(&d_results, sizeof(float3) * numPoints[0]));
CHECKERROR(cudaMemset(d_results, 0, sizeof(float3) * numPoints[0]));
CHECKERROR(cudaMemcpy(d_tA, teamA, sizeof(float3) * numPoints[0], cudaMemcpyHostToDevice));
CHECKERROR(cudaMemcpy(d_tB, teamB, sizeof(float3) * numPoints[1], cudaMemcpyHostToDevice));
I set my block, grid like this.
dim3 block(512);
dim3 grid(ceil((float)numPoints[0] / 512);
naive_algorithm <<< block, grid >>> (d_tA, d_tB, d_results, numPoints[0], numPoints[1]);
My kernel code is simple like this
__global__ void naive_algorithm(float3* d_tA, float3* d_tB, float3* d_r, int a_size, int b_size)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < a_size)
{
float min_distance = -1;
for (int y = 0; y < b_size; y++)
{
float i = MUL(SUB(d_tA[idx].x, d_tB[y].x), SUB(d_tA[idx].x, d_tB[y].x));
float j = MUL(SUB(d_tA[idx].y, d_tB[y].y), SUB(d_tA[idx].y, d_tB[y].y));
float k = MUL(SUB(d_tA[idx].z, d_tB[y].z), SUB(d_tA[idx].z, d_tB[y].z));
float distance = SQRT(ADD(ADD(i, j), k));
if (min_distance > distance || min_distance == -1)
{
d_r[idx].x = (float)idx;
d_r[idx].y = (float)y;
d_r[idx].z = distance;
min_distance = distance;
}
}
__syncthreads();
}
}
Environment : RTX 2080Ti
There are five different of data samples :
Team A - 1000000 points / Team B - 500000 points -> Test Failed
Team A - 700000 points / Team B - 500000 points -> Test Failed
Team A - 500000 points / Team B - 300000 points -> Test OK!
Team A - 500000 points / Team B - 100000 points -> Test OK!
Team A - 300000 points / Team B - 100000 points -> Test OK!
In my opinion this caused from thread layout.
Do I have to change the block / grid layout 1D by 1D -> 2D by 2D?
Then how should I set my grid layout?
Like Robert Crovella said this was just typo error my mistake.
Because one block can handle to 1024 and grid's one dimension can handle to 65535,
if the grid's x dimension : numPoints[0] / BLOCK_SIZE is bigger than 1024 it doesn't works.
Thanks a lot to check my code!

Why is my CUDA code not working properly for zero filling a large matrix?

It is a simple CUDA code for initializing a big matrix (filling in zeros).
I output the first 1*3 matrix, if the code works. It should be all zeros.
If I set the matrix size to be small, then the program works properly. But when I make the size larger (> 43200 * 2400), what is inside the matrix are all garbage.
I had cudaDeviceSynchronize() append at the end of each CUDA functions already.
I am using NVIDIA Quadro K4200, Xeon E5-2630 with Ubuntu 14.04.
Thanks for anyone helping me here.
Attached below is my full code.
#include <stdio.h>
#include <math.h>
#include <iostream>
#include <cuComplex.h>
#define BLOCK_SIZE 16 // change it to 16 to get maximum performance
// populate the matrix using first row
__global__ void RepmatKernel (cuComplex *Mat, const unsigned int N, const unsigned int Cols)
{
unsigned int i = (unsigned int)blockIdx.x * (unsigned int)blockDim.x + (unsigned int)threadIdx.x;
if (i < N)
{
Mat[i].x = 0;
Mat[i].y = 0;
}
}
// main routine
int main ()
{
const unsigned int Rows = 43200;
const unsigned int Cols = 2400;
const unsigned int Num_thrd = 256; // max threads per block
unsigned int Mat_size = Rows * Cols; // size of array
cuComplex *vec; // supposedly the input
cuComplex *mat_debug; // for debug
vec = new cuComplex [Cols];
mat_debug = new cuComplex [Rows*Cols];
cuComplex *mat_in_d; // device array
//input in host array
for(unsigned int i = 0; i < Cols; i++)
{
vec[i].x = 3*i+4;
vec[i].y = 0.2*i+1;
}
const unsigned int size_mat_d = Rows * Cols * sizeof(cuComplex);
//create device array cudaMalloc ( (void **)&array_name, sizeofmatrixinbytes) ;
if (cudaMalloc((void **) &mat_in_d , size_mat_d) != cudaSuccess) std::cout<<"Error allocating GPU";
cudaDeviceSynchronize() ;
//copy host array to device array; cudaMemcpy ( dest , source , WIDTH , direction )
cudaMemcpy ( mat_in_d , vec , Cols , cudaMemcpyHostToDevice ) ;
cudaDeviceSynchronize() ;
// ========================================================================
cudaMemcpy(mat_debug , mat_in_d , size_mat_d , cudaMemcpyDeviceToHost) ;
cudaDeviceSynchronize() ;
std::cout<<"before repmat="<<std::endl;
std::cout<<"[";
for(unsigned int i = 0; i < 3; i++)
{
std::cout<< mat_debug[i * Cols].x <<"+"<<mat_debug[i * Cols].y <<"i, ";
std::cout<<";"<<std::endl;
}
std::cout<<"]"<<std::endl;
// ==========================================================================
RepmatKernel<<<(unsigned int)ceil((float)(Mat_size)/(float)(Num_thrd)),
(Num_thrd)>>>(mat_in_d,
Mat_size,
Cols);
cudaDeviceSynchronize();
// ========================================================================
cudaMemcpy(mat_debug , mat_in_d , size_mat_d , cudaMemcpyDeviceToHost) ;
cudaDeviceSynchronize() ;
std::cout<<"after repmat="<<std::endl;
std::cout<<"[";
for(unsigned int i = 0; i < 3; i++)
{
std::cout<< mat_debug[i * Cols].x <<"+"<<mat_debug[i * Cols].y <<"i, ";
std::cout<<";"<<std::endl;
}
std::cout<<"]"<<std::endl;
// ==========================================================================
cudaFree(mat_in_d);
delete [] vec;
delete [] mat_debug;
return 0;
}
Your call to cudaMalloc states that there is a problem, but doesn't actually terminate the computation. You should put a
if (cudaMalloc((void **) &mat_in_d , size_mat_d) != cudaSuccess)
{
std::cout<<"Error allocating GPU\n";
return 1;
}
so that the computation actually stops when you overflow the memory, rather than attempt to work anyway with only a warning to std::cout. Even better would be to use an error handling macro.
Another problem is here:
cudaMemcpy ( mat_in_d , vec , Cols , cudaMemcpyHostToDevice );
First, mat_in_d is size Rows * Cols * sizeof(cuComplex), but you are only copying Cols bytes into it. Even if you only wanted to copy vec into the first part of the mat_in_d vector, you'd need to change this to
cudaMemcpy ( mat_in_d , vec , Cols*sizeof(cuComplex) , cudaMemcpyHostToDevice );
At this point, you'd expect the first Cols entries of you matrix to be reasonable, at the rest to be garbage. (Making the suggested change shows that this is indeed the case; why you would want to do this is a better question).
Next comes your kernel call, whose entire goal is to set the entries of Mat to zero. This should be done with cudaMemset, i.e., just use
cudaMemset(mat_in_d, 0, Mat_size*sizeof(cuComplex));
We could look more carefully at the execution configuration to see what went wrong with your kernel call, but for now this fixes your problem.
For debugging CUDA errors; I find a header from samples, helper_cuda.h, quite convenient. I almost always include this header, which is located in the common directory of samples, in my projects.
Then, wrapping all CUDA calls with checkCudaErrors(), like checkCudaErrors(cudaMalloc((void **) &mat_in_d , size_mat_d)); gives explicit error messages.
In my case, since just mat_in_d is close to 1 GB and my GPU's memory is only 512 MB, it failed for sure and threw cudaErrorMemoryAllocation. However, an NVIDIA Quadro K4200 should not fail that easily!
Did you check the actual available memory information using cudaMemGetInfo ?

CUDA Constant Memory Error

I am trying to do a sample code with constant memory with CUDA 5.5. I have 2 constant arrays of size 3000 each. I have another global array X of size N.
I want to compute
Y[tid] = X[tid]*A[tid%3000] + B[tid%3000]
Here is the code.
#include <iostream>
#include <stdio.h>
using namespace std;
#include <cuda.h>
__device__ __constant__ int A[3000];
__device__ __constant__ int B[3000];
__global__ void kernel( int *dc_A, int *dc_B, int *X, int *out, int N)
{
int tid = threadIdx.x + blockIdx.x*blockDim.x;
if( tid<N )
{
out[tid] = dc_A[tid%3000]*X[tid] + dc_B[tid%3000];
}
}
int main()
{
int N=100000;
// set affine constants on host
int *h_A, *h_B ; //host vectors
h_A = (int*) malloc( 3000*sizeof(int) );
h_B = (int*) malloc( 3000*sizeof(int) );
for( int i=0 ; i<3000 ; i++ )
{
h_A[i] = (int) (drand48() * 10);
h_B[i] = (int) (drand48() * 10);
}
//set X and Y on host
int * h_X = (int*) malloc( N*sizeof(int) );
int * h_out = (int *) malloc( N*sizeof(int) );
//set the vector
for( int i=0 ; i<N ; i++ )
{
h_X[i] = i;
h_out[i] = 0;
}
// copy, A,B,X,Y to device
int * d_X, *d_out;
cudaMemcpyToSymbol( A, h_A, 3000 * sizeof(int) ) ;
cudaMemcpyToSymbol( B, h_B, 3000 * sizeof(int) ) ;
cudaMalloc( (void**)&d_X, N*sizeof(int) ) );
cudaMemcpy( d_X, h_X, N*sizeof(int), cudaMemcpyHostToDevice ) ;
cudaMalloc( (void**)&d_out, N*sizeof(int) ) ;
//call kernel for vector addition
kernel<<< (N+1024)/1024,1024 >>>(A,B, d_X, d_out, N);
cudaPeekAtLastError() ;
cudaDeviceSynchronize() ;
// D --> H
cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost ) ;
free(h_A);
free(h_B);
return 0;
}
I am trying to run the debugger over this code to analyze. Turns out that on the line which copies to constant memory I get the following error with debugger
Coalescing of the CUDA commands output is off.
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5c5b700 (LWP 31200)]
Can somebody please help me out with constant memory
There are several problems here. It is probably easier to start by showing the "correct" way to use those two constant arrays, then explain why what you did doesn't work. So the kernel should look like this:
__global__ void kernel(int *X, int *out, int N)
{
int tid = threadIdx.x + blockIdx.x*blockDim.x;
if( tid<N )
{
out[tid] = A[tid%3000]*X[tid] + B[tid%3000];
}
}
ie. don't try passing A and B to the kernel. The reasons are as follows:
Somewhat confusingly, A and B in host code are not valid device memory addresses. They are host symbols which provide hooks into a runtime device symbol lookup. It is illegal to pass them to a kernel- If you want their device memory address, you must use cudaGetSymbolAddress to retrieve it at runtime.
Even if you did call cudaGetSymbolAddress and retrieve the symbols device addresses in constant memory, you shouldn't pass them to a kernel as an argument, because doing do would not yield uniform memory access in the running kernel. Correct use of constant memory requires the compiler to emit special PTX instructions, and the compiler will only do that when it knows that a particular global memory location is in constant memory. If you pass a constant memory address by value as an argument, the __constant__ property is lost and the compiler can't know to produce the correct load instructions
Once you get this working, you will find it is terribly slow and if you profile it you will find that there is very high degrees of instruction replay and serialization. The whole idea of using constant memory is that you can exploit a constant cache broadcast mechanism in cases when every thread in a warp accesses the same value in constant memory. Your example is the complete opposite of that - every thread is accessing a different value. Regular global memory will be faster in such a use case. Also be aware that the performance of the modulo operator on current GPUs is poor, and you should avoid it wherever possible.

Different number of threads, different answers [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
So I have some neural network simulator code that works correctly on the CPU, and the parallel version agrees with the serial version to at least 6 decimal places with a 32-thread single block on both of my CUDA under Win7 PCs, but with 1 block and 64 threads slightly different values for Wt are generated. Wt values are often no more than 3 decimal places in agreement, and when I attempt to eliminate race conditions by embedding __syncthreads() within the loops, the Wt values appear as Not A Number when copied back to the CPU.
Can someone give me a hint what I might be doing wrong? I've included the parallelized code below, and knlBackProp is being called with lSampleQtyReq=10000, o=1, and Option='R':
// device-global variables to facilitate data transfer
__device__ __constant__ __align__(8) struct rohanContext devSes;
__device__ __constant__ struct rohanLearningSet devLearn;
__device__ __align__(16) struct rohanNetwork devNet;
__device__ double devdReturn[1024*1024];
__device__ double devdRMSE=0;
__device__ int devlReturn[1024*1024];
__device__ int devlTrainable=0;
extern"C"
int knlBackProp(struct rohanContext& rSes, long lSampleQtyReq, long o, char Option)
{mIDfunc /*! divides error in yielded values and back-propagates corrections among weights */
// Option S - single sample correction only
// Option E - keep existing weights, count trainable samples only
// Option R - perform corrections for all trainable samples
int lTotal=0;
cudaMemcpyToSymbol( "devlTrainable", &lTotal, sizeof(int) ); // init return value on both sides
mCheckCudaWorked
cudaEvent_t start, stop;
cudaEventCreate( &start);
cudaEventCreate( &stop);
cudaEventRecord( start, 0);
mtkBackPropMT<<< rSes.iBpropBlocks , rSes.iBpropThreads >>>( lSampleQtyReq, o, Option);
cudaEventRecord( stop, 0);
mCheckCudaWorked
cudaMemcpyFromSymbol( &lTotal, "devlTrainable", sizeof(long) ); // retrieve return value
mCheckCudaWorked
cudaEventSynchronize( stop);
float elapsedTime;
cudaEventElapsedTime( &elapsedTime, start, stop);
conPrintf("DEVICE: Time to complete BackProp kernel: %3.1f ms\n", elapsedTime);
cudaEventDestroy( start);
cudaEventDestroy( stop);
return lTotal;
}
__global__ __device__ void mtkBackPropMT( long lSampleQtyReq, long o, char Option)
{/*! divides error in yielded values and back-propagates corrections among weights */
// Option S - single sample correction only
// Option E - keep existing weights, count trainable samples only
// Option R - perform corrections for all trainable samples
if(Option=='E' || Option=='e'){ //
devlTrainable=0; // reset global mem trainable counter
subkBackPropEoptMT(lSampleQtyReq, o);
}
if(Option=='S' || Option=='s'){
devlTrainable=0; // reset global mem trainable counter
subkBackPropSoptMT(lSampleQtyReq, false, devNet, devNet.Signals, devNet.Zs, devNet.Wt, devNet.Deltas, devLearn.gpuXInputs, devLearn.gpuYEval, devLearn.gpudYEval);
}
if(Option=='R' || Option=='r'){ //
devlTrainable=0; // reset global mem trainable counter
subkBackPropRoptMT(lSampleQtyReq, o);
}
}
__device__ void subkBackPropRoptMT(long lSampleQtyReq, long o)
{/*! flags and counts samples meeting */
long OUTROWLEN=devLearn.iOutputQty+1; // prepare array index and width
//long tIx = threadIdx.x + devSes.iEvalThreads * blockIdx.x; // tIx is thread index over the kernel
long tIx = threadIdx.x + blockDim.x * blockIdx.x; // tIx is thread index over the kernel
//long lTotalThreads = devSes.iBpropThreads * devSes.iBpropBlocks; // total number of threads
double maxSquared = devSes.dMAX * devSes.dMAX ; //needed to compart to stored delta squared values
devlTrainable=0; // clear global mem accumulator; out of bound samples will remain at this value
for (long s=0; s<lSampleQtyReq; ++s){ // iterate over samples
if( devLearn.gpudSE1024[IDX2C( o, s, OUTROWLEN )] > maxSquared ){ // if the MAX criterion is exceeded
if(tIx==0)++devlTrainable; // increment the counter
subkBackPropSoptMT( s, true, devNet, devNet.Signals, devNet.Zs, devNet.Wt, devNet.Deltas, devLearn.gpuXInputs, devLearn.gpuYEval, devLearn.gpudYEval);
}
}
}
__device__ void subkBackPropSoptMT(long s, int o, rohanNetwork& Net, cuDoubleComplex * Signals, cuDoubleComplex * Zs, cuDoubleComplex * Wt, cuDoubleComplex * Deltas, cuDoubleComplex * XInputs, cuDoubleComplex * YEval, double * dYEval )
{/*! propagates adjustment of weights backwards preceeding layers from the chosen network output. */
// s is sample's index
// o is an optional method selection parameter; print/don't print as of 2/29/12
long index, kindex; // for warpwise loops
long tIx = threadIdx.x + blockDim.x * blockIdx.x; // tIx is thread index over the kernel
long lTotalThreads = gridDim.x * blockDim.x; // total number of threads
const cuDoubleComplex cdcZero = { 0, 0 };
/* clear all temp values BP0 */
for (long offset=0; (index =offset+tIx)< MAXNEURONS ; offset+=lTotalThreads){ // index stands for i
Deltas[index]=cdcZero;
Signals[index]=cdcZero;
Zs[index]=cdcZero;
}
/* re-evaluate sample to load temp values. BPI */
subkEvalSampleBetaMT( devSes, s, Net, (s==0), Signals, Zs, Wt, XInputs, YEval, dYEval);
/* begin error calculation. BPII */
cuDoubleComplex Deltastar /* measured error at the chosen network output. */ ;
/* calc top layer deltas. */
long TOP=Net.iLayerQty-1;
int ROWLEN=Net.iNeuronQTY[TOP];
//for(int i=0; i<Net.iNeuronQTY[TOP]; ++i){
for (long offset=0; (index =offset+tIx)< Net.iNeuronQTY[TOP] ; offset+=lTotalThreads){ // index stands for i
// delta-star = D - Y = Desired output minus actual output from evaluation
// D is the cplx coords of the sector of the desired answer Y is the complex result of evaluation of the given sample, unactivated. */
Deltastar = CxSubtractCxUT(
devLearn.gpuDOutputs[ IDX2C( index, s, ROWLEN ) ],
Signals[Net.iNeuronOfst[TOP]+index] );
/* divide the correction; delta = alpha * delta-star / n+1 (but alpha is always 1 for now). */
//Deltas[Net.iNeuronOfst[TOP]+index] = CxDivideRlUT( Deltastar, Net.iDendrtQTY[TOP] );
Deltas[Net.iNeuronOfst[TOP]+index] = CxMultiplyRlUT( Deltastar, Net.dINV_S[TOP] );
}
__syncthreads();
/* Now distribute the correction to lower layers if any. BPII.1 */
if (Net.iLayerQty>2){ /* remember layer 0 = inputs, layer 1 = bottom row, layer {2..iLayerQty-2} = middle row, layer iLayerQty-1 = top row. */
for (int L=Net.iLayerQty-1; L>1; --L){
long LAY = L; /* setup access to layers. */
long TRIB = L-1; /* trib for tributary.*/
int iTributQTY=Net.iNeuronQTY[TRIB];
//int Sj=Net.iDendrtQTY[TRIB]; if (TRIB==1) Sj=1; // Sj=1 for firest hidden layer
for (int i=1; i<Net.iNeuronQTY[LAY]; ++i) { // skip 0th neuron as its weights are either 1 (div identity) or 0 (div forbidden) and don't change anyway
// k index must begin at 1, neuron zero not valid for correction
//for (int k=1; k<iTributQTY; ++k) { /* the contribution to ith neuron's kth tributary's delta = i's delta/i's weight k. */
for (long offset=1; ( kindex =offset+tIx)< iTributQTY ; offset+=lTotalThreads){ // kindex stands for k
Deltas[Net.iNeuronOfst[TRIB]+kindex]
= CxAddCxUT ( Deltas[Net.iNeuronOfst[TRIB]+kindex] ,
CxDivideCxUT(
Deltas[Net.iNeuronOfst[LAY]+i] ,
Wt[IDX2C( Net.iWeightOfst[LAY]+kindex, i, iTributQTY )] ));
}
}
for (long offset=1; ( kindex =offset+tIx)< iTributQTY ; offset+=lTotalThreads){ // kindex stands for k
//cuDoubleComplex preDiv=Deltas[Net.iNeuronOfst[TRIB]+kindex]; // diagnostic purpose only, remove if removing other diags
//Deltas[Net.iNeuronOfst[TRIB]+kindex]
// = CxDivideRlUT(
// Deltas[Net.iNeuronOfst[TRIB]+kindex] ,
// Sj );
Deltas[Net.iNeuronOfst[TRIB]+kindex]
= CxMultiplyRlUT(
Deltas[Net.iNeuronOfst[TRIB]+kindex] ,
Net.dINV_S[TRIB] );
}
}
}
__syncthreads();
/* error distribution completed */
/* and now update the weights BP III */
/* adj weights on first hidden layer. */
int FHID = 1;
int SIG = 0;
int iSignalQTY=Net.iNeuronQTY[SIG]; //rSes.rLearn->iInputQty+1;
int iHidWidth=Net.iNeuronQTY[FHID];
for (int k=1; k<iHidWidth; ++k){
//for (int i=0; i<iSignalQTY; ++i){
for (long offset=0; ( index =offset+tIx)< iSignalQTY ; offset+=lTotalThreads){ // index stands for i
/* dW=d*xbar/s1/|z|= neuron's delta * input's conjugate / ( dendrites+1 * abs of input i ). */
Wt[IDX2C( Net.iWeightOfst[FHID]+index, k, iSignalQTY )]
=CxAddCxUT( Wt[IDX2C( Net.iWeightOfst[FHID]+index, k, iSignalQTY )] ,
CxDivideRlUT(
CxMultiplyCxUT(
Deltas[Net.iNeuronOfst[FHID]+k] ,
CxConjugateUT( Signals[Net.iNeuronOfst[SIG]+index] )
) ,
CxAbsUT( Zs[Net.iNeuronOfst[FHID]+k] ) // N+1 denominator factor is considered redundant - JAW & IA 2/27/12
)
);
}
}
__syncthreads();
/* re-evaluate sample to update temp values. */
subkEvalSampleBetaMT( devSes, s, Net, false, Signals, Zs, Wt, XInputs, YEval, dYEval);
if (Net.iLayerQty>2){
/* now use those outputs' conjugates and the deltas to adjust middle layers. BP III.1 */
for (int L=2; L<Net.iLayerQty-1; ++L){
/* setup access to layers. */
long LAY = L;
long TRIB = L-1;
//int iLayWidth=Net.iNeuronQTY[LAY];
int iTribWidth=Net.iNeuronQTY[TRIB];
for (int k=1; k<Net.iNeuronQTY[LAY]; ++k){
//for (int i=0; i<Net.iNeuronQTY[TRIB]; ++i){
for (long offset=0; ( index =offset+tIx)< Net.iNeuronQTY[TRIB] ; offset+=lTotalThreads){ // index stands for i
/* the adjustment added to kth neuron's ith trib's weight = k's delta * complex conjugate of i's signal / (abs of k's previous-wt product-sum * dendrites+1) . */
Wt[IDX2C( Net.iWeightOfst[LAY]+index, k, iTribWidth )]
=CxAddCxUT( Wt[IDX2C( Net.iWeightOfst[LAY]+index, k, iTribWidth )] ,
CxDivideRlUT(
CxMultiplyCxUT(
Deltas[Net.iNeuronOfst[LAY]+k] ,
CxConjugateUT( Signals[Net.iNeuronOfst[TRIB]+index] )
) ,
(
CxAbsUT( Zs[Net.iNeuronOfst[LAY]+k] ) // N+1 denominator factor is considered redundant - JAW & IA 2/27/12
)
)
);
}
}
/* layer is complete. */
subkEvalSampleBetaMT( devSes, s, Net, true, Signals, Zs, Wt, XInputs, YEval, dYEval);
}
}
__syncthreads();
/* correct output layer BP III.3 */
long SUB = TOP-1;
//int iTopWidth=Net.iNeuronQTY[TOP];
int iSubWidth=Net.iNeuronQTY[SUB];
for (int k=1; k<Net.iNeuronQTY[TOP]; ++k){
//for (int i=0; i<Net.iNeuronQTY[SUB]; ++i){
for (long offset=0; ( index =offset+tIx)< Net.iNeuronQTY[SUB] ; offset+=lTotalThreads){ // index stands for i
/* For last layer only, adjustment to kth neuron's ith weight = k's delta * complex conjugate of i's signal / ( dendrites+1) . */
Wt[IDX2C( Net.iWeightOfst[TOP]+index, k, iSubWidth )]
=CxAddCxUT( Wt[IDX2C( Net.iWeightOfst[TOP]+index, k, iSubWidth )] ,
CxMultiplyCxUT(
Deltas[Net.iNeuronOfst[TOP]+k] ,
CxConjugateUT( Signals[Net.iNeuronOfst[SUB]+index] )
)
); // N+1 denominator factor is considered redundant - JAW & IA 2/27/12
}
}
/* backprop is complete. */
}
__device__ void subkEvalSampleBetaMT(rohanContext& Ses, long s, rohanNetwork& Net, int o, cuDoubleComplex * Signals, cuDoubleComplex * Zs, cuDoubleComplex * Wt, cuDoubleComplex * XInputs, cuDoubleComplex * YEval, double * dYEval )
{// Beta uses fixed length fields instead of nested pointer layers
// delta squared is not updated, since they'll be updated when RMSE is checked at the end of a pass through the learning set
long index, kindex; // for warpwise loops
long tIx = threadIdx.x + blockDim.x * blockIdx.x; // tIx is thread index over the kernel
long lTotalThreads = gridDim.x * blockDim.x; // total number of threads
const cuDoubleComplex cdcZero = { 0, 0 };
/*! layer zero (inputs) is special. */
long INROWLEN=Net.iNeuronQTY[0];//rSes.rLearn->iInputQty+1;
//for (int i=0; i<INROWLEN; ++i){
for (long offset=0; (index =offset+tIx)< INROWLEN ; offset+=lTotalThreads){ // index stands for i
Signals[Net.iNeuronOfst[0]+index]= XInputs[IDX2C( index, s, INROWLEN )];
}
/*! middle and top layers. */
for (int L=1; L<Net.iLayerQty; ++L){
//struct rohanLayer& lay = Net.rLayer[L];
long LAY=L;
int TRIB=L-1; // index of previous layer
int iNeuronQTY=Net.iNeuronQTY[LAY];
int iSignalQTY=Net.iDendrtQTY[LAY]; // signal qty depends on size of previous layer
//for (int k=0; k<iNeuronQTY; ++k){ //Neuron zero is not skipped, its output should be 1+0i as a check
for (long offset=0; (kindex =offset+tIx)< iNeuronQTY ; offset+=lTotalThreads){ // kindex stands for k
Zs[Net.iNeuronOfst[LAY]+kindex]=cdcZero;
for (int i=0; i<iSignalQTY; ++i){ //walk weights on inputs from previous layer
Zs[Net.iNeuronOfst[LAY]+kindex] =
CxAddCxUT( Zs[Net.iNeuronOfst[LAY]+kindex] ,
CxMultiplyCxUT(
Wt[IDX2C( Net.iWeightOfst[LAY] + i, kindex, iSignalQTY )],
Signals[Net.iNeuronOfst[TRIB]+i] ) ) ;
}
// ACTIVATE //
Signals[Net.iNeuronOfst[LAY]+kindex] = CxActivateUT( Zs[Net.iNeuronOfst[LAY]+kindex]);
}
}
/*! last layer values are converted and stored here */
long TOP = Net.iLayerQty-1;
long OUTROWLEN=Net.iNeuronQTY[TOP];
//for (int i=0; i<Net.iNeuronQTY[TOP]; ++i){ // continuous conversion begins here
for (long offset=0; (index =offset+tIx)< OUTROWLEN ; offset+=lTotalThreads){ // index stands for i
YEval[IDX2C( index, s, OUTROWLEN )]= Signals[Net.iNeuronOfst[TOP]+index] ; // store final complex output(s)
dYEval[IDX2C( index, s, OUTROWLEN )]=FUnitCxUT( YEval[IDX2C( index, s, OUTROWLEN )] ) * Net.iSectorQty; // convert final complex outputs to sectors and store that
if(devLearn.iContOutputs==false) // round off decimal if disc activation is set
dYEval[IDX2C( index, s, OUTROWLEN )]=int(dYEval[IDX2C( index, s, OUTROWLEN )]);
}
/*! end of sample evaluation. */
}
__device__ cuDoubleComplex CxActivateUT(const cuDoubleComplex Z)
{/// applies ContActivation or discrete activation function to cx neuron output and returns Phi(Z)
/// This fn should be phased out in favor of a GPU device vector based fn
cuDoubleComplex phi;
if (devNet.bContActivation) { // apply ContActivation activation function to weighted sum : phi(z)=z/|z|
phi = CxDivideRlUT( Z, CxAbsUT( Z ) );
}
else { // apply Discrete activation function to weighted sum : s=int(arctan(z)*k/2pi), phi(z)=(X(s),Y(s))
double theta = atan2(Z.y, Z.x); // theta = arctan y/x
int iSector = (int)((theta * devNet.dK_DIV_TWO_PI) + devNet.iSectorQty) % devNet.iSectorQty;
phi = devNet.gpuSectorBdry[iSector];
//printf(" %f+%fi %d Activate\n", phi.x, phi.y, threadIdx.x);
}
return phi;
}
So, I'm not going to read all that code, but I can give you a strong hint. The warp size is 32 threads, so the 64-thread case will run two warps/block -- in the former case you can't have any instruction pointer based race conditions, however, in the second case, you will effectively have two groups of threads with different IPs scheduled at different times. You may already know much of this (hence the syncthreads), but the above really makes it almost certain that you simply have one more race condition you haven't accounted for yet.
Putting in the sync-threads is a good start to try and isolate it. Are you sure that in your loops, the source data of one warp is not overwritten by the other warp? If not try put in syncthreads into your inner loops just for debug purposes to see what may be causing the race condition.