C++/CUDA: Calculating maximum gridSize and blockSize dynamically - c++

I want to find a way to dynamically calculate the grid and block size needed for a computation. The problem I am trying to handle is simply too large to process in a single run of the GPU, from a thread-limit perspective. Here is a sample kernel setup that runs into the error I am having:
__global__ void populateMatrixKernel(char * outMatrix, const int pointsToPopulate)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < pointsToPopulate)
    {
        outMatrix[i] = 'A';
    }
}
cudaError_t populateMatrixCUDA(char * outMatrix, const int pointsToPopulate, cudaDeviceProp &deviceProp)
{
    //Device arrays to be used
    char * dev_outMatrix = 0;
    cudaError_t cudaStatus;
    //THIS IS THE CODE HERE I'M WANTING TO REPLACE
    //Calculate the block and grid parameters
    auto gridDiv = div(pointsToPopulate, deviceProp.maxThreadsPerBlock);
    auto gridX = gridDiv.quot;
    if (gridDiv.rem != 0)
        gridX++; //Round up if we have straggling points to populate
    auto blockSize = deviceProp.maxThreadsPerBlock;
    int gridSize = min(16 * deviceProp.multiProcessorCount, gridX);
    //END REPLACE CODE
    //Allocate GPU buffers
    cudaStatus = cudaMalloc((void**)&dev_outMatrix, pointsToPopulate * sizeof(char));
    if (cudaStatus != cudaSuccess)
    {
        cerr << "cudaMalloc failed!" << endl;
        goto Error;
    }
    populateMatrixKernel<<<gridSize, blockSize>>>(dev_outMatrix, pointsToPopulate);
    //Check for errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess)
    {
        cerr << "Population launch failed: " << cudaGetErrorString(cudaStatus) << endl;
        goto Error;
    }
    //Wait for threads to finish
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        cerr << "cudaDeviceSynchronize returned error code " << cudaStatus << " after launching the population kernel!" << endl;
        cout << "Cuda failure " << __FILE__ << ":" << __LINE__ << " '" << cudaGetErrorString(cudaStatus);
        goto Error;
    }
    //Copy output to host memory
    cudaStatus = cudaMemcpy(outMatrix, dev_outMatrix, pointsToPopulate * sizeof(char), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        cerr << "cudaMemcpy failed!" << endl;
        goto Error;
    }
Error:
    cudaFree(dev_outMatrix);
    return cudaStatus;
}
Now, when I test this code using the following testing setup:
//Make sure we can use the graphics card (This calculation would be unreasonable otherwise)
if (cudaSetDevice(0) != cudaSuccess) {
    cerr << "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?" << endl;
}
cudaDeviceProp deviceProp;
cudaError_t cudaResult;
cudaResult = cudaGetDeviceProperties(&deviceProp, 0);
if (cudaResult != cudaSuccess)
{
    cerr << "cudaGetDeviceProperties failed!" << endl;
}
int pointsToPopulate = 250000 * 300;
auto gpuMatrix = new char[pointsToPopulate];
fill(gpuMatrix, gpuMatrix + pointsToPopulate, 'B');
populateMatrixCUDA(gpuMatrix, pointsToPopulate, deviceProp);
for (int i = 0; i < pointsToPopulate; ++i)
{
    if (gpuMatrix[i] != 'A')
    {
        cout << "ERROR: " << i << endl;
        cin.get();
    }
}
I get an error at i=81920. Moreover, if I check the memory before and after the execution, all of the memory values after index 81920 change from 'B' to null. It seems that this error originates from this line in the kernel execution parameter code:
int gridSize = min(16 * deviceProp.multiProcessorCount, gridX);
For my graphics card (a GTX 980M), deviceProp.multiProcessorCount comes out as 5, and if I multiply this by 16 and by 1024 (the maximum threads per block) I get the 81920. It seems that, while I am fine on the memory side of things, I am being choked by how many threads I can run. Now, this 16 is just an arbitrary value (picked after looking at some example code a friend of mine made). I was wondering if there is a way to actually calculate what that 16 should be based on the GPU's properties, instead of setting it arbitrarily. I want to write iterative code that can determine the maximum number of calculations that can be performed at one point in time, and then fill the matrix piece by piece accordingly, but I need to know that maximum to do this. Does anyone know of a way to calculate these parameters? If any more information is needed, I'm happy to oblige. Thank you!

There is fundamentally nothing wrong with the code you have posted; it is probably close to best practice. But it isn't compatible with the one-element-per-thread design idiom of your kernel.
As you can see here, your GPU is capable of running 2^31 - 1, i.e. 2147483647, blocks in the x-dimension of the grid. So you could change the code in question to this:
unsigned int gridSize = min(2147483647u, gridX);
and it should probably work. Better still, don't change that code at all, but change your kernel to something like this:
__global__ void populateMatrixKernel(char * outMatrix, const int pointsToPopulate)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for(; i < pointsToPopulate; i += blockDim.x * gridDim.x)
    {
        outMatrix[i] = 'A';
    }
}
That way your kernel will emit multiple outputs per thread, and everything should just work as intended.
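If you do want to derive the launch parameters from the device's properties rather than hard-coding a multiplier like 16, the CUDA occupancy API (available since CUDA 6.5) can suggest them for you. A minimal sketch, assuming the grid-stride kernel above and the dev_outMatrix buffer from your code:
//Sketch: let the runtime suggest a block size and the minimum grid size
//that achieve maximum occupancy for this kernel.
int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, populateMatrixKernel, 0, 0);
//Cover the data, but don't launch more blocks than the data needs.
int gridX = (pointsToPopulate + blockSize - 1) / blockSize;
int gridSize = min(minGridSize, gridX);
populateMatrixKernel<<<gridSize, blockSize>>>(dev_outMatrix, pointsToPopulate);
The grid-stride loop then picks up however many elements remain beyond gridSize * blockSize.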

Related

OpenCL code get stuck on write output buffer

I have this piece of code in OpenCL:
std::string src = "__kernel void dot_product(__global float* weights,"
                  "__global float* values,"
                  "__global float* result,"
                  "__const unsigned int sz){"
                  "float dot = 0.f;"
                  "unsigned int i;"
                  "int current_idx = get_global_id(0);"
                  "unsigned int offset = current_idx * sz;"
                  "for( i = 0; i < sz; ++i )"
                  "{"
                  "dot += weights[ offset + i ] * values[ offset + i ];"
                  "}"
                  "result[current_idx] = dot;"
                  "}";
It gets stuck on result[current_idx] = dot;. If I comment out this line, everything works well. I don't get why it gets stuck.
The relevant C++ code is here:
using namespace cl;
std::array< float, CONST_INPUTS_NUMBER * CONST_NEURONS_NUMBER > in_weights;
std::array< float, CONST_INPUTS_NUMBER * CONST_NEURONS_NUMBER > in_values;
// Create a command queue and use the first device
const std::size_t size = in_weights.size();
std::vector< Device > devices = m_context.getInfo< CL_CONTEXT_DEVICES >();
Buffer weights(m_context, CL_MEM_READ_ONLY, size * sizeof(float));
Buffer values(m_context, CL_MEM_READ_ONLY, size * sizeof(float));
Buffer product(m_context, CL_MEM_WRITE_ONLY, CONST_NEURONS_NUMBER * sizeof(float));
std::cout << __FILE__ << __LINE__ << std::endl;
// Set arguments to kernel
m_kernel.setArg(0, weights);
m_kernel.setArg(1, values);
m_kernel.setArg(2, product);
m_kernel.setArg(3, CONST_INPUTS_NUMBER);
CommandQueue queue(m_context, devices[0]);
try {
    std::vector< float > dotProducts(CONST_NEURONS_NUMBER);
    for(std::size_t i = 0; i < CONST_NEURONS_NUMBER; ++i) {
        // Create memory buffers
        for(std::size_t j = 0; j < CONST_INPUTS_NUMBER; ++j) {
            const std::size_t index = i * CONST_INPUTS_NUMBER + j;
            in_weights[index] = m_internal[i][j].weight;
            in_values[index] = m_internal[i][j].value;
        }
    }
    queue.enqueueWriteBuffer(weights, CL_TRUE, 0,
                             in_weights.size() * sizeof(float),
                             in_weights.data());
    queue.enqueueWriteBuffer(values, CL_TRUE, 0,
                             in_values.size() * sizeof(float),
                             in_values.data());
    for(std::size_t offset = 0; offset < CONST_NEURONS_NUMBER; ++offset) {
        queue.enqueueNDRangeKernel(m_kernel,
                                   cl::NDRange(offset),
                                   cl::NDRange(CONST_INPUTS_NUMBER));
    }
    std::cout << __FILE__ << __LINE__ << std::endl;
    queue.enqueueReadBuffer(product, CL_TRUE, 0,
                            CONST_NEURONS_NUMBER * sizeof(float),
                            dotProducts.data());
    std::cout << __FILE__ << __LINE__ << std::endl;
    for(std::size_t i = 0; i < CONST_NEURONS_NUMBER; ++i) {
        std::cout << __FILE__ << __LINE__ << std::endl;
        m_internal[i].calculateOutput(dotProducts.begin(), dotProducts.end());
    }
} catch(const cl::Error& e) {
    cl_int err;
    cl::STRING_CLASS buildlog =
        m_program.getBuildInfo< CL_PROGRAM_BUILD_LOG >(devices[0], &err);
    std::cout << "Building error! Log: " << buildlog << std::endl;
}
It gets stuck on result[current_idx] = dot;. If I comment out this line, everything works well. I don't get why it gets stuck.
When you comment out the line where the calculated results are written to the output buffer, all of the calculations are quite likely removed by the optimizer, leaving your kernel empty.
I think the problem is here:
for(std::size_t offset = 0; offset < CONST_NEURONS_NUMBER; ++offset) {
    queue.enqueueNDRangeKernel(m_kernel, cl::NDRange(offset), cl::NDRange(CONST_INPUTS_NUMBER));
}
Specifically, you loop over enqueuing many kernels that all work on the same output buffer, which causes the kernels to fight for access to the same buffer, and the results get overwritten anyway.
You need to enqueue the kernel only once without offset and with CONST_NEURONS_NUMBER global work items:
queue.enqueueNDRangeKernel(m_kernel, cl::NullRange, cl::NDRange(CONST_NEURONS_NUMBER));
CONST_INPUTS_NUMBER is already passed as a kernel argument.
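For completeness, a minimal sketch of the corrected sequence, reusing m_kernel, queue, product and dotProducts from the question:
// One launch covering all neurons; each work item computes one dot product.
queue.enqueueNDRangeKernel(m_kernel, cl::NullRange, cl::NDRange(CONST_NEURONS_NUMBER));
// Blocking read (CL_TRUE) ensures the kernel has finished before the data is used.
queue.enqueueReadBuffer(product, CL_TRUE, 0,
                        CONST_NEURONS_NUMBER * sizeof(float),
                        dotProducts.data());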

CUDA Global Function Not Adding Array Values Correctly for Some Indexes

I am working through this CUDA video tutorial on YouTube. The code is provided in the second half of the video. It's a simple CUDA program to add the elements of two arrays. So if we had a first array called a and a second called b, the final value of a[i] would be:
a[i] += b[i];
The problem is that, no matter what I do, the first four elements of the final output are always bizarre numbers. The program creates random inputs for the arrays in the range 0 to 1000, which means the final output value at each index should be between zero and 2000. However, regardless of the random seed, the program always outputs a combination of absurdly large (out-of-range) numbers or zeros for the first four results.
For indexes greater than 3, the outputs seem to be fine. Here is my code:
#include <iostream>
#include <cuda.h>
#include <stdlib.h>
#include <ctime>
using namespace std;
__global__ void AddInts(int *a, int *b, int count){
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < count){
        a[id] += b[id];
    }
}
int main(){
    srand(time(NULL));
    int count = 100;
    int *h_a = new int[count];
    int *h_b = new int[count];
    for (int i = 0; i < count; i++){ // Populate arrays with 100 random values
        h_a[i] = rand() % 1000;      // elements in range 0 to 999
        h_b[i] = rand() % 1000;
    }
    cout << "Prior to addition:" << endl;
    for (int i = 0; i < 10; i++){ // Print out the first ten of each
        cout << h_a[i] << " " << h_b[i] << endl;
    }
    int *d_a, *d_b; // device copies of those arrays
    if(cudaMalloc(&d_a, sizeof(int) * count) != cudaSuccess)
    {
        cout << "Nope!";
        return -1;
    }
    if(cudaMalloc(&d_b, sizeof(int) * count) != cudaSuccess)
    {
        cout << "Nope!";
        cudaFree(d_a);
        return -2;
    }
    if(cudaMemcpy(d_a, h_a, sizeof(int) * count, cudaMemcpyHostToDevice) != cudaSuccess)
    {
        cout << "Could not copy!" << endl;
        cudaFree(d_a);
        cudaFree(d_b);
        return -3;
    }
    if(cudaMemcpy(d_b, h_b, sizeof(int) * count, cudaMemcpyHostToDevice) != cudaSuccess)
    {
        cout << "Could not copy!" << endl;
        cudaFree(d_b);
        cudaFree(d_a);
        return -4;
    }
    AddInts<<<count / 256 + 1, 256>>>(d_a, d_b, count); // integer division rounds down, so add one block
    if(cudaMemcpy(h_a, d_a, sizeof(int) * count, cudaMemcpyDeviceToHost) != cudaSuccess)
    {   // copy from device back to host
        delete[] h_a;
        delete[] h_b;
        cudaFree(d_a);
        cudaFree(d_b);
        cout << "Error: Copy data back to host failed" << endl;
        return -5;
    }
    delete[] h_a;
    delete[] h_b;
    cudaFree(d_a);
    cudaFree(d_b);
    for(int i = 0; i < 10; i++){
        cout << "It's " << h_a[i] << endl;
    }
    return 0;
}
I compiled with:
nvcc threads_blocks_grids.cu -o threads
The result of nvcc -version is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
And here is my output:
Prior to addition:
771 177
312 257
303 5
291 819
735 359
538 404
718 300
540 943
598 456
619 180
It's 42984048
It's 0
It's 42992112
It's 0
It's 1094
It's 942
It's 1018
It's 1483
It's 1054
It's 799
You deleted the host arrays before printing them. This is undefined behavior if you use <= C++11, and implementation-defined behavior for > C++11.
If you move the printing part up, before the deletes, it should be solved.
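A minimal sketch of the corrected end of main, using the same names as in the question:
// Print the results while h_a still points to valid memory...
for(int i = 0; i < 10; i++){
    cout << "It's " << h_a[i] << endl;
}
// ...and only then release host and device memory.
delete[] h_a;
delete[] h_b;
cudaFree(d_a);
cudaFree(d_b);
return 0;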

Power spectrum from FFTW not working, but in MATLAB it does

I am trying to do a power spectrum analysis with FFTW using the code below:
#define ALSA_PCM_NEW_HW_PARAMS_API
#include <iostream>
using namespace std;
#include <alsa/asoundlib.h>
#include <fftw3.h>
#include <math.h>
float map(long x, long in_min, long in_max, float out_min, float out_max)
{
    return (x - in_min) * (out_max - out_min) / (in_max - in_min) + out_min;
}
float windowFunction(int n, int N)
{
    return 0.5f * (1.0f - cosf(2.0f * M_PI * n / (N - 1.0f)));
}
int main() {
    //FFTW
    int N = 8000;
    float window[N];
    double *in = (double*)fftw_malloc(sizeof(double) * N);
    fftw_complex *out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N);
    fftw_plan p = fftw_plan_dft_r2c_1d(N, in, out, FFTW_MEASURE);
    for(int n = 0; n < N; n++)
        window[n] = windowFunction(n, N);
    //ALSA
    long loops;
    int rc;
    int size;
    snd_pcm_t *handle;
    snd_pcm_hw_params_t *params;
    unsigned int val;
    int dir = 0;
    snd_pcm_uframes_t frames;
    char *buffer;
    /* Open PCM device for recording (capture). */
    rc = snd_pcm_open(&handle, "default", SND_PCM_STREAM_CAPTURE, 0);
    if (rc < 0) {
        fprintf(stderr, "unable to open pcm device: %s\n", snd_strerror(rc));
        exit(1);
    }
    /* Allocate a hardware parameters object. */
    snd_pcm_hw_params_alloca(&params);
    /* Fill it in with default values. */
    snd_pcm_hw_params_any(handle, params);
    /* Set the desired hardware parameters. */
    /* Interleaved mode */
    snd_pcm_hw_params_set_access(handle, params, SND_PCM_ACCESS_RW_INTERLEAVED);
    /* Signed 16-bit little-endian format */
    snd_pcm_hw_params_set_format(handle, params, SND_PCM_FORMAT_S16_LE);
    /* One channel (mono) */
    snd_pcm_hw_params_set_channels(handle, params, 1);
    /* 8000 samples/second sampling rate */
    val = 8000;
    snd_pcm_hw_params_set_rate_near(handle, params, &val, &dir);
    /* Set period size to 16 frames. */
    frames = 16;
    snd_pcm_hw_params_set_period_size_near(handle, params, &frames, &dir);
    /* Write the parameters to the driver */
    rc = snd_pcm_hw_params(handle, params);
    if (rc < 0) {
        fprintf(stderr, "unable to set hw parameters: %s\n", snd_strerror(rc));
        exit(1);
    }
    /* Use a buffer large enough to hold one period */
    snd_pcm_hw_params_get_period_size(params, &frames, &dir);
    size = frames * 2; /* 2 bytes/sample, 1 channel */
    buffer = (char *) malloc(size);
    /* We want to loop for 5 seconds */
    snd_pcm_hw_params_get_period_time(params, &val, &dir);
    loops = 1000000 / val + 25; //added this, because the first values seem to be useless
    int count = 0;
    while (loops > 0) {
        loops--;
        rc = snd_pcm_readi(handle, buffer, frames);
        int i;
        short *samples = (short*)buffer;
        for (i = 0; i < 16; i++)
        {
            if(count > 24){
                //cout << (float)map(*samples, -32768, 32768, -1, 1) << endl;
                in[i*count] = /*window[i] * */ (double)map(*samples, -32768, 32768, -1, 1);
            }
            samples++;
        }
        count++;
        if (rc == -EPIPE) {
            /* EPIPE means overrun */
            fprintf(stderr, "overrun occurred\n");
            snd_pcm_prepare(handle);
        } else if (rc < 0) {
            fprintf(stderr, "error from read: %s\n", snd_strerror(rc));
        } else if (rc != (int)frames) {
            fprintf(stderr, "short read, read %d frames\n", rc);
        }
        // rc = write(1, buffer, size);
        // if (rc != size)
        //     fprintf(stderr, "short write: wrote %d bytes\n", rc);
    }
    snd_pcm_drain(handle);
    snd_pcm_close(handle);
    free(buffer);
    //FFTW
    fftw_execute(p);
    for(int j = 0; j < N/2; j++){
        //cout << in[j] << endl;
        cout << sqrt(out[j][0]*out[j][0] + out[j][1]*out[j][1]) / N << endl;
        /*if(out[j][1] < 0.0){
            cout << out[j][0] << out[j][1] << "i" << endl;
        }else{
            cout << out[j][0] << "+" << out[j][1] << "i" << endl;
        }*/
    }
    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    fftw_cleanup();
    return 0;
}
I am using 8000 samples for the FFTW, so I get 4000 values back, which should be the power spectrum. If I now plot the data in MATLAB, the plot does not look like a power spectrum. The input must be right, because if I uncomment this
//cout << (float)map(*samples, -32768, 32768, -1, 1) << endl;
and comment out that
cout << sqrt(out[j][0]*out[j][0]+out[j][1]*out[j][1])/N << endl;
and then load the output of the program (which is the input for the FFT) into MATLAB and do an FFT, the plotted data seems to be correct. I tested it with various frequencies, but I always get a weird spectrum when using my own program. As you can see, I also tried to apply a Hann window before the FFT, but still no success. So what am I doing wrong here?
Thanks a lot!
The usual suspect with FFT routines is a representation mismatch issue. That is to say, the FFT function fills in an array using one type, and you interpret that array as another type.
You debug this by creating a sine input. You know that this should give a single non-zero output, and you have a reasonable expectation of where that non-zero value should be. Because your implementation is wrong, the actual FFT of your sine will differ, and it's this difference that will help troubleshoot the problem.
If you can't figure it out from just the FFT of a sine, next try a cosine, sines of different frequencies, and combinations of two such simple inputs. A cosine is merely a phase-shifted sine, so that should just change the phase of that single non-zero value. And the FFT is linear, so the FFT of a sum of two sines has two sharp peaks.
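A minimal, self-contained sketch of that test, assuming the same r2c setup as in the question (N = 8000 samples at 8000 Hz) and a hypothetical 440 Hz test tone; the magnitude spectrum should peak at bin f*N/fs = 440:
#include <fftw3.h>
#include <math.h>
#include <stdio.h>
int main() {
    const int N = 8000;        // same size as in the question
    const double fs = 8000.0;  // sample rate in Hz
    const double f = 440.0;    // hypothetical test frequency
    double *in = (double*)fftw_malloc(sizeof(double) * N);
    fftw_complex *out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * (N/2 + 1));
    fftw_plan p = fftw_plan_dft_r2c_1d(N, in, out, FFTW_ESTIMATE);
    // Fill the input AFTER planning: FFTW_MEASURE-style planning may clobber the arrays.
    for (int n = 0; n < N; n++)
        in[n] = sin(2.0 * M_PI * f * n / fs);
    fftw_execute(p);
    // Find the peak of the magnitude spectrum and compare against the expected bin.
    int peak = 0;
    double peakMag = 0.0;
    for (int j = 0; j < N/2; j++) {
        double mag = sqrt(out[j][0]*out[j][0] + out[j][1]*out[j][1]) / N;
        if (mag > peakMag) { peakMag = mag; peak = j; }
    }
    printf("peak at bin %d (expected %d)\n", peak, (int)(f * N / fs));
    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
If this standalone test produces the expected single peak, the FFT handling is fine and the problem lies in how the capture loop fills the in array.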

CUDA tex1Dfetch() wrong behaviour

I'm very new to CUDA programming and I'm facing a problem which is driving me crazy. What's going on:
I have a very simple program (just for study purposes) where one 16x16 input image and one output image are created. The input image is initialized to the values 0..255 and is then bound to a texture. The CUDA kernel just copies the input image to the output image. The input image values are obtained by calling tex1Dfetch(), which returns very strange values in some cases. Please see the code below, the comments inside the kernel, and the output of the program. The code is complete and compilable, so you can create a CUDA project in VC and paste the code into the main ".cu" file.
Please help me! What am I doing wrong?
I'm using VS 2013 Community and CUDA SDK 6.5 + CUDA integration for VS 2013.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
texture<unsigned char> tex;
cudaError_t testMyKernel(unsigned char * inputImg, unsigned char * outputImg, int width, int height);
__global__ void myKernel(unsigned char *outImg, int width)
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int idx = row*width + col;
__shared__ unsigned char input;
__shared__ unsigned char input2;
unsigned char *outPix = outImg + idx;
//It fetches strange value, for example, when the idx==0 then the input is 51.
//But I expect that input==idx (according to the input image initialization).
input = tex1Dfetch(tex, idx);
printf("Fetched for idx=%d: %d\n", idx, input);
*outPix = input;
//Very strange is that when I test the following code then the tex1Dfetch() returns correct values.
if (idx == 0)
{
printf("\nKernel test print:\n");
for (int i = 0; i < 256; i++)
{
input2 = tex1Dfetch(tex, i);
printf("%d,", input2);
}
}
}
int main()
{
const int width = 16;
const int height = 16;
const int count = width * height;
unsigned char imgIn[count];
unsigned char imgOut[count];
for (int i = 0; i < count; i++)
{
imgIn[i] = i;
}
cudaError_t cudaStatus = testMyKernel(imgIn, imgOut, width, height);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "testMyKernel failed!");
return 1;
}
printf("\n\nOutput values:\n");
for (int i = 0; i < height; i++)
{
for (int j = 0; j < width; j++)
{
printf("%d,", imgOut[i * width + j]);
}
}
printf("\n");
cudaStatus = cudaDeviceReset();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceReset failed!");
return 1;
}
getchar();
return 0;
}
cudaError_t testMyKernel(unsigned char * inputImg, unsigned char * outputImg, int width, int height)
{
unsigned char * dev_in;
unsigned char * dev_out;
size_t size = width * height * sizeof(unsigned char);
cudaError_t cudaStatus;
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
goto Error;
}
// input data
cudaStatus = cudaMalloc((void**)&dev_in, size);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
cudaStatus = cudaMemcpy(dev_in, inputImg, size, cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
cudaStatus = cudaBindTexture(NULL, tex, dev_in, size);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaBindTexture failed!");
goto Error;
}
// output data
cudaStatus = cudaMalloc((void**)&dev_out, size);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
dim3 threadsPerBlock(4, 4);
int blk_x = width / threadsPerBlock.x;
int blk_y = height / threadsPerBlock.y;
dim3 numBlocks(blk_x, blk_y);
// Launch a kernel on the GPU with one thread for each element.
myKernel<<<numBlocks, threadsPerBlock>>>(dev_out, width);
cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "myKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
goto Error;
}
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching myKernel!\n", cudaStatus);
goto Error;
}
//copy output image to host
cudaStatus = cudaMemcpy(outputImg, dev_out, size, cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
Error:
cudaUnbindTexture(tex);
cudaFree(dev_in);
cudaFree(dev_out);
return cudaStatus;
}
And here is the output of the program (truncated a little):
Fetched for idx=0: 51
Fetched for idx=1: 51
Fetched for idx=2: 51
Fetched for idx=3: 51
Fetched for idx=16: 51
Fetched for idx=17: 51
Fetched for idx=18: 51
Fetched for idx=19: 51
Fetched for idx=32: 51
Fetched for idx=33: 51
Fetched for idx=34: 51
Fetched for idx=35: 51
Fetched for idx=48: 51
Fetched for idx=49: 51
Fetched for idx=50: 51
Fetched for idx=51: 51
Fetched for idx=192: 243
Fetched for idx=193: 243
Fetched for idx=194: 243
Fetched for idx=195: 243
Fetched for idx=208: 243
Fetched for idx=209: 243
Fetched for idx=210: 243
Fetched for idx=211: 243
Fetched for idx=224: 243
etc... (output truncated.. see the Output values)
Kernel test print:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,
30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56
etc...(correct values)
Output values:
51,51,51,51,55,55,55,55,59,59,59,59,63,63,63,63,51,51,51,51,55,55,55,55,59,59,59
,59,63,63,63,63,51,51,51,51,55,55,55,55,59,59,59,59,63,63,63,63,51,51,51,51,55,55,
etc.. (wrong values)
This line of the kernel
input = tex1Dfetch(tex, idx);
is causing a race condition among the threads of a block. All threads in a block try to fetch a value from the texture into the single __shared__ variable input simultaneously, causing undefined behavior. You should allocate separate shared memory space for each thread of the block, in the form of a __shared__ array.
For your current case, it may be something like:
__shared__ unsigned char input[16]; //4 x 4 block size
The rest of the kernel should look something like:
int idx_local = threadIdx.y * blockDim.x + threadIdx.x; //local id of thread in a block
input[idx_local] = tex1Dfetch(tex, idx);
printf("Fetched for idx=%d: %d\n", idx, input[idx_local]);
*outPix = input[idx_local];
The code inside the condition at the end of the kernel works fine because, due to the specified condition if (idx == 0), only the first thread of the first block does all the processing serially while all other threads remain idle, so the problem disappears in the absence of a race condition.
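Putting those fragments together, a corrected kernel might look like this (a sketch assuming the 4x4 block size from the question; a plain local variable would arguably work just as well here, since no data is shared between threads):
__global__ void myKernel(unsigned char *outImg, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = row * width + col;
    __shared__ unsigned char input[16]; // 4 x 4 block size
    int idx_local = threadIdx.y * blockDim.x + threadIdx.x; // local id within the block
    input[idx_local] = tex1Dfetch(tex, idx); // each thread now writes its own slot
    outImg[idx] = input[idx_local];
}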

Cuda not giving correct answer when array size is larger than 1,000,000

I have written a simple sum reduction code which seems to work just fine until I increase the array size to 1 million. What can be the problem?
#define BLOCK_SIZE 128
#define ARRAY_SIZE 10000
cudaError_t addWithCuda(const long *input, long *output, int totalBlocks, size_t size);
__global__ void sumKernel(const long *input, long *output)
{
    int tid = threadIdx.x;
    int bid = blockDim.x * blockIdx.x;
    __shared__ long data[BLOCK_SIZE];
    if(bid + tid < ARRAY_SIZE)
        data[tid] = input[bid + tid];
    else
        data[tid] = 0;
    __syncthreads();
    for(int i = BLOCK_SIZE/2; i >= 1; i >>= 1)
    {
        if(tid < i)
            data[tid] += data[tid + i];
        __syncthreads();
    }
    if(tid == 0)
        output[blockIdx.x] = data[0];
}
int main()
{
    int totalBlocks = ARRAY_SIZE/BLOCK_SIZE;
    if(ARRAY_SIZE % BLOCK_SIZE != 0)
        totalBlocks++;
    long *input = (long*) malloc(ARRAY_SIZE * sizeof(long));
    long *output = (long*) malloc(totalBlocks * sizeof(long));
    for(int i = 0; i < ARRAY_SIZE; i++)
    {
        input[i] = i+1;
    }
    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(input, output, totalBlocks, ARRAY_SIZE);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }
    long ans = 0;
    for(int i = 0; i < totalBlocks; i++)
    {
        ans = ans + output[i];
    }
    printf("Final Ans : %ld", ans);
    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
    getchar();
    return 0;
}
// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(const long *input, long *output, int totalBlocks, size_t size)
{
    long *dev_input = 0;
    long *dev_output = 0;
    cudaError_t cudaStatus;
    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        goto Error;
    }
    // Allocate GPU buffers for two vectors (one input, one output).
    cudaStatus = cudaMalloc((void**)&dev_input, size * sizeof(long));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_output, totalBlocks * sizeof(long));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_input, input, size * sizeof(long), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    cudaStatus = cudaMemcpy(dev_output, output, totalBlocks * sizeof(long), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    // Launch a kernel on the GPU with one thread for each element.
    sumKernel<<<totalBlocks, BLOCK_SIZE>>>(dev_input, dev_output);
    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }
    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(output, dev_output, totalBlocks * sizeof(long), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
Error:
    cudaFree(dev_input);
    cudaFree(dev_output);
    return cudaStatus;
}
And just for reference, in case it has something to do with my GPU device: my GPU is a GTX 650 Ti. Here is the info about the GPU:
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Actually, the answer could not fit in a long either, so after switching the datatypes to long double this issue was resolved. Thanks all!
One problem in your code is that your last cudaMemcpy is not set up correctly:
cudaMemcpy(output, dev_output, totalBlocks * sizeof(int), cudaMemcpyDeviceToHost);
All of your data is long data, so you should be copying using sizeof(long), not sizeof(int).
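That is, the corrected copy would be:
cudaMemcpy(output, dev_output, totalBlocks * sizeof(long), cudaMemcpyDeviceToHost);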
Another problem in your code is using the wrong printf format identifier for a long datatype:
printf("\n %d \n",output[i]);
use something like this instead:
printf("\n %ld \n",output[i]);
You may also have a problem with a large block count if you are not compiling for the sm_30 architecture. In that case, proper CUDA error checking would identify the problem.
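For example, a build line along these lines (sm_30 matches the GTX 650 Ti's compute capability 3.0, which raises the grid's x-dimension limit to 2^31 - 1; without it, older default targets cap gridDim.x at 65535):
nvcc -arch=sm_30 threads_blocks_grids.cu -o threads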
You don't check for errors after sumKernel<<<totalBlocks, BLOCK_SIZE>>>(dev_input, dev_output);. Normally, if you checked for the last occurred error, it would give the error invalid configuration argument. Try adding the following after the sumKernel line:
cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "sumKernel failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
}
See this question for more information about the error.