Why is this simple CUDA kernel getting a wrong result? - c++

I am a newbie with CUDA. I'm learning some basic things because I want to use CUDA in other project. I have wrote this code in order to add all the elements from a squared matrix 8x8 which has been filled with 1's so the result must be 64.
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
const int SIZE = 64;
__global__ void add_matrix_values(int* matrix, int sum, int c)
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
int j = threadIdx.y + blockIdx.x * blockDim.x;
sum += matrix[i*c+j];
}
int main()
{
int* device_matrix;
int* host_matrix;
int c = 8; //Squared matrix cxc
int device_c = 8;
int device_sum = 0;
int host_sum = 0;
//Allocate host memory
host_matrix = (int*)malloc(sizeof(int)*SIZE);
//Fill the matrix values with 1's
for(auto i = 0; i < SIZE; i++)
host_matrix[i] = 1;
//Allocate device memory
cudaMalloc((void**) &device_matrix,sizeof(int)*SIZE);
cudaMalloc((void**) &device_sum, sizeof(int));
cudaMalloc((void**) &device_c,sizeof(int));
//Fill device_matrix with host_matrix values
cudaMemcpy(&device_matrix,&host_matrix,sizeof(int)*SIZE,cudaMemcpyHostToDevice);
//Initialize device_sum with a 0
cudaMemcpy(&device_sum,&host_sum,sizeof(int),cudaMemcpyHostToDevice);
//Initialize device_c with the correct value
cudaMemcpy(&device_c,&c,sizeof(int),cudaMemcpyHostToDevice);
//4 blocks with 16 threads every single block ¿Is this correct?
add_matrix_values<<<4,16>>>(device_matrix, device_sum,device_c);
cudaMemcpy(&host_sum,&device_sum,sizeof(int),cudaMemcpyDeviceToHost);
std::cout<<"The value is: "<<host_sum<<std::endl;
cudaFree(device_matrix);
free(host_matrix);
return 0;
}
The result must be 64 but I'm getting wrong numbers.
migue#migue  ~/Escritorio  ./program
The value is: 32762
migue#migue  ~/Escritorio  ./program
The value is: 32608
migue#migue  ~/Escritorio  ./program
The value is: 32559
I dont't know what I'm doing wrong. It could be the gridSize and the blockSize ? or It could be the i and j operation in the cuda Kernel ?
I dont understand very well that terms.

There are a number of issues:
You are creating a 1-D grid (grid configuration, block configuration) so your 2-D indexing in kernel code (i,j, or x,y) doesn't make any sense
You are passing sum by value. You cannot retrieve a result that way. Changes in the kernel to sum won't be reflected in the calling environment. This is a C++ concept, not specific to CUDA. Use a properly allocated pointer instead.
In a CUDA multithreading environment, you cannot have multiple threads update the same location/value without any control. CUDA does not sort out that kind of access for you. You must use a parallel reduction technique, and a simplistic approach here could be to use atomics. You can find many questions here on the cuda tag discussing parallel reductions.
You're generally confusing pass by value and pass by pointer. Items passed by value can be ordinary host variables. You generally don't need a cudaMalloc allocation for those. You also don't use cudaMalloc on any kind of variable except a pointer.
Your use of cudaMemcpy is incorrect. There is no need to take the address of the pointers.
The following code has the above items addressed:
$ cat t135.cu
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
const int SIZE = 64;
__global__ void add_matrix_values(int* matrix, int *sum, int c)
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
atomicAdd(sum, matrix[i]);
}
int main()
{
int* device_matrix;
int* host_matrix;
int device_c = 8;
int *device_sum;
int host_sum = 0;
//Allocate host memory
host_matrix = (int*)malloc(sizeof(int)*SIZE);
//Fill the matrix values with 1's
for(auto i = 0; i < SIZE; i++)
host_matrix[i] = 1;
//Allocate device memory
cudaMalloc((void**) &device_matrix,sizeof(int)*SIZE);
cudaMalloc((void**) &device_sum, sizeof(int));
//Fill device_matrix with host_matrix values
cudaMemcpy(device_matrix,host_matrix,sizeof(int)*SIZE,cudaMemcpyHostToDevice);
//Initialize device_sum with a 0
cudaMemcpy(device_sum,&host_sum,sizeof(int),cudaMemcpyHostToDevice);
//4 blocks with 16 threads every single block ¿Is this correct?
add_matrix_values<<<4,16>>>(device_matrix, device_sum,device_c);
cudaMemcpy(&host_sum,device_sum,sizeof(int),cudaMemcpyDeviceToHost);
std::cout<<"The value is: "<<host_sum<<std::endl;
cudaFree(device_matrix);
free(host_matrix);
return 0;
}
$ nvcc -o t135 t135.cu
$ cuda-memcheck ./t135
========= CUDA-MEMCHECK
The value is: 64
========= ERROR SUMMARY: 0 errors
$

Related

Accessing memory allocated on CUDA [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 10 months ago.
Improve this question
I don't really have any experience with CUDA. I have C++ script that looks like the following
for (int i = 0; i < n; ++i) {
// out_data here is a pointer to some chunk of memory on a CPU
out_data[i] = manipulate_out_data_val(out_data[i]);
}
This is currently set up for CPUs. I would like to adapt this to work with GPU allocated arrays, i.e., if out_data was allocated on GPU, how can do I write the above loop?
I tried porting it over as is with a GPU-allocated array, and the program seg-faults.
I'm not sure if this is relevant, but manipulate_out_data_val applies a constant scaling factor to the input value and then adds a constant to the resulting scaled value.
So firstly, I will convert your function into a CUDA kernel which looks something like this.
__global__ void manipulate_out_data_val(int *array)
{
// Assuming `20` is just a scaling factor.
array[threadIdx.x] *= 20;
}
Please note that for loops will not be used anymore because of the threadIdx parameter that is provided by CUDA. The thread's index replaces the i from your for loop. Please refer to this document to learn more about the CUDA's threading model.
Lets assume the array can store up to 100 integers.
int n = 100;
int bytes = n * sizeof(int);
Initialise an array on the CPU first.
int *arr_cpu;
arr_cpu = (int *)malloc(bytes);
for(int i = 0;i < n;i++) {
arr_cpu[i] = i;
}
Allocate some memory on the GPU
int *arr_gpu;
cudaMalloc((void **)&arr_gpu, n*sizeof(int));
Now, you can copy your CPU array to this allocated GPU memory using the cudaMemcpu function. Note that Host indicates CPU and Device indicates GPU as stated here
cudaMemcpy(arr_gpu, arr_cpu, n * sizeof(int), cudaMemcpyHostToDevice);
Finally you can run your kernel. Note the number 1 in kernel syntax is number of blocks and n is number of threads per block.
manipulate_out_data_val<<<1, n>>>(arr_gpu);
Wait until the kernel is finished running
cudaDeviceSynchronize();
Finally, you can move the array from GPU back to CPU
cudaMemcpy(arr_cpu, arr_gpu, n * sizeof(int), cudaMemcpyDeviceToHost);
Please find the whole code here:
#include <iostream>
#include <cuda.h>
using namespace std;
__global__ void manipulate_out_data_val(int *array)
{
// Can add your constant scaling logic here
array[threadIdx.x] *= 20;
}
int main(int argc,char **argv)
{
int n = 100;
int bytes = n * sizeof(int);
int *arr_cpu;
arr_cpu = (int *)malloc(bytes);
for(int i=0;i<n;i++)
arr_cpu[i]=i;
int *arr_gpu;
cudaMalloc((void **)&arr_gpu, n*sizeof(int));
printf("Copying to device..\n");
cudaMemcpy(arr_gpu, arr_cpu, n * sizeof(int), cudaMemcpyHostToDevice);
manipulate_out_data_val<<<1, n>>>(arr_gpu);
cudaDeviceSynchronize();
cudaMemcpy(arr_cpu, arr_gpu, n * sizeof(int), cudaMemcpyDeviceToHost);
for(int i=0;i<n;i++)
printf("%d,", arr_cpu[i]);
cudaFree(arr_gpu);
return 0;
}
Build and run using:
# program.cu is the file containing the code
nvcc program.cu -o program
# Run
./program
The above code has been tested on CUDA 11.4.

Cuda: XOR single bitset with array of bitsets

I want to XOR a single bitset with a bunch of other bitsets (~100k) and count the set bits of every xor-result. The size of a single bitset is around 20k bits.
The bitsets are already converted to arrays of unsigned int to be able to use the intrinsic __popc()-function. The 'bunch' is already residing contiguously in device-memory.
My current kernel code looks like this:
// Grid/Blocks used for kernel invocation
dim3 block(32);
dim3 grid((bunch_size / 31) + 32);
__global__ void kernelXOR(uint * bitset, uint * bunch, int * set_bits, int bitset_size, int bunch_size) {
int tid = blockIdx.x*blockDim.x + threadIdx.x;
if (tid < bunch_size){ // 1 Thread for each bitset in the 'bunch'
int sum = 0;
uint xor_res = 0;
for (int i = 0; i < bitset_size; ++i){ // Iterate through every uint-block of the bitsets
xor_res = bitset[i] ^ bunch[bitset_size * tid + i];
sum += __popc(xor_res);
}
set_bits[tid] = sum;
}
}
However, compared to a parallelized c++/boost version, I see no benefit using Cuda.
Is there any potential in optimizing this kernel?
Is there any potential in optimizing this kernel?
I see 2 problems here (and they are the first two classical primary optimizations objectives for any CUDA programmer):
You want to try to efficiently use global memory. Your accesses to bitset and bunch are not coalesced. (efficiently use the memory subsystems)
The use of 32 threads per block is generally not recommended and could limit your overall occupancy. One thread per bitset is also potentially problematic. (expose enough parallelism)
Whether addressing those issues will meet your definition of benefit is impossible to say without a comparison test case. Furthermore, simple memory-bound problems like this are rarely interesting in CUDA when considered by themselves. However, we can (probably) improve the performance of your kernel.
We'll use a laundry list of ideas:
have each block handle a bitset, rather than each thread, to enable coalescing
use shared memory to load the comparison bitset, and reuse it
use just enough blocks to saturate the GPU, along with striding loops
use const ... __restrict__ style decoration to possibly benefit from RO cache
Here's a worked example:
$ cat t1649.cu
#include <iostream>
#include <cstdlib>
const int my_bitset_size = 20000/(32);
const int my_bunch_size = 100000;
typedef unsigned uint;
//using one thread per bitset in the bunch
__global__ void kernelXOR(uint * bitset, uint * bunch, int * set_bits, int bitset_size, int bunch_size) {
int tid = blockIdx.x*blockDim.x + threadIdx.x;
if (tid < bunch_size){ // 1 Thread for each bitset in the 'bunch'
int sum = 0;
uint xor_res = 0;
for (int i = 0; i < bitset_size; ++i){ // Iterate through every uint-block of the bitsets
xor_res = bitset[i] ^ bunch[bitset_size * tid + i];
sum += __popc(xor_res);
}
set_bits[tid] = sum;
}
}
const int nTPB = 256;
// one block per bitset, multiple bitsets per block
__global__ void kernelXOR_imp(const uint * __restrict__ bitset, const uint * __restrict__ bunch, int * __restrict__ set_bits, int bitset_size, int bunch_size) {
__shared__ uint sbitset[my_bitset_size]; // could also be dynamically allocated for varying bitset sizes
__shared__ int ssum[nTPB];
// load shared, block-stride loop
for (int idx = threadIdx.x; idx < bitset_size; idx += blockDim.x) sbitset[idx] = bitset[idx];
__syncthreads();
// stride across all bitsets in bunch
for (int bidx = blockIdx.x; bidx < bunch_size; bidx += gridDim.x){
int my_sum = 0;
for (int idx = threadIdx.x; idx < bitset_size; idx += blockDim.x) my_sum += __popc(sbitset[idx] ^ bunch[bidx*bitset_size + idx]);
// block level parallel reduction
ssum[threadIdx.x] = my_sum;
for (int ridx = nTPB>>1; ridx > 0; ridx >>=1){
__syncthreads();
if (threadIdx.x < ridx) ssum[threadIdx.x] += ssum[threadIdx.x+ridx];}
if (!threadIdx.x) set_bits[bidx] = ssum[0];}
}
int main(){
// data setup
uint *d_cbitset, *d_bitsets, *h_cbitset, *h_bitsets;
int *d_r, *h_r, *h_ri;
h_cbitset = new uint[my_bitset_size];
h_bitsets = new uint[my_bitset_size*my_bunch_size];
h_r = new int[my_bunch_size];
h_ri = new int[my_bunch_size];
for (int i = 0; i < my_bitset_size*my_bunch_size; i++){
h_bitsets[i] = rand();
if (i < my_bitset_size) h_cbitset[i] = rand();}
cudaMalloc(&d_cbitset, my_bitset_size*sizeof(uint));
cudaMalloc(&d_bitsets, my_bitset_size*my_bunch_size*sizeof(uint));
cudaMalloc(&d_r, my_bunch_size*sizeof(int));
cudaMemcpy(d_cbitset, h_cbitset, my_bitset_size*sizeof(uint), cudaMemcpyHostToDevice);
cudaMemcpy(d_bitsets, h_bitsets, my_bitset_size*my_bunch_size*sizeof(uint), cudaMemcpyHostToDevice);
// original
// Grid/Blocks used for kernel invocation
dim3 block(32);
dim3 grid((my_bunch_size / 31) + 32);
kernelXOR<<<grid, block>>>(d_cbitset, d_bitsets, d_r, my_bitset_size, my_bunch_size);
cudaMemcpy(h_r, d_r, my_bunch_size*sizeof(int), cudaMemcpyDeviceToHost);
// improved
dim3 iblock(nTPB);
dim3 igrid(640);
kernelXOR_imp<<<igrid, iblock>>>(d_cbitset, d_bitsets, d_r, my_bitset_size, my_bunch_size);
cudaMemcpy(h_ri, d_r, my_bunch_size*sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < my_bunch_size; i++)
if (h_r[i] != h_ri[i]) {std::cout << "mismatch at i: " << i << " was: " << h_ri[i] << " should be: " << h_r[i] << std::endl; return 0;}
std::cout << "Results match." << std::endl;
return 0;
}
$ nvcc -o t1649 t1649.cu
$ cuda-memcheck ./t1649
========= CUDA-MEMCHECK
Results match.
========= ERROR SUMMARY: 0 errors
$ nvprof ./t1649
==18868== NVPROF is profiling process 18868, command: ./t1649
Results match.
==18868== Profiling application: ./t1649
==18868== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 97.06% 71.113ms 2 35.557ms 2.3040us 71.111ms [CUDA memcpy HtoD]
2.26% 1.6563ms 1 1.6563ms 1.6563ms 1.6563ms kernelXOR(unsigned int*, unsigned int*, int*, int, int)
0.59% 432.68us 1 432.68us 432.68us 432.68us kernelXOR_imp(unsigned int const *, unsigned int const *, int*, int, int)
0.09% 64.770us 2 32.385us 31.873us 32.897us [CUDA memcpy DtoH]
API calls: 78.20% 305.44ms 3 101.81ms 11.373us 304.85ms cudaMalloc
18.99% 74.161ms 4 18.540ms 31.554us 71.403ms cudaMemcpy
1.39% 5.4121ms 4 1.3530ms 675.30us 3.3410ms cuDeviceTotalMem
1.26% 4.9393ms 388 12.730us 303ns 530.95us cuDeviceGetAttribute
0.11% 442.37us 4 110.59us 102.61us 125.59us cuDeviceGetName
0.03% 128.18us 2 64.088us 21.789us 106.39us cudaLaunchKernel
0.01% 35.764us 4 8.9410us 2.9670us 18.982us cuDeviceGetPCIBusId
0.00% 8.3090us 8 1.0380us 540ns 1.3870us cuDeviceGet
0.00% 5.9530us 3 1.9840us 310ns 3.9900us cuDeviceGetCount
0.00% 2.8800us 4 720ns 574ns 960ns cuDeviceGetUuid
$
In this case, on my Tesla V100, for your problem size, I witness about a 4x improvement in kernel performance. However the kernel performance here is tiny compared to the cost of data movement. So it's unlikely that these sort of optimizations would make a significant difference in your comparison test case, if this is the only thing you are doing on the GPU.
The code above uses striding-loops at the block level and at the grid level, which means it should behave correctly for almost any choice of threadblock size (multiple of 32 please) as well as grid size. That doesn't mean that any/all choices will perform equally. The choice of the threadblock size is to allow the possibility for nearly full occupancy (so don't choose 32). The choice of the grid size is the number of blocks to achieve full occupancy per SM, times the number of SMs. These should be nearly optimal choices, but according to my testing e.g. a larger number of blocks doesn't really reduce performance, and the performance should be roughly constant for nearly any threadblock size (except 32), assuming the number of blocks is calculated accordingly.

Wrong results with CUDA threads writing on private locations in global memory

EDIT 3:
I need each thread to write and read a private location in global memory. Below I post a working code showing my problem. In the following, I'll list the main variables and structures involved.
Variables:
srcArr_h (host) --> srcArr_d (device) : array of random floats in the range [0, COLORLEVELS] with dimensions given by ARRDIM
auxD (device) : array of dimension ARRDIM * ARRDIM holding the final result in device
auxH (host) : array of dimension ARRDIM * ARRDIM holding the final result in host
c_glob_d (device) : array that reserves a private location of COLORLEVELS floats for each thread, with size given by num_threads * COLORLEVELS
idx (device) : identification number of current thread
My problem: in the kernel, I update c_glob[idx] for each value ic (ic∈ [0, COLORLEVELS]), i.e. c_glob[idx][ic]. I use c_glob[idx][COLORLEVELS] to compute the final result g0 stored in auxD. My problem is that my final results are wrong. Results copied to auxH show that I get numbers at least one order of magnitude bigger then expected or even weird numbers suggesting my operation is likely to overflow.
Help: what am I doing wrong? How can I make each thread to write and read each private location in global memory? Right now I'm debugging with ARRDIM = 512, but my goal is to make it work for ARRDIM~ 10^4, thus creating a c_glob array for 10^4*10^4 threads). I guess I will have issues with the total number of threads allowed per run.. So I was wondering if you could suggest any other solution to my problem.
Thank you.
#include <string>
#include <stdint.h>
#include <iostream>
#include <stdio.h>
#include "cuPrintf.cu"
using namespace std;
#define ARRDIM 512
#define COLORLEVELS 4
__global__ void gpuKernel
(
float *sa, float *aux,
size_t memPitchAux, int w,
float *c_glob
)
{
float sc_loc[COLORLEVELS];
float g0=0.0f;
int tidx = blockIdx.x * blockDim.x + threadIdx.x;
int tidy = blockIdx.y * blockDim.y + threadIdx.y;
int idx = tidy * memPitchAux/4 + tidx;
for(int ic=0; ic<COLORLEVELS; ic++)
{
sc_loc[ic] = ((float)(ic*ic));
}
for(int is=0; is<COLORLEVELS; is++)
{
int ic = fabs(sa[tidy*w +tidx]);
c_glob[tidy * COLORLEVELS + tidx + ic] += 1.0f;
}
for(int ic=0; ic<COLORLEVELS; ic++)
{
g0 += c_glob[tidy * COLORLEVELS + tidx + ic]*sc_loc[ic];
}
aux[idx] = g0;
}
int main(int argc, char* argv[])
{
/*
* array src host and device
*/
int heightSrc = ARRDIM;
int widthSrc = ARRDIM;
cudaSetDevice(0);
float *srcArr_h, *srcArr_d;
size_t nBytesSrcArr = sizeof(float)*heightSrc * widthSrc;
srcArr_h = (float *)malloc(nBytesSrcArr); // Allocate array on host
cudaMalloc((void **) &srcArr_d, nBytesSrcArr); // Allocate array on device
cudaMemset((void*)srcArr_d,0,nBytesSrcArr); // set to zero
int totArrElm = heightSrc*widthSrc;
for(int ic=0; ic<totArrElm; ic++)
{
srcArr_h[ic] = (float)(rand() % COLORLEVELS);
}
cudaMemcpy( srcArr_d, srcArr_h,nBytesSrcArr,cudaMemcpyHostToDevice);
/*
* auxiliary buffer auxD to save final results
*/
float *auxD;
size_t auxDPitch;
cudaMallocPitch((void**)&auxD,&auxDPitch,widthSrc*sizeof(float),heightSrc);
cudaMemset2D(auxD, auxDPitch, 0, widthSrc*sizeof(float), heightSrc);
/*
* auxiliary buffer auxH allocation + initialization on host
*/
size_t auxHPitch;
auxHPitch = widthSrc*sizeof(float);
float *auxH = (float *) malloc(heightSrc*auxHPitch);
/*
* kernel launch specs
*/
int thpb_x = 16;
int thpb_y = 16;
int blpg_x = (int) widthSrc/thpb_x;
int blpg_y = (int) heightSrc/thpb_y;
int num_threads = blpg_x * thpb_x + blpg_y * thpb_y;
/*
* c_glob: array that reserves a private location of COLORLEVELS floats for each thread
*/
int cglob_w = COLORLEVELS;
int cglob_h = num_threads;
float *c_glob_d;
size_t c_globDPitch;
cudaMallocPitch((void**)&c_glob_d,&c_globDPitch,cglob_w*sizeof(float),cglob_h);
cudaMemset2D(c_glob_d, c_globDPitch, 0, cglob_w*sizeof(float), cglob_h);
/*
* kernel launch
*/
dim3 dimBlock(thpb_x,thpb_y, 1);
dim3 dimGrid(blpg_x,blpg_y,1);
gpuKernel<<<dimGrid,dimBlock>>>(srcArr_d,auxD, auxDPitch, widthSrc, c_glob_d);
cudaThreadSynchronize();
cudaMemcpy2D(auxH,auxHPitch,
auxD,auxDPitch,
auxHPitch, heightSrc,
cudaMemcpyDeviceToHost);
cudaThreadSynchronize();
float min = auxH[0];
float max = auxH[0];
float f;
string str;
for(int i=0; i<widthSrc*heightSrc; i++)
{
if(min > auxH[i])
min = auxH[i];
if(max < auxH[i])
max = auxH[i];
}
cudaFree(srcArr_d);
cudaFree(auxD);
cudaFree(c_glob_d);
}
You decided neither not to show the whole code nor a reduced size thereof reproducing your problem. Therefore, it has not been possible to make tests and verify the possible solution below.
I think you have spot the source of the problem: multiple threads are trying to write to the same memory locations in parallel. This is a situation leading to race conditions. For an example, see the fourth slide of the presentation "CUDA C: race conditions, atomics, locks, mutex, and warps".
Race conditions have a brute-force solution: atomic functions. They are described at Section B.12 of the CUDA C Programming Guide. So you can try to fix your problem by changing the line
c[ic] += 1.0f;
to
atomicAdd(&c[ic],1);
You will pay this fix with performance: atomic operations serialize the code to avoid race conditions.
I have mentioned that atomic functions are a brute-force solution to your problem because it can be that, by properly rethinking the implementation, you can find a way to avoid them. But this is not possible to say as of now due to the very few details you provided.

Unable to find simple sum of 1 to 100 numbers in CUDA?

I am working on image processing algorithm using CUDA. In my algorithm i want to find sum of all pixels of image using CUDA kernel. so i made kernel method in cuda for measure sum of all pixels of 16 bit gray scale image, but i got wrong answer.
So i make simple program in cuda for find sum of 1 to 100 numbers and my code is below.
In my code i got not exact sum of that 1 to 100 numbers using GPU, but i got exact sum of that 1 to 100 numbers using CPU. So what i had done in that code ?
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <conio.h>
#include <malloc.h>
#include <limits>
#include <math.h>
using namespace std;
__global__ void computeMeanValue1(double *pixels,double *sum){
int x = threadIdx.x;
sum[0] = sum[0] + (pixels[(x)]);
__syncthreads();
}
int main(int argc, char **argv)
{
double *data;
double *dev_data;
double *dev_total;
double *total;
data=new double[(100) * sizeof(double)];
total=new double[(1) * sizeof(double)];
double cpuSum=0.0;
for(int i=0;i<100;i++){
data[i]=i+1;
cpuSum=cpuSum+data[i];
}
cout<<"CPU total = "<<cpuSum<<std::endl;
cudaMalloc( (void**)&dev_data, 100 * sizeof(double));
cudaMalloc( (void**)&dev_total, 1 * sizeof(double));
cudaMemcpy(dev_data, data, 100 * sizeof(double), cudaMemcpyHostToDevice);
computeMeanValue1<<<1,100>>>(dev_data,dev_total);
cudaDeviceSynchronize();
cudaMemcpy(total, dev_total, 1* sizeof(double), cudaMemcpyDeviceToHost);
cout<<"GPU total = "<<total[0]<<std::endl;
cudaFree(dev_data);
cudaFree(dev_total);
free(data);
free(total);
getch();
return 0;
}
All your threads are writing to the same memory location at the same time.
sum[0] = sum[0] + (pixels[(x)]);
You can't do this and expect to get the correct result. Your kernel needs to take a different approach to avoid writing to the same memory from different threads. The pattern usually employed for doing this is reduction. Simply put with a reduction each thread is responsible for summing a block of elements within the array and then storing the result. By employing a series of these reduction operations its possible to sum the entire contents of the array.
__global__ void block_sum(const float *input,
float *per_block_results,
const size_t n)
{
extern __shared__ float sdata[];
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
// load input into __shared__ memory
float x = 0;
if(i < n)
{
x = input[i];
}
sdata[threadIdx.x] = x;
__syncthreads();
// contiguous range pattern
for(int offset = blockDim.x / 2;
offset > 0;
offset >>= 1)
{
if(threadIdx.x < offset)
{
// add a partial sum upstream to our own
sdata[threadIdx.x] += sdata[threadIdx.x + offset];
}
// wait until all threads in the block have
// updated their partial sums
__syncthreads();
}
// thread 0 writes the final result
if(threadIdx.x == 0)
{
per_block_results[blockIdx.x] = sdata[0];
}
}
Each thread writes to a different location in sdata[threadIdx.x] there is no race condition. Threads are free to access other elements in sdata because they only read from them so there are no race conditions. Note the use of __syncthreads() to ensure that the operations to load data into sdata are complete before the threads start to read the data and the second call to __syncthreads() to ensure that all the summation operations have completed before copying the final result from sdata[0]. Note that only thread 0 writes its result to per_block_results[blockIdx.x], so there is no race condition there either.
You can find the complete sample code for the above on Google Code (I did not write this). This slide deck has a reasonable summary of reductions in CUDA. It includes diagrams which really help in understanding how the interleaved memory reads and writes do not conflict with each other.
You can find lots of other material on efficient implementations of reduction on GPUs. Ensuring that your implementation makes most efficient use of memory is key to getting the best performance out of a memory bound operation like reduction.
In GPU code, we have multiple threads executing in parallel. If all of those threads attempt to update the same location in memory, we have undefined behavior, unless we use special operations, called atomics to do the update.
In your case, since sum is updated by all threads, and sum is a double quantity, we can use the special custom atomic function described in the programming guide to accomplish this.
If I replace your kernel code with the following:
__device__ double atomicAdd(double* address, double val)
{
unsigned long long int* address_as_ull =
(unsigned long long int*)address;
unsigned long long int old = *address_as_ull, assumed;
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed,
__double_as_longlong(val +
__longlong_as_double(assumed)));
} while (assumed != old);
return __longlong_as_double(old);
}
__global__ void computeMeanValue1(double *pixels,double *sum){
int x = threadIdx.x;
atomicAdd(sum, pixels[x]);
}
And initialize the sum value to zero before the kernel:
double gpuSum = 0.0;
cudaMemcpy(dev_total, &gpuSum, sizeof(double), cudaMemcpyHostToDevice);
Then I think you'll get matching results.
As #AdeMiller pointed out, the faster way to perform parallel sums like this is via classical parallel reduction.
There is a CUDA sample code that demonstrates this and an accompanying presentation that covers the methodology.

CUDA counting, reduction and thread warps

I'm trying to create a cuda program that counts the number of true values (defined by non-zero values) in a long vector through a reduction algorithm. I'm getting funny results. I get either 0 or (ceil(N/threadsPerBlock)*threadsPerBlock), neither is correct.
__global__ void count_reduce_logical(int * l, int * cntl, int N){
// suml is assumed to blockDim.x long and hold the partial counts
__shared__ int cache[threadsPerBlock];
int cidx = threadIdx.x;
int tid = threadIdx.x + blockIdx.x*blockDim.x;
int cnt_tmp=0;
while(tid<N){
if(l[tid]!=0)
cnt_tmp++;
tid+=blockDim.x*gridDim.x;
}
cache[cidx]=cnt_tmp;
__syncthreads();
//reduce
int k =blockDim.x/2;
while(k!=0){
if(threadIdx.x<k)
cache[cidx] += cache[cidx];
__syncthreads();
k/=2;
}
if(cidx==0)
cntl[blockIdx.x] = cache[0];
}
The host code then collects the cntl results and finishes summation. This is going to be part of a larger project where the data is already on the GPU, so it makes sense to do the computations there, if they work correctly.
You can count the nonzero-values with a single line of code using Thrust. Here's a code snippet that counts the number of 1s in a device_vector.
#include <thrust/count.h>
#include <thrust/device_vector.h>
...
// put three 1s in a device_vector
thrust::device_vector<int> vec(5,0);
vec[1] = 1;
vec[3] = 1;
vec[4] = 1;
// count the 1s
int result = thrust::count(vec.begin(), vec.end(), 1);
// result == 3
If your data does not live inside a device_vector you can still use thrust::count by wrapping the raw pointers.
In your reduction you're doing:
cache[cidx] += cache[cidx];
Don't you want to be poking at the other half of the block's local values?