'Official' CUDA Reduction function can't accept certain numbers? - c++

I am currently trying to use the Reduction #3 outline in the CUDA PDF here.
Here is how my reduction function looks:
template <typename T>
__device__ void offsetReduction(planet<T> *bodies, T *outdata, int arrayIdent, int nbodies){
    extern __shared__ T sdata[];
    unsigned int tID = threadIdx.x;
    unsigned int i = tID + blockIdx.x * blockDim.x;
    // Load one momentum component per thread into shared memory
    if (arrayIdent == 1){
        if (i < nbodies){
            sdata[tID] = bodies[i].vx * bodies[i].mass;
        }
    }
    if (arrayIdent == 2){
        if (i < nbodies){
            sdata[tID] = (bodies[i].vy * bodies[i].mass);
        }
    }
    if (arrayIdent == 3){
        if (i < nbodies){
            sdata[tID] = (bodies[i].vz * bodies[i].mass);
        }
    }
    __syncthreads();
    // Tree reduction in shared memory
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>=1){
        if (tID < stride)
            sdata[tID] += sdata[tID + stride];
        __syncthreads();
    }
    // Thread 0 writes this block's partial sum
    if (tID == 0)
        outdata[blockIdx.x] = sdata[0];
}
However, it doesn't seem to be working correctly so I did some calculations.
I launch the same number of threads as 'int nbodies', and in my case I have chosen 5. So each of the 5 threads comes in and adds a value to sdata[] no problem. However once it gets to the addition part it goes wrong.
On the first iteration Thread 0 accesses sdata[3], Thread 1 accesses sdata[4] and the other threads do nothing. On the second iteration Thread 0 accesses sdata[1] and the other threads do nothing. The addition is then 'finished' and the kernel finishes. But sdata[2] is never added so I get an incorrect value stored at sdata[0].
Am I missing something really obvious? (I have been staring at this for a while so I probably have.)

This reduction code, like any other "tree-like" reduction operation, requires that the number of threads that participate in the shared memory reduction be equal to a power of 2 to work correctly.
Note that means you could design a reduction kernel which would run correctly for any multiple of 2 threads per block by having the nearest smaller power of 2 threads perform the actual reduction. The code you have posted cannot, however, work like that.
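To make the failure concrete, here is a minimal host-side sketch in plain C++ (not CUDA) that emulates the shared-memory loop for one block. The names tree_reduce and padded_tree_reduce are hypothetical, and zero-padding to the next power of two is one possible workaround I am assuming for illustration; it is not the approach described in the answer above. In this emulation, with 5 elements the first stride is 5 / 2 = 2 under integer division, so the last element (index 4) is the one that never gets added.

```cpp
#include <cstddef>
#include <vector>

// Emulates the shared-memory loop from the question on the CPU.
// `data` plays the role of sdata[] for a single block.
double tree_reduce(std::vector<double> data) {
    for (std::size_t stride = data.size() / 2; stride > 0; stride >>= 1) {
        // Each iteration of this inner loop is one "thread" with tID < stride.
        for (std::size_t tid = 0; tid < stride; ++tid) {
            data[tid] += data[tid + stride];
        }
    }
    return data.empty() ? 0.0 : data[0];
}

// Assumed workaround: pad with zeros (the identity for addition)
// up to the next power of two before reducing.
double padded_tree_reduce(std::vector<double> data) {
    std::size_t padded = 1;
    while (padded < data.size()) padded <<= 1;
    data.resize(padded, 0.0);
    return tree_reduce(data);
}
```

Calling tree_reduce on {1, 2, 3, 4, 5} drops the 5, while padded_tree_reduce sums all five values. In a kernel, the analogous change would be to launch a power-of-two block and have threads with i >= nbodies store 0 in sdata.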


Kernels Synchronisation

I'm new to CUDA programming and I'm implementing the classical Floyd APSP algorithm. This algorithm consists of 3 nested loops, and all the code inside the two inner loops can be executed in parallel.
As main parts of my code, here is the kernel code:
__global__ void dfloyd(double *dM, size_t k, size_t n)
{
    unsigned int x = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int y = threadIdx.y + blockIdx.y * blockDim.y;
    unsigned int index = y * n + x;
    double d;
    if (x < n && y < n)
    {
        // Relax the path from y to x through the pivot node k
        d=dM[x+k*n] + dM[k+y*n];
        if (d<dM[index])
            dM[index] = d;
    }
}
and here is the part from the main function where the kernels are launched (for readability I omitted error handling code):
double *dM;
cudaMalloc((void **)&dM, sizeof_M);
cudaMemcpy(dM, hM, sizeof_M, cudaMemcpyHostToDevice);
int dimx = 32;
int dimy = 32;
dim3 block(dimx, dimy);
dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
for (size_t k=0; k<n; k++)
{
    dfloyd<<<grid, block>>>(dM, k, n);
    cudaDeviceSynchronize();
}
cudaMemcpy(hM, dM, sizeof_M, cudaMemcpyDeviceToHost);
[For clarity: dM refers to the distance matrix stored on the device, hM to the copy on the host, and n to the number of nodes.]
Kernels inside the k-loop have to be executed serially, which is why I call cudaDeviceSynchronize() after each kernel launch.
However, I noticed that moving this synchronization call outside the loop leads to the same result.
Now, my question: are the following two pieces of code equivalent?
for (size_t k=0; k<n; k++)
{
    dfloyd<<<grid, block>>>(dM, k, n);
    cudaDeviceSynchronize();
}
and
for (size_t k=0; k<n; k++)
{
    dfloyd<<<grid, block>>>(dM, k, n);
}
cudaDeviceSynchronize();
They are not equivalent, but they will give the same results. The first makes the host wait after each kernel call until that kernel has returned, while the second makes it wait only once.
Maybe the confusing part is why it works: in CUDA, two consecutive kernel calls on the same stream (in your case, the default stream) are guaranteed to execute serially.
Performance-wise, the second version is advisable, since synchronizing with the host adds overhead.
Edit: in this specific case you do not even need to call cudaDeviceSynchronize(), because the cudaMemcpy will synchronize.
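As a reference for why only the order of the k steps matters, here is a small host-side sketch in plain C++ (not CUDA). The function names floyd_step and floyd are hypothetical; floyd_step stands in for one kernel launch and uses the same indexing as the kernel in the question. Each call must finish before the next begins, which is what in-order execution on a single stream provides on the GPU, while the x and y iterations inside one call are independent of each other.

```cpp
#include <cstddef>
#include <vector>

// One "kernel launch": relaxes every pair (x, y) through the pivot k.
// m is an n*n row-major matrix, m[y * n + x] is the distance from y to x.
void floyd_step(std::vector<double>& m, std::size_t k, std::size_t n) {
    for (std::size_t y = 0; y < n; ++y) {
        for (std::size_t x = 0; x < n; ++x) {
            double d = m[x + k * n] + m[k + y * n];
            if (d < m[y * n + x]) m[y * n + x] = d;
        }
    }
}

// The pivot steps run strictly in order, one after another.
void floyd(std::vector<double>& m, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k) floyd_step(m, k, n);
}
```

For a 3-node graph with edges 0->1 of weight 1, 1->2 of weight 2 and 0->2 of weight 10, the entry for 0->2 ends up as 3 after floyd returns.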

CUDA reduction, approach for big arrays

I have the following "Frankenstein" sum reduction code, taken partly from the common CUDA reduction slides, partly from the CUDA samples.
__global__ void reduce6(float *g_idata, float *g_odata, unsigned int n)
{
    extern __shared__ float sdata[];
    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    float mySum = 0;
    while (i < n) {
        sdata[tid] += g_idata[i] + g_idata[i+MAXTREADS];
        i += gridSize;
    }
    __syncthreads();
    // do reduction in shared mem
    if (tid < 256)
        sdata[tid] += sdata[tid + 256];
    __syncthreads();
    if (tid < 128)
        sdata[tid] += sdata[tid + 128];
    __syncthreads();
    if (tid < 64)
        sdata[tid] += sdata[tid + 64];
    __syncthreads();
#if (__CUDA_ARCH__ >= 300 )
    if ( tid < 32 )
    {
        // Fetch final intermediate sum from 2nd warp
        mySum = sdata[tid]+ sdata[tid + 32];
        // Reduce final warp using shuffle
        for (int offset = warpSize/2; offset > 0; offset /= 2)
            mySum += __shfl_down(mySum, offset);
    }
    // make the warp's result visible to the final write below
    if (tid == 0) sdata[0] = mySum;
#else
    // fully unroll reduction within a single warp
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }
#endif
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
I will be using this to reduce an unrolled array of big size (e.g. 512^3 = 134217728 = n) on a Tesla K40 GPU.
I have some questions regarding the blockSize variable, and its value.
From here on, I will try to explain my understanding (either right or wrong) on how it works:
The bigger I choose blockSize, the faster this code will execute, as it will spend less time in the whole loop. However, it will not finish reducing the whole array; it will return a smaller array of size dimBlock.x, right? If I use blockSize=1, this code would return the reduction value in one call, but it would be really slow because it exploits almost none of the power of CUDA. Therefore I need to call the reduction kernel several times, each time with a smaller blockSize, reducing the result of the previous call, until I get to the smallest point.
Something like this (pseudocode):
blocks=number; //where do we start? why?
while(not the min){
    dim3 dimBlock( blocks );
    dim3 dimGrid(n/dimBlock.x);
    int smemSize = dimBlock.x * sizeof(float);
    reduce6<<<dimGrid, dimBlock, smemSize>>>(in, out, n);
    dimGrid.x=n/dimBlock.x; // is this right? Should I also change dimBlock?
}
At which value should I start? I guess this is GPU dependent. Which values should it be for a Tesla K40 (just for me to understand how these values are chosen)?
Is my logic somehow flawed? How?
There is a CUDA tool that suggests good grid and block sizes for you: the CUDA Occupancy API.
In response to "The bigger I choose blockSize, the faster this code will execute" -- not necessarily, as you want the sizes which give maximum occupancy (the ratio of active warps to the total number of possible active warps).
See this answer for additional information: How do I choose grid and block dimensions for CUDA kernels?
Lastly, for NVIDIA GPUs of the Kepler generation or later, there are shuffle intrinsics that make reductions easier and faster. Here is an article on how to use the shuffle intrinsics: Faster Parallel Reductions on Kepler.
Update for choosing number of threads:
You might not want to use the maximum number of threads if it results in a less efficient use of the registers. From the link on occupancy :
For purposes of calculating occupancy, the number of registers used by each thread is one of the key factors. For example, devices with compute capability 1.1 have 8,192 32-bit registers per multiprocessor and can have a maximum of 768 simultaneous threads resident (24 warps x 32 threads per warp). This means that in one of these devices, for a multiprocessor to have 100% occupancy, each thread can use at most 10 registers. However, this approach of determining how register count affects occupancy does not take into account the register allocation granularity. For example, on a device of compute capability 1.1, a kernel with 128-thread blocks using 12 registers per thread results in an occupancy of 83% with 5 active 128-thread blocks per multi-processor, whereas a kernel with 256-thread blocks using the same 12 registers per thread results in an occupancy of 66% because only two 256-thread blocks can reside on a multiprocessor.
So the way I understand it is that an increased number of threads has the potential to limit performance because of the way the registers can be allocated. However, this is not always the case, and you need to do the calculation (as in the above statement) yourself to determine the optimal number of threads per block.
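The arithmetic in the quoted passage can be reproduced with a short host-side sketch in plain C++. The struct Occupancy and the function estimate_occupancy are hypothetical names, and the model is deliberately simplified: it assumes that only the register file and the resident-thread limit matter, and it ignores register allocation granularity, shared memory and the per-multiprocessor block limit. It is not a substitute for the occupancy API.

```cpp
#include <algorithm>

struct Occupancy {
    int resident_blocks;  // blocks that fit on one multiprocessor
    double fraction;      // resident threads / maximum resident threads
};

// Simplified estimate (assumption): limited only by registers and by the
// maximum number of resident threads per multiprocessor.
Occupancy estimate_occupancy(int regs_per_sm, int max_threads_per_sm,
                             int threads_per_block, int regs_per_thread) {
    int by_regs = regs_per_sm / (threads_per_block * regs_per_thread);
    int by_threads = max_threads_per_sm / threads_per_block;
    int blocks = std::min(by_regs, by_threads);
    double fraction =
        static_cast<double>(blocks * threads_per_block) / max_threads_per_sm;
    return {blocks, fraction};
}
```

With the compute capability 1.1 figures from the quote (8,192 registers, 768 resident threads) and 12 registers per thread, 128-thread blocks give 5 resident blocks and about 83% occupancy, while 256-thread blocks give 2 resident blocks and about 67%.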

CUDA: Isn't Mark Harris's parallel reduction sample just summing each thread block?

Link to his slides:
Here's his code for the first version of parallel reduction:
__global__ void reduce0(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();
    // do reduction in shared mem
    for(unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
which he later optimizes. How is this not just summing all of the ints for each thread block and placing the answer in another vector? Is that what it's meant to do? Isn't *g_odata a vector itself since it's placing the sum at each "blockIdx.x" point in the vector? How do you get the vector g_idata to sum to one single number?
How is this not just summing all of the ints for each thread block and placing the answer in another vector?
It is doing exactly that.
Is that what it's meant to do?
Yes.
Isn't g_odata a vector itself since it's placing the sum at each "blockIdx.x" point in the vector?
Yes, it is the vector containing the block-level sums.
How do you get the vector g_idata to sum to one single number?
Call the kernel twice: once on the original data set, and once on the vector output from the previous call (the block-level sums). Note that this second step uses only a single block and requires that you can launch enough threads per block to cover the entire vector, one thread per sum from the previous step. If you review the CUDA sample code that accompanies the presentation you linked, you will find such a calling sequence, for example at lines 304 and 333 of reduction.cpp. The second call to reduce<T> performs the reduction that sums the partial block sums, as indicated in the comment on line 324:
304:reduce<T>(n, numThreads, numBlocks, whichKernel, d_idata, d_odata);
    // check if kernel execution generated an error
    getLastCudaError("Kernel execution failed");
    if (cpuFinalReduction)
    {
        // sum partial sums from each block on CPU
        // copy result from device to host
        checkCudaErrors(cudaMemcpy(h_odata, d_odata, numBlocks*sizeof(T), cudaMemcpyDeviceToHost));
        for (int i=0; i<numBlocks; i++)
        {
            gpu_result += h_odata[i];
        }
        needReadBack = false;
    }
    else
    {
324:    // sum partial block sums on GPU
        int s=numBlocks;
        int kernel = whichKernel;
        while (s > cpuFinalThreshold)
        {
            int threads = 0, blocks = 0;
            getNumBlocksAndThreads(kernel, s, maxBlocks, maxThreads, blocks, threads);
333:        reduce<T>(s, threads, blocks, kernel, d_odata, d_odata);
Note that the output d_odata from the first reduction at line 304 is passed as the input to the second reduction on line 333.
Also note that the need for this kernel decomposition, and the method itself, are covered on slides 3 - 5 of the presentation you linked.
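The repeated-launch idea can be sketched on the host in plain C++. This is an emulation under my own assumptions, not the sample's code: reduce_pass and reduce_all are hypothetical names, each call to reduce_pass stands in for one kernel launch that writes one partial sum per block, and the loop keeps feeding the partial sums back in until a single value is left.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates one kernel launch: each "block" sums block_size consecutive
// inputs and writes a single partial sum.
std::vector<long long> reduce_pass(const std::vector<long long>& in,
                                   std::size_t block_size) {
    std::vector<long long> out((in.size() + block_size - 1) / block_size, 0);
    for (std::size_t i = 0; i < in.size(); ++i) out[i / block_size] += in[i];
    return out;
}

// Repeats passes on the previous output until one value remains.
// Optionally reports how many passes ("launches") were needed.
long long reduce_all(std::vector<long long> data, std::size_t block_size,
                     int* passes = nullptr) {
    assert(block_size >= 2);  // a block of 1 would never shrink the data
    int count = 0;
    while (data.size() > 1) {
        data = reduce_pass(data, block_size);
        ++count;
    }
    if (passes) *passes = count;
    return data.empty() ? 0 : data[0];
}
```

For 1000 inputs and a block size of 128, the first pass leaves 8 partial sums and the second pass leaves the final total, which mirrors the two calls described above.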

CUDA combining thread independent(??) variables during execution

Guys, I apologize if the title is confusing. I thought long and hard and couldn't come up with a proper way to phrase the question in a single line, so here's more detail. I am doing a basic image subtraction where the second image has been modified, and I need to find the ratio of how much change was done to the image. For this I used the following code. Both images are 128x1024.
for(int i = 0; i < 128; i++)
{
    for(int j = 0; j < 1024; j++)
    {
        den++;
        diff[i * 1024 + j] = orig[i * 1024 + j] - modified[i * 1024 + j];
        if(diff[i * 1024 + j] < error)
        {
            num++;
        }
    }
}
ratio = num/den;
The above code works fine on the CPU but I want to try to do this on CUDA. For this I can setup CUDA to do the basic subtraction of the images (code below) but I can't figure out how to do the conditional if statement to get my ratio out.
__global__ void calcRatio(float *orig, float *modified, int size, float *result)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if(index < size)
        result[index] = orig[index] - modified[index];
}
So, up to this point it works, but I cannot figure out how to parallelize the num and den counters across the threads to calculate the ratio once all the threads have finished. To me it feels like the num and den counters are independent of the threads, as every time I have tried to use them it seems they get incremented only once.
Any help will be appreciated as I am just starting out in CUDA and every example I see online never seems to apply to what I need to do.
EDIT: Fixed my naive code. Forgot to type one of the main condition in the code. It was a long long day.
for(int i = 0; i < 128; i++)
{
    for(int j = 0; j < 1024; j++)
    {
        if(modified[i * 1024 + j] < 400.0) //400.0 threshold value to ignore noise
        {
            den++;
            diff[i * 1024 + j] = orig[i * 1024 + j] - modified[i * 1024 + j];
            if(diff[i * 1024 + j] < error)
            {
                num++;
            }
        }
    }
}
ratio = num/den;
The operation you need to perform a global summation across all the threads is known as a "parallel reduction". While you could use atomic operations to do this, I would not recommend it. The CUDA SDK contains a reduction kernel and a very good paper discussing the technique; it is worth reading.
If I were writing code to do what you want, it would probably look like this:
template <int blocksize>
__global__ void calcRatio(float *orig, float *modified, int size, float *result,
                          int *count, const float error)
{
    __shared__ volatile int buff[blocksize];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    // Each thread handles several entries and keeps a private count
    int localCount = 0;
    for(int i=index; i<size; i+=stride) {
        float val = orig[i] - modified[i];
        localCount += (val < error);
        result[i] = val;
    }
    buff[threadIdx.x] = localCount;
    __syncthreads();
    // Parallel reduction in shared memory using 1 warp
    if (threadIdx.x < warpSize) {
        for(int i=threadIdx.x + warpSize; i<blocksize; i+= warpSize) {
            buff[threadIdx.x] += buff[i];
        }
        if (threadIdx.x < 16) buff[threadIdx.x] +=buff[threadIdx.x + 16];
        if (threadIdx.x < 8) buff[threadIdx.x] +=buff[threadIdx.x + 8];
        if (threadIdx.x < 4) buff[threadIdx.x] +=buff[threadIdx.x + 4];
        if (threadIdx.x < 2) buff[threadIdx.x] +=buff[threadIdx.x + 2];
        // One partial count per block
        if (threadIdx.x == 0) count[blockIdx.x] = buff[0] + buff[1];
    }
}
The first stanza does what your serial code does - computes a difference and a thread local total of elements which are less than error. Note I have written this version so that each thread is designed to process more than one entry of the input data. This has been done to help offset the computational cost of the parallel reduction that follows, and the idea is that you would use fewer blocks and threads than there were input data set entries.
The second stanza is the reduction itself, done in shared memory. It is effectively a "tree like" operation where the set of thread local subtotals within a single block of threads is first summed down to 32 subtotals, then the subtotals are combined until there is a final subtotal for the block, and that is then stored as the total for the block. You will wind up with a small list of subtotals in count, one for each block you launched, which can be copied back to the host and the final result you need calculated there.
Please note I coded this in the browser and haven't compiled it, there might be errors, but it should give an idea about how an "advanced" version of what you are trying to do would work.
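The two stages described above can be checked against each other with a host-side sketch in plain C++. This is an emulation under stated assumptions, not the kernel itself: strided_counts and ratio_below_error are hypothetical names, total_threads stands for blockDim.x * gridDim.x, both inputs are assumed to have the same length, and the denominator is assumed to be the total number of elements (as in the first version of the serial code in the question).

```cpp
#include <cstddef>
#include <vector>

// Emulates the first stanza: each of total_threads workers walks the input
// with a stride of total_threads and keeps a private count.
std::vector<int> strided_counts(const std::vector<float>& orig,
                                const std::vector<float>& modified,
                                float error, std::size_t total_threads) {
    std::vector<int> counts(total_threads, 0);
    for (std::size_t t = 0; t < total_threads; ++t) {
        for (std::size_t i = t; i < orig.size(); i += total_threads) {
            counts[t] += (orig[i] - modified[i] < error) ? 1 : 0;
        }
    }
    return counts;
}

// Final step: add up the partial counts and divide by the element count.
double ratio_below_error(const std::vector<float>& orig,
                         const std::vector<float>& modified,
                         float error, std::size_t total_threads) {
    if (orig.empty()) return 0.0;
    long long num = 0;
    for (int c : strided_counts(orig, modified, error, total_threads)) num += c;
    return static_cast<double>(num) / static_cast<double>(orig.size());
}
```

The result does not depend on total_threads, which is the point: the partial counts can be produced in any order and combined afterwards, so no counter has to be shared between threads while the data is being scanned.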
The denominator is pretty simple, since it's just the size.
The numerator is more troublesome, since its value for a given thread depends on all previous values. You're going to have to do that operation serially.
The thing you're looking for is probably atomicAdd. It's very slow, though.
I think you'd find this question relevant. Your num is basically global data.
CUDA array-to-array sum
Alternatively, you could dump the results of the error check into an array. Counting the results could then be parallelized. It would be a little tricky, but I think something like this would scale up: http://tekpool.wordpress.com/2006/09/25/bit-count-parallel-counting-mit-hakmem/

CUDA counting, reduction and thread warps

I'm trying to create a CUDA program that counts the number of true values (defined as non-zero values) in a long vector through a reduction algorithm. I'm getting funny results. I get either 0 or (ceil(N/threadsPerBlock)*threadsPerBlock), neither of which is correct.
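__global__ void count_reduce_logical(int * l, int * cntl, int N){
    // suml is assumed to be blockDim.x long and hold the partial counts
    __shared__ int cache[threadsPerBlock];
    int cidx = threadIdx.x;
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    int cnt_tmp=0;
    while(tid<N){
        if(l[tid]!=0)
            cnt_tmp++;
        tid+=blockDim.x*gridDim.x;
    }
    cache[cidx]=cnt_tmp;
    __syncthreads();
    int k =blockDim.x/2;
    while(k!=0){
        if(threadIdx.x<k)
            cache[cidx] += cache[cidx];
        __syncthreads();
        k/=2;
    }
    if(cidx==0)
        cntl[blockIdx.x] = cache[0];
}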
The host code then collects the cntl results and finishes summation. This is going to be part of a larger project where the data is already on the GPU, so it makes sense to do the computations there, if they work correctly.
You can count the nonzero values with a single line of code using Thrust. Here's a code snippet that counts the number of 1s in a device_vector.
#include <thrust/count.h>
#include <thrust/device_vector.h>
// put three 1s in a device_vector
thrust::device_vector<int> vec(5,0);
vec[1] = 1;
vec[3] = 1;
vec[4] = 1;
// count the 1s
int result = thrust::count(vec.begin(), vec.end(), 1);
// result == 3
If your data does not live inside a device_vector you can still use thrust::count by wrapping the raw pointers in thrust::device_ptr (for example with thrust::device_pointer_cast).
In your reduction you're doing:
cache[cidx] += cache[cidx];
Don't you want to be poking at the other half of the block's local values?
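A host-side sketch in plain C++ shows what the self-add does compared with adding the partner element. The function block_reduce and its use_partner flag are hypothetical, the vector stands in for one block's cache[], and its size is assumed to be a power of two. With the self-add, cache[0] is simply doubled once per halving step, so a block returns cache[0] multiplied by the block size. That would be consistent with the reported symptom of getting either 0 or a multiple of threadsPerBlock, though I have not run the original kernel.

```cpp
#include <cstddef>
#include <vector>

// Emulates the halving loop for one block.
// use_partner == false reproduces the line from the question
// (cache[cidx] += cache[cidx]); true adds the element k slots away.
int block_reduce(std::vector<int> cache, bool use_partner) {
    for (std::size_t k = cache.size() / 2; k != 0; k /= 2) {
        // Each iteration of this inner loop is one "thread" with index < k.
        for (std::size_t cidx = 0; cidx < k; ++cidx) {
            cache[cidx] += use_partner ? cache[cidx + k] : cache[cidx];
        }
    }
    return cache.empty() ? 0 : cache[0];
}
```

For the per-thread counts {1, 0, 1, 1, 0, 0, 1, 0}, the partner version returns 4, while the self-add version returns 8 (the first entry times the block size). If the first entry is 0, the self-add version returns 0 regardless of the other entries.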