I have the following "Frankenstein" sum reduction code, taken partly from the common CUDA reduction slides and partly from the CUDA samples.
__global__ void reduce6(float *g_idata, float *g_odata, unsigned int n)
{
extern __shared__ float sdata[];
// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
float mySum = 0;
while (i < n) {
sdata[tid] += g_idata[i] + g_idata[i+MAXTREADS];
i += gridSize;
}
__syncthreads();
// do reduction in shared mem
if (tid < 256)
sdata[tid] += sdata[tid + 256];
__syncthreads();
if (tid < 128)
sdata[tid] += sdata[tid + 128];
__syncthreads();
if (tid < 64)
sdata[tid] += sdata[tid + 64];
__syncthreads();
#if (__CUDA_ARCH__ >= 300 )
if ( tid < 32 )
{
// Fetch final intermediate sum from 2nd warp
mySum = sdata[tid]+ sdata[tid + 32];
// Reduce final warp using shuffle
for (int offset = warpSize/2; offset > 0; offset /= 2)
mySum += __shfl_down(mySum, offset);
}
sdata[0]=mySum;
#else
// fully unroll reduction within a single warp
if (tid < 32) {
sdata[tid] += sdata[tid + 32];
sdata[tid] += sdata[tid + 16];
sdata[tid] += sdata[tid + 8];
sdata[tid] += sdata[tid + 4];
sdata[tid] += sdata[tid + 2];
sdata[tid] += sdata[tid + 1];
}
#endif
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
I will be using this to reduce a large unrolled array (e.g. 512^3 = 134217728 = n) on a Tesla K40 GPU.
I have some questions regarding the blockSize variable, and its value.
From here on, I will try to explain my understanding (either right or wrong) on how it works:
The bigger I choose blockSize, the faster this code will execute, because it spends less time in the loop. However, it will not finish reducing the whole array; it will return a smaller array of size dimBlock.x, right? If I use blockSize=1, this code would return the reduced value in one call, but it would be really slow because it exploits almost none of the power of CUDA. Therefore I need to call the reduction kernel several times, each time with a smaller blockSize, reducing the result of the previous call, until I reach the smallest size.
Something like (pseudocode):
blocks=number; //where do we start? why?
while(not the min){
dim3 dimBlock( blocks );
dim3 dimGrid(n/dimBlock.x);
int smemSize = dimBlock.x * sizeof(float);
reduce6<<<dimGrid, dimBlock, smemSize>>>(in, out, n);
in=out;
n=dimGrid.x;
dimGrid.x=n/dimBlock.x; // is this right? Should I also change dimBlock?
}
At which value should I start? I guess this is GPU dependent. Which values should it be for a Tesla K40 (just so I understand how these values are chosen)?
Is my logic somehow flawed? How?
There is a CUDA tool that finds good grid and block sizes for you: the CUDA Occupancy API.
In response to "The bigger I choose blockSize, the faster this code will execute" -- Not necessarily, as you want the sizes which give max occupancy (the ratio of active warps to the total number of possible active warps).
See this answer for additional information: How do I choose grid and block dimensions for CUDA kernels?
Lastly, for Nvidia GPUs of the Kepler generation or later, there are shuffle intrinsics that make reductions easier and faster. Here is an article on how to use the shuffle intrinsics: Faster Parallel Reductions on Kepler.
Update for choosing number of threads:
You might not want to use the maximum number of threads if it results in a less efficient use of the registers. From the link on occupancy:
For purposes of calculating occupancy, the number of registers used by each thread is one of the key factors. For example, devices with compute capability 1.1 have 8,192 32-bit registers per multiprocessor and can have a maximum of 768 simultaneous threads resident (24 warps x 32 threads per warp). This means that in one of these devices, for a multiprocessor to have 100% occupancy, each thread can use at most 10 registers. However, this approach of determining how register count affects occupancy does not take into account the register allocation granularity. For example, on a device of compute capability 1.1, a kernel with 128-thread blocks using 12 registers per thread results in an occupancy of 83% with 5 active 128-thread blocks per multi-processor, whereas a kernel with 256-thread blocks using the same 12 registers per thread results in an occupancy of 66% because only two 256-thread blocks can reside on a multiprocessor.
So the way I understand it is that an increased number of threads has the potential to limit performance because of the way the registers can be allocated. However, this is not always the case, and you need to do the calculation (as in the above statement) yourself to determine the optimal number of threads per block.
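The calculation in the quoted passage can be sketched on the host as follows. This is a simplified model under stated assumptions: it only counts whole blocks against the register file, the thread limit and the block limit of one multiprocessor, and it ignores shared memory and finer-grained register allocation granularity. The names SmLimits, residentBlocks and occupancyPercent are made up for this illustration, and the limit of 8 blocks per multiprocessor is assumed for compute capability 1.x.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits (hypothetical struct for illustration).
struct SmLimits {
    int registersPerSm;   // e.g. 8192 on compute capability 1.1
    int maxThreadsPerSm;  // e.g. 768 on compute capability 1.1
    int maxBlocksPerSm;   // assumed to be 8 on compute capability 1.x
};

// Number of whole blocks that can be resident on one SM at the same time.
int residentBlocks(const SmLimits& sm, int threadsPerBlock, int regsPerThread) {
    assert(threadsPerBlock > 0 && regsPerThread > 0);
    int byRegisters = sm.registersPerSm / (regsPerThread * threadsPerBlock);
    int byThreads   = sm.maxThreadsPerSm / threadsPerBlock;
    return std::min({byRegisters, byThreads, sm.maxBlocksPerSm});
}

// Occupancy as a percentage of the SM's thread capacity, rounded down.
int occupancyPercent(const SmLimits& sm, int threadsPerBlock, int regsPerThread) {
    int activeThreads = residentBlocks(sm, threadsPerBlock, regsPerThread) * threadsPerBlock;
    return 100 * activeThreads / sm.maxThreadsPerSm;
}
```

With the numbers from the quote (8,192 registers, 768 threads, 12 registers per thread), this model gives 5 resident blocks for 128-thread blocks and 2 resident blocks for 256-thread blocks, which matches the 83% and 66% figures above.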
I am currently trying to use the Reduction #3 outline in the CUDA PDF here.
Here is how my reduction function looks:
template <typename T>
__device__ void offsetReduction(planet<T> *bodies, T *outdata, int arrayIdent, int nbodies){
extern __shared__ T sdata[];
unsigned int tID = threadIdx.x;
unsigned int i = tID + blockIdx.x * blockDim.x;
if (arrayIdent == 1){
if (i < nbodies){
sdata[tID] = bodies[i].vx * bodies[i].mass;
}
__syncthreads();
}
if (arrayIdent == 2){
if (i < nbodies){
sdata[tID] = (bodies[i].vy * bodies[i].mass);
}
__syncthreads();
}
if (arrayIdent == 3){
if (i < nbodies){
sdata[tID] = (bodies[i].vz * bodies[i].mass);
}
__syncthreads();
}
for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>=1)
{
if (tID < stride)
{
sdata[tID] += sdata[tID + stride];
}
__syncthreads();
}
if (tID == 0)
{
outdata[blockIdx.x] = sdata[0];
}
}
However, it doesn't seem to be working correctly so I did some calculations.
I launch the same number of threads as 'int nbodies', and in my case I have chosen 5. So each of the 5 threads comes in and adds a value to sdata[] no problem. However once it gets to the addition part it goes wrong.
On the first iteration Thread 0 accesses sdata[3], Thread 1 accesses sdata[4] and the other threads do nothing. On the second iteration Thread 0 accesses sdata[1] and the other threads do nothing. The addition is then 'finished' and the kernel finishes. But sdata[2] is never added so I get an incorrect value stored at sdata[0].
Am I missing something really obvious? (I have been staring at this for a while so I probably have.)
This reduction code, like any other "tree-like" reduction operation, requires that the number of threads that participate in the shared memory reduction be equal to a power of 2 to work correctly.
Note that means you could design a reduction kernel which would run correctly for any multiple of 2 threads per block by having the nearest smaller power of 2 threads perform the actual reduction. The code you have posted cannot, however, work like that.
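To make the power-of-2 requirement concrete, here is a small host-side emulation of the shared-memory loop from the question. It is only a sketch: treeReduce, nextPow2 and paddedTreeReduce are hypothetical names, and the "threads" are run one after another on the CPU. In this emulation, with five threads the first stride is 5 / 2 = 2, so the last element, sdata[4], is never added. One common workaround, assumed here rather than taken from the posted code, is to pad the buffer with zeros up to the next power of two.

```cpp
#include <cassert>
#include <vector>

// Emulates the reduction loop: every "thread" tID < stride adds
// sdata[tID + stride] into sdata[tID], then the stride is halved.
float treeReduce(std::vector<float> sdata, unsigned int blockDim) {
    for (unsigned int stride = blockDim / 2; stride > 0; stride >>= 1) {
        for (unsigned int tID = 0; tID < stride; ++tID) {
            sdata[tID] += sdata[tID + stride];
        }
    }
    return sdata[0];
}

// Smallest power of two that is >= n (returns 1 for n == 0).
unsigned int nextPow2(unsigned int n) {
    unsigned int p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Pads with zeros (the identity for addition) up to a power of two,
// then runs the same tree reduction.
float paddedTreeReduce(std::vector<float> values) {
    unsigned int padded = nextPow2(static_cast<unsigned int>(values.size()));
    values.resize(padded, 0.0f);
    return treeReduce(values, padded);
}
```

For the input 1, 2, 3, 4, 5 the unpadded loop with blockDim 5 yields 10 instead of 15, while the padded version sums all five values.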
Link to his slides:
http://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf
Here's his code for the first version of parallel reduction:
__global__ void reduce0(int *g_idata, int *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
// do reduction in shared mem
for(unsigned int s=1; s < blockDim.x; s *= 2) {
if (tid % (2*s) == 0) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
which he later optimizes. How is this not just summing all of the ints for each thread block and placing the answer in another vector? Is that what it's meant to do? Isn't *g_odata a vector itself since it's placing the sum at each "blockIdx.x" point in the vector? How do you get the vector g_idata to sum to one single number?
How is this not just summing all of the ints for each thread block and placing the answer in another vector?
It is doing exactly that.
Is that what it's meant to do?
Yes.
Isn't g_odata a vector itself since it's placing the sum at each "blockIdx.x" point in the vector?
Yes, it is the vector containing the block-level sums.
How do you get the vector g_idata to sum to one single number?
Call the kernel twice. Once on the original data set, and once on the vector output from the previous call (the block-level sums). Note that this second step uses only a single block and requires that you can launch enough threads per block to cover the entire vector, one thread per sum from the previous step. If you review the cuda sample code that is intended to accompany that presentation that you linked, you will find such a calling sequence, for example at lines 304 and 333 of reduction.cpp. The second call to reduce<T> performs the reduction that sums the partial block sums, as indicated in the comment on line 324:
304:reduce<T>(n, numThreads, numBlocks, whichKernel, d_idata, d_odata);
// check if kernel execution generated an error
getLastCudaError("Kernel execution failed");
if (cpuFinalReduction)
{
// sum partial sums from each block on CPU
// copy result from device to host
checkCudaErrors(cudaMemcpy(h_odata, d_odata, numBlocks*sizeof(T), cudaMemcpyDeviceToHost));
for (int i=0; i<numBlocks; i++)
{
gpu_result += h_odata[i];
}
needReadBack = false;
}
else
{
324: // sum partial block sums on GPU
int s=numBlocks;
int kernel = whichKernel;
while (s > cpuFinalThreshold)
{
int threads = 0, blocks = 0;
getNumBlocksAndThreads(kernel, s, maxBlocks, maxThreads, blocks, threads);
333: reduce<T>(s, threads, blocks, kernel, d_odata, d_odata);
Note that the output d_odata from the first reduction at line 304 is passed as the input to the second reduction on line 333.
Also note that the necessity for, and this method of, kernel decomposition is covered in the presentation you linked, on slides 3 - 5.
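The calling sequence can be emulated on the host to see why repeated passes end in a single number. This is a sketch only: reducePass stands in for one kernel launch in which each block writes one partial sum, and reduceAll feeds each output back in as the next input. Both names are hypothetical and are not part of the CUDA sample.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One "kernel launch": each block of blockSize consecutive elements
// produces one partial sum, like g_odata[blockIdx.x] in reduce0.
std::vector<long long> reducePass(const std::vector<long long>& in, std::size_t blockSize) {
    std::size_t numBlocks = (in.size() + blockSize - 1) / blockSize;
    std::vector<long long> out(numBlocks, 0);
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i / blockSize] += in[i];
    }
    return out;
}

// Repeats passes, using each output as the next input,
// until a single value remains.
long long reduceAll(std::vector<long long> data, std::size_t blockSize) {
    assert(blockSize >= 2);  // with blockSize 1 the vector would never shrink
    if (data.empty()) return 0;
    while (data.size() > 1) {
        data = reducePass(data, blockSize);
    }
    return data[0];
}
```

With 1000 inputs and a block size of 128, the first pass leaves 8 partial sums and the second pass reduces those 8 values in a single block.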
I am implementing an integral image calculation module using CUDA to improve performance.
But it is slower than the CPU module.
Please let me know what I did wrong.
The CUDA kernels and host code follow.
There is also another problem:
In the kernel SumH, using texture memory is slower than using global memory. imageTexture was defined as below.
texture<unsigned char, 1> imageTexture;
cudaBindTexture(0, imageTexture, pbImage);
// kernels to scan the image horizontally and vertically.
__global__ void SumH(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rVSpan, int nWidth)
{
int nStartY, nEndY, nIdx;
if (!threadIdx.x)
{
nStartY = 1;
}
else
nStartY = (int)(threadIdx.x * rVSpan);
nEndY = (int)((threadIdx.x + 1) * rVSpan);
for (int i = nStartY; i < nEndY; i ++)
{
for (int j = 1; j < nWidth; j ++)
{
nIdx = i * nWidth + j;
pnIntImage[nIdx] = pnIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i];
pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i] * pbImage[nIdx - nWidth - i];
//pnIntImage[nIdx] = pnIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i);
//pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i) * tex1Dfetch(imageTexture, nIdx - nWidth - i);
}
}
}
__global__ void SumV(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rHSpan, int nHeight, int nWidth)
{
int nStartX, nEndX, nIdx;
if (!threadIdx.x)
{
nStartX = 1;
}
else
nStartX = (int)(threadIdx.x * rHSpan);
nEndX = (int)((threadIdx.x + 1) * rHSpan);
for (int i = 1; i < nHeight; i ++)
{
for (int j = nStartX; j < nEndX; j ++)
{
nIdx = i * nWidth + j;
pnIntImage[nIdx] = pnIntImage[nIdx - nWidth] + pnIntImage[nIdx];
pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - nWidth] + pn64SqrIntImage[nIdx];
}
}
}
// host code
int nW = image_width;
int nH = image_height;
unsigned char* pbImage;
int* pnIntImage;
__int64* pn64SqrIntImage;
cudaMallocManaged(&pbImage, nH * nW);
// assign image gray values to pbimage
cudaMallocManaged(&pnIntImage, sizeof(int) * (nH + 1) * (nW + 1));
cudaMallocManaged(&pn64SqrIntImage, sizeof(__int64) * (nH + 1) * (nW + 1));
float rHSpan, rVSpan;
int nHThreadNum, nVThreadNum;
if (nW + 1 <= 1024)
{
rHSpan = 1;
nVThreadNum = nW + 1;
}
else
{
rHSpan = (float)(nW + 1) / 1024;
nVThreadNum = 1024;
}
if (nH + 1 <= 1024)
{
rVSpan = 1;
nHThreadNum = nH + 1;
}
else
{
rVSpan = (float)(nH + 1) / 1024;
nHThreadNum = 1024;
}
SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaDeviceSynchronize();
SumV<<<1, nVThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rHSpan, nH + 1, nW + 1);
cudaDeviceSynchronize();
Regarding the code that is currently in the question, there are two things I'd like to mention: launch parameters and timing methodology.
1) Launch parameters
When you launch a kernel, there are two main arguments that specify the number of threads you are launching. These go between <<< and >>>, and are the number of blocks in the grid and the number of threads per block, as follows:
foo <<< numBlocks, numThreadsPerBlock >>> (args);
For a single kernel to be efficient on a current GPU, you can use the rule of thumb that numBlocks * numThreadsPerBlock should be at least 10,000, i.e. 10,000 pieces of work. This is a rule of thumb, so you may get good results with only 5,000 threads (it varies with GPU: cheaper GPUs can get away with fewer threads), but this is the order of magnitude you need to be looking at as a minimum. You are running 1024 threads. This is almost certainly not enough. (Hint: the loops inside your kernel look like scan primitives, which can be done in parallel.)
Further to this there are a few other things to consider.
The number of blocks should be large in comparison to the number of SMs on your GPU. A Kepler K40 has 15 SMs, and to avoid a significant tail effect you'd probably want at least ~100 blocks on this GPU. Other GPUs have fewer SMs, but you haven't specified which you have, so I can't be more specific.
The number of threads per block should not be too small. You can only have so many blocks on each SM, so if your blocks are too small you will use the GPU suboptimally. Furthermore, on newer GPUs up to four warps can receive instructions on an SM simultaneously, and as such it is often a good idea to have block sizes that are multiples of 128.
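The tail effect mentioned above can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that all blocks take the same time and that the GPU can hold a fixed number of blocks at once ("slots", for example 15 if each of the 15 SMs runs one block at a time). The function names are hypothetical.

```cpp
#include <cassert>

// Number of "waves" needed to run numBlocks blocks when at most
// slots blocks can be resident on the whole GPU at once.
int wavesNeeded(int numBlocks, int slots) {
    assert(numBlocks > 0 && slots > 0);
    return (numBlocks + slots - 1) / slots;
}

// Percentage of block slots doing useful work, averaged over all waves
// and rounded down. A partially filled last wave lowers this number.
int slotUtilizationPercent(int numBlocks, int slots) {
    return 100 * numBlocks / (wavesNeeded(numBlocks, slots) * slots);
}
```

Under these assumptions, 16 blocks on 15 slots need two waves and use only about half of the available slots, whereas 100 blocks need seven waves and keep roughly 95% of the slots busy.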
2) Timing
I'm not going to go into so much depth here, but make sure your timing is sane. GPU code tends to have a one-time initialisation delay. If this is within your timing, you will see erroneously large runtimes for codes designed to represent a much larger code. Similarly, data transfer between the CPU and GPU takes time. In a real application you may only do this once for thousands of kernel calls, but in a test application you may do it once per kernel launch.
If you want to get accurate timings, you must either make your example more representative of the final code or be sure that you are only timing the regions that will be repeated.
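One way to keep the one-time initialisation out of the measurement is to run the work a few times before starting the clock. The helper below is a host-side sketch with a hypothetical name (averageMs). It assumes the callable passed in does not return until the work is finished, so for a kernel launch it would need to include a synchronisation such as the cudaDeviceSynchronize() call used in the question.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock time of one call to work(), in milliseconds.
// The warm-up calls absorb one-time initialisation and are not timed.
double averageMs(const std::function<void()>& work, int warmupRuns, int timedRuns) {
    assert(timedRuns > 0);
    for (int i = 0; i < warmupRuns; ++i) work();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timedRuns; ++i) work();
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / timedRuns;
}
```

Whether host-to-device copies belong inside the timed callable depends on whether the final application repeats them for every kernel call.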
The only way to be sure is to profile the code, but in this case we can probably make a reasonable guess.
You're basically just doing a single scan through some data, and doing extremely minimal processing on each item.
Given how little processing you're doing on each item, the bottleneck when you process the data with the CPU is probably just reading the data from memory.
When you do the processing on the GPU, the data still needs to be read from memory and copied into the GPU's memory. That means we still have to read all the data from main memory, just like if the CPU did the processing. Worse, it all has to be written to the GPU's memory, causing a further slowdown. By the time the GPU even gets to start doing real processing, you've already used up more time than it would have taken the CPU to finish the job.
For Cuda to make sense, you generally need to be doing a lot more processing on each individual data item. In this case, the CPU is probably already nearly idle most of the time, waiting for data from memory. In such a case, the GPU is unlikely to be of much help unless the input data was already in the GPU's memory so the GPU could do the processing without any extra copying.
When working with CUDA there are a few things you should keep in mind.
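Copying from host memory to device memory is 'slow': when you copy some data from the host to the device, you should do as many calculations as possible (do all the work) before you copy it back to the host.
On the device there are 3 kinds of storage to keep in mind: global memory, shared memory and registers. You can rank them in speed like global < shared < registers (registers = fastest). Note that what CUDA calls "local memory" is per-thread memory that resides in device memory, so it is roughly as slow as global memory.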
Reading from consecutive memory blocks is faster than random access. When working with array of structures you would like to transpose it to a structure of arrays.
You can always consult the CUDA Visual Profiler to show you the bottleneck of your program.
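The array-of-structures to structure-of-arrays transposition mentioned above can be sketched on the host like this. The Body and Bodies types and their fields are hypothetical and chosen only for illustration. The point is that after the transposition each field is contiguous, so neighbouring threads that read the same field touch neighbouring addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures element (hypothetical layout for illustration).
struct Body {
    float vx, vy, vz, mass;
};

// Structure-of-arrays layout: each field is stored contiguously.
struct Bodies {
    std::vector<float> vx, vy, vz, mass;
};

// Copies every field of every element into its own contiguous array.
Bodies toStructureOfArrays(const std::vector<Body>& aos) {
    Bodies soa;
    soa.vx.reserve(aos.size());
    soa.vy.reserve(aos.size());
    soa.vz.reserve(aos.size());
    soa.mass.reserve(aos.size());
    for (const Body& b : aos) {
        soa.vx.push_back(b.vx);
        soa.vy.push_back(b.vy);
        soa.vz.push_back(b.vz);
        soa.mass.push_back(b.mass);
    }
    return soa;
}
```

Each of the four arrays could then be copied to the device separately and indexed by the global thread index.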
The above-mentioned GTX 750 has 512 CUDA cores (these are the same as the shader units, just driven in a different mode).
http://www.nvidia.de/object/geforce-gtx-750-de.html#pdpContent=2
The task of creating integral images can only be partially parallelized, because each value in the result array depends on a large number of its predecessors. Furthermore, there is only a tiny amount of math per memory transfer, so the unavoidable memory transfers, rather than ALU throughput, are likely to be the bottleneck. Such an accelerator might provide some speedup, but not a thrilling one, because the task itself does not allow it.
If you computed multiple variations of integral images on the same input data, you would be much more likely to see the "thrill", thanks to the greater scope for parallelism and the larger number of math operations. But that would be a different task.
As a wild guess from a Google search, others have already worked on this: http://dspace.mit.edu/openaccess-disseminate/1721.1/71883
Hi, I recently got a CUDA kernel to optimize. Here is the original CUDA kernel:
__global__ void kernel_base( float *data, int x_dim, int y_dim )
{
int ix = blockIdx.x;
int iy = blockIdx.y*blockDim.y + threadIdx.y;
int idx = iy*x_dim + ix;
float tmp = data[idx];
if( ix % 2 )
{
tmp += sqrtf( sinf(tmp) + 1.f );
}
else
{
tmp += sqrtf( cosf(tmp) + 1.f );
}
data[idx] = tmp;
}
dim3 block( 1, 512 );
dim3 grid( 2048/1, 2048/512 );
kernel_base<<<grid,block>>>( d_data, 2048, 2048 );
The basic problem here is the trade-off between memory coalescing and thread divergence. The original code processes the array in column-major order, so it has a strided memory access pattern but no divergence. I could change it to row-major, which in turn introduces thread divergence.
So does anyone have a better idea of how to maximize performance?
Thread divergence here isn't a big problem compared to the strided memory access, in terms of performance. I would go for coalescing. Furthermore, your data storage has an implicit AoS ordering. If you can reorder the data to SoA, you can solve both problems.
So I would reorder this kernel to first handle things in a row-major fashion. This solves the coalescing problem but introduces warp divergence.
If you're unable to re-order the data, I would then consider eliminating warp divergence by modifying the indexing scheme, so that even warps handle even elements, and odd warps handle odd elements.
This eliminates warp divergence but breaks perfect coalescing again, although the caches should help with this issue. In the case of Fermi, the L1 cache should smooth over this pattern pretty well. I would then compare this case against the warp-divergent case to see which is faster.
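One possible indexing scheme of this kind is sketched below as plain host code, so that the mapping can be checked without a GPU. It is an assumption about how "even warps handle even elements" could be realised, not the answerer's exact scheme: warps are paired, each pair covers 64 consecutive columns, and the even warp of the pair takes the even columns while the odd warp takes the odd ones. The names kWarpSize and columnForThread are hypothetical.

```cpp
#include <cassert>

const unsigned int kWarpSize = 32;

// Column handled by thread tid within a row. All 32 threads of a warp get
// columns of the same parity, so a branch on (column % 2) does not diverge
// inside a warp.
unsigned int columnForThread(unsigned int tid) {
    unsigned int warp = tid / kWarpSize;
    unsigned int lane = tid % kWarpSize;
    unsigned int pairBase = (warp / 2) * 2 * kWarpSize;  // first column of this warp pair
    return pairBase + 2 * lane + (warp % 2);
}
```

Each warp then reads every second element of a 64-element span, which is the loss of perfect coalescing that the caches are expected to soften.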
Take into account that
sin(x) = cos(x - pi/2)
Accordingly, you can replace the if ... else condition with
tmp += sqrtf( cosf(tmp - (ix%2) * pi/2) + 1.f );
avoiding branch divergence.
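The equivalence can be checked on the host with a short sketch. It relies on the identity sin(x) = cos(x - pi/2), so the shift has to be subtracted for odd ix. The function names and the constant kHalfPi are hypothetical, and the two versions agree only up to floating-point rounding.

```cpp
#include <cassert>
#include <cmath>

const float kHalfPi = 1.57079632679f;

// Reference version with the branch from the original kernel.
float updateBranching(float tmp, int ix) {
    if (ix % 2) {
        return tmp + std::sqrt(std::sin(tmp) + 1.f);
    }
    return tmp + std::sqrt(std::cos(tmp) + 1.f);
}

// Branch-free version: subtracting pi/2 from the argument turns cos into sin
// for odd ix and leaves the argument unchanged for even ix.
float updateBranchless(float tmp, int ix) {
    return tmp + std::sqrt(std::cos(tmp - (ix % 2) * kHalfPi) + 1.f);
}
```

Whether this is faster than the divergent branch would still have to be measured, since both trigonometric paths are comparatively cheap next to a strided memory access.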
If I were doing this, I would make the block sizes 16 x 16 or some other shape with a lower aspect ratio. I would use shared memory to grab 2 blocks worth of data (each idx grabs 2 elements from data, probably separated by blockDim.x elements), then have each block do its assigned "odd" rows followed by the "even" rows. You'll have to recompute ix, and iy, (and probably idx as well) and you'll use 1/2 as many blocks, but there should be coalesced memory access followed by non-divergent code.
I have some matrices of unknown size, varying from 10 to 20,000 in both directions.
I designed a CUDA kernel with (x;y) blocks and (x;y) threads.
Since the matrix width and height aren't multiples of my block dimensions, it was a terrible pain to get things to work, and the code is becoming more and more complicated in order to get coalesced memory reads.
Besides all of that, the kernel is growing in size using more and more registers to check for correctness... so I think this is not the way I should adopt.
My question is: what if I totally eliminate blocks and just create a grid of x;y threads? Will a SM unit have problems without many blocks?
Can I eliminate blocks and use a large amount of threads or is the block subdivision necessary?
You can't really just make a "grid of threads", since you have to organize threads into blocks and you can have a maximum of 512 threads per block (1024 on devices of compute capability 2.0 and later). However, you could effectively do this by using 1 thread per block, which will result in an X by Y grid of 1x1 blocks. This will result in pretty terrible performance due to several factors:
According to the CUDA Programming Guide, an SM can handle a maximum of 8 blocks at any time (newer architectures allow more). This will limit you to 8 threads per SM, which isn't enough to fill even a single warp. If you have, say, 48 SMs, you will only be able to handle 384 threads at any given time.
With only 8 threads available on a SM, there will be too few warps to hide memory latencies. The GPU will spend most of its time waiting for memory accesses to complete, rather than doing any computations.
You will be unable to coalesce memory reads and writes, resulting in poor memory bandwidth usage.
You will be effectively unable to leverage shared memory, as this is a shared resource between threads in a block.
While having to ensure correctness for threads in a block is annoying, your performance will be vastly better than your "grid of threads" idea.
Here's the code I use to divide a given task requiring num_threads into blocks and a grid. Yes, you might end up launching too many blocks (but only very few) and you will probably end up having more actual threads than required, but it's easy and efficient this way. See the second code example below for my simple in-kernel boundary check.
PS: I always have block_size == 128 because it has been a good tradeoff between multiprocessor occupancy, register usage, shared memory requirements and coalesced access for all of my kernels.
Code to calculate a good grid size (host):
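#include <assert.h>
#define GRID_SIZE 65535
//forward declaration (defined below)
unsigned int kernelUtilCeilDiv(unsigned int numerator, unsigned int denominator);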
//calculate grid size (store result in grid/block)
void kernelUtilCalcGridSize(unsigned int num_threads, unsigned int block_size, dim3* grid, dim3* block) {
//block
block->x = block_size;
block->y = 1;
block->z = 1;
//number of blocks
unsigned int num_blocks = kernelUtilCeilDiv(num_threads, block_size);
unsigned int total_threads = num_blocks * block_size;
assert(total_threads >= num_threads);
//calculate grid size
unsigned int gy = kernelUtilCeilDiv(num_blocks, GRID_SIZE);
unsigned int gx = kernelUtilCeilDiv(num_blocks, gy);
unsigned int total_blocks = gx * gy;
assert(total_blocks >= num_blocks);
//grid
grid->x = gx;
grid->y = gy;
grid->z = 1;
}
//ceil division (rounding up)
unsigned int kernelUtilCeilDiv(unsigned int numerator, unsigned int denominator) {
return (numerator + denominator - 1) / denominator;
}
Code to calculate the unique thread id and check boundaries (device):
//some kernel
__global__ void kernelFoo(unsigned int num_threads, ...) {
//calculate unique id
const unsigned int thread_id = threadIdx.x;
const unsigned int block_id = blockIdx.x + blockIdx.y * gridDim.x;
const unsigned int unique_id = thread_id + block_id * blockDim.x;
//check range
if (unique_id >= num_threads) return;
//do the actual work
...
}
I don't think that's a lot of effort/registers/lines-of-code to check for correctness.
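The arithmetic of the two snippets above can be checked on the host without a GPU. The sketch below replaces CUDA's dim3 with a plain struct (Dim3) and mirrors the grid calculation and the in-kernel index computation. The names Dim3, kMaxGridDim, ceilDiv, calcGrid and uniqueId are stand-ins introduced for this illustration.

```cpp
#include <cassert>

// Plain stand-in for CUDA's dim3 so the arithmetic can run on the host.
struct Dim3 {
    unsigned int x, y, z;
};

const unsigned int kMaxGridDim = 65535;  // same limit as GRID_SIZE above

// Integer division that rounds up.
unsigned int ceilDiv(unsigned int numerator, unsigned int denominator) {
    return (numerator + denominator - 1) / denominator;
}

// Same splitting as kernelUtilCalcGridSize: spill into grid.y once
// grid.x alone would exceed the limit.
Dim3 calcGrid(unsigned int numThreads, unsigned int blockSize) {
    assert(numThreads > 0 && blockSize > 0);
    unsigned int numBlocks = ceilDiv(numThreads, blockSize);
    unsigned int gy = ceilDiv(numBlocks, kMaxGridDim);
    unsigned int gx = ceilDiv(numBlocks, gy);
    return Dim3{gx, gy, 1};
}

// Mirrors the in-kernel index computation for one (block, thread) pair.
unsigned int uniqueId(unsigned int threadX, unsigned int blockX, unsigned int blockY,
                      const Dim3& grid, unsigned int blockSize) {
    unsigned int blockId = blockX + blockY * grid.x;
    return threadX + blockId * blockSize;
}
```

For 10,000,000 threads and a block size of 128, this gives a 39063 x 2 grid, which is one block more than the 78125 blocks strictly needed. The threads of that extra block compute ids at or beyond num_threads and return early through the range check.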