Thread indexing for image rows / columns in CUDA - c++

So I'm working on a CUDA program, and I'm experiencing some issues when it comes to indexing blocks and threads. Basically, I'm trying to implement the Pixel Sort algorithm in CUDA. (With one modification: we only process either rows or columns, not both at the same time.)
The way I was visualizing it was to simply run N blocks with 1 thread each (one per row, or per column) and have each block process that row / column independently of the others.
So if we want to sort on columns, we launch the kernel like this (there are a couple extra parameters that are only relevant to our specific processing, so I've left them out for simplicity)
pixel_sort<<<cols, 1>>>(d_image, d_imageSort, rows, cols, ...);
Then in the kernel, I access the block index with
int tid = blockIdx.x;
This allows me to work with one row's / column's data per block, but it has some issues. It's running slower than our serial implementation of the algorithm for smaller images, and crashes outright when the image size becomes too large.
An alternative thread scheme I was considering would be to map each of the image's pixels to one thread, however I have a couple of questions on this.
If we were to launch N blocks with M threads, representing N columns with M rows, how do we get around the 512 (or 1024?) limit on threads per block? Can we just have each thread process multiple pixels in the column in this case? What would the indexing look like in the kernel?
The algorithm basically requires that we work on the entire column, so each thread cannot just do some work on its own pixel; the threads have to communicate, presumably using shared memory. Would it be a valid strategy to have one "master" thread per block that does the actual sorting calculations, while all of the other threads just contribute their data to shared memory?
Other Notes:
Our image data is read in through OpenCV, and has the RGBA values stored in a uchar4 array

If you have a single thread per block, you very quickly run into thread occupancy issues. If your goal is to do a full row sort (for columns you could transpose the image before sending it to the GPU to take advantage of global memory coalescing), the fastest way that gets a decent result is probably to do a radix or merge sort on a per-row basis, essentially copying the steps from http://mgarland.org/files/papers/nvr-2008-001.pdf. You could assign k blocks of m threads each to every row such that k*m >= image width. You would then be launching k*(image height) blocks, and your grid would be of size (k, height, 1).
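To make the indexing in that scheme concrete, here is a minimal sketch (the kernel and parameter names are illustrative, not taken from your code); blockIdx.y selects the row and the k blocks along x cover its pixels:

__global__ void per_row_kernel(const uchar4 *in, uchar4 *out, int width, int height)
{
    int row = blockIdx.y;                             // one image row per grid slice in y
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // k blocks of m threads cover the width

    if (row < height && col < width)
    {
        int idx = row * width + col;   // row-major pixel index
        out[idx] = in[idx];            // placeholder for the actual per-row sorting work
    }
}

// launched roughly as:
// dim3 grid(k, height, 1);
// dim3 block(m, 1, 1);     // choose k and m so that k*m >= width
// per_row_kernel<<<grid, block>>>(d_image, d_imageSort, width, height);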
As for your specific questions:
You cannot get around the 512/1024 thread-per-block limit; you would have to restructure your algorithm.
A "master" thread would generally be poor design, causing stalls, overhead, and not taking full advantage of the many cores. You may sometimes need to utilize a single thread, say to output/broadcast a result, but mostly you want to avoid it. See the linked article for sample algorithms that mostly avoid this.


Does NPP support overlapping streams?

I'm trying to perform multiple async 2D convolutions on a single image with multiple filters using NVIDIA's NPP library method nppiFilterBorder_32f_C1R_Ctx. However, even after creating multiple streams and assigning them to NPP's method, the overlap isn't happening; NVIDIA's nvvp shows the same.
That said, I'm confused if NPP supports overlapping context operations.
Below is a simplification of my code, only showing the async method calls and related variables:
std::vector<NppStreamContext> streams(n_filters);
for (size_t stream_idx = 0; stream_idx < n_filters; stream_idx++)
{
    cudaStreamCreateWithFlags(&(streams[stream_idx].hStream), cudaStreamNonBlocking);
    streams[stream_idx].nStreamFlags = cudaStreamNonBlocking;
    // fill up NppStreamContext remaining fields
    // malloc image and filter pointers
}
for (size_t stream_idx = 0; stream_idx < n_filters; stream_idx++)
{
    cudaMemcpyAsync(..., streams[stream_idx].hStream);
    nppiFilterBorder_32f_C1R_Ctx(..., streams[stream_idx]);
    cudaMemcpy2DAsync(..., streams[stream_idx].hStream);
}
for (size_t stream_idx = 0; stream_idx < n_filters; stream_idx++)
{
    cudaStreamSynchronize(streams[stream_idx].hStream);
    cudaStreamDestroy(streams[stream_idx].hStream);
}
Note: All the device pointers of the output images and input filters are stored in a std::vector, where I access them via the current stream index (e.g., float *ptr_filter_d = filters[stream_idx])
To summarize and add to the comments:
The profile does show small overlaps, so the answer to the title question is clearly yes.
The reason for the overlap being so small is just that each NPP kernel already needs all resources of the used GPU for most of its runtime. At the end of each kernel one can probably see the tail effect (i.e. the number of blocks is not a multiple of the number of blocks that can reside in SMs at each moment in time), so blocks from the next kernel are getting scheduled and there is some overlap.
It can sometimes be useful (i.e. an optimization) to force overlap between a big kernel which was started first and uses the full device and a later small kernel that only needs a few resources. In that case one can use stream priorities via cudaStreamCreateWithPriority to hint the scheduler to schedule blocks from the second kernel before blocks from the first kernel. An example of this can be found in this multi-GPU example (permalink).
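A minimal sketch of how such priorities could be requested (kernel names here are placeholders; in CUDA, numerically lower values mean higher priority):

int leastPriority, greatestPriority;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

cudaStream_t bigKernelStream, smallKernelStream;
cudaStreamCreateWithPriority(&bigKernelStream,   cudaStreamNonBlocking, leastPriority);    // lowest priority
cudaStreamCreateWithPriority(&smallKernelStream, cudaStreamNonBlocking, greatestPriority); // highest priority

// big_kernel<<<grid_big, block, 0, bigKernelStream>>>(...);       // launched first, fills the device
// small_kernel<<<grid_small, block, 0, smallKernelStream>>>(...); // its blocks get scheduled preferentially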
In this case however, as the size of the kernels is the same and there is no reason to prioritize any of them over the others, forcing an overlap like this would not decrease the total runtime because the compute resources are limited. In the profiler view the kernels might then show more overlap but also each one would take more time. That is the reason why the scheduler does not overlap the kernels even though you allow it to do so by using multiple streams (See asynchronous vs. parallel).
To still increase performance, one could write a custom CUDA kernel that does all the filters in one kernel launch. The main reason this could be better than using NPP in this case is that all the NPP kernels take the same input image. A single kernel could therefore significantly decrease the number of accesses to global memory by reading each tile of the input image only once (into shared memory, although L1 caching might suffice), then applying all the filters sequentially or in parallel (by splitting the thread block up into smaller units) and writing out the results.
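A rough sketch of that idea, under several assumptions that are not from the question (tile and radius sizes, a single-channel row-major float image, all filters of equal size stored back to back); it only illustrates the "read each input tile once, apply every filter" structure:

#define TILE   16
#define RADIUS 2

__global__ void multi_filter(const float *img, float *out, int width, int height,
                             const float *filters, int n_filters)
{
    __shared__ float tile[TILE + 2*RADIUS][TILE + 2*RADIUS];

    int gx = blockIdx.x * TILE + threadIdx.x - RADIUS;
    int gy = blockIdx.y * TILE + threadIdx.y - RADIUS;
    int cx = min(max(gx, 0), width  - 1);   // clamp = very simple border handling
    int cy = min(max(gy, 0), height - 1);

    tile[threadIdx.y][threadIdx.x] = img[cy * width + cx];   // each input tile is read once
    __syncthreads();

    bool inner = threadIdx.x >= RADIUS && threadIdx.x < TILE + RADIUS &&
                 threadIdx.y >= RADIUS && threadIdx.y < TILE + RADIUS &&
                 gx < width && gy < height;
    if (!inner) return;

    for (int f = 0; f < n_filters; f++)      // apply every filter to the cached tile
    {
        float acc = 0.0f;
        const float *w = filters + f * (2*RADIUS+1) * (2*RADIUS+1);
        for (int dy = -RADIUS; dy <= RADIUS; dy++)
            for (int dx = -RADIUS; dx <= RADIUS; dx++)
                acc += w[(dy+RADIUS)*(2*RADIUS+1) + (dx+RADIUS)]
                     * tile[threadIdx.y + dy][threadIdx.x + dx];
        out[f * width * height + gy * width + gx] = acc;     // one output image per filter
    }
}

It would be launched with a block of (TILE + 2*RADIUS) x (TILE + 2*RADIUS) threads and a grid of roughly (width/TILE) x (height/TILE) blocks, rounded up.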

about organizing threads in cuda

General question: does the number of threads have to equal the number of elements I want to process? For example, if I have a matrix M[a][b], must I allocate exactly a*b threads, or can I allocate more threads than I need (more than a*b)? The thread that would land on element a*b+1 would go out of bounds, wouldn't it? Or is the solution to add a condition (only proceed if the index is in range a*b)?
Specific question: let M[x][y] be a matrix with x rows and y columns, where 1000 <= x <= 300000 and y <= 100. How can I organize the threads so that the scheme works for any such x and y? I want each thread to handle one element of the matrix. Compute capability is 2.1. Thanks!
General answer: it depends on the problem.
In most cases a natural one-to-one mapping of the problem to the grid of threads is fine to start with, but what you want to keep in mind is:
Achieving high occupancy.
Maximizing GPU resources usage and memory throughput.
Working with valid data.
Sometimes this may require using a single thread to process many elements, or many threads to process a single element.
For instance, imagine a series of independent operations A, B and C that need to be applied to an array of elements. You could run three different kernels, but it might be a better choice to allocate a grid containing three times more threads than there are elements and distinguish the operations by one of the dimensions of the grid (or anything else). On the other hand, you might have a problem that benefits from maximizing the use of shared memory (e.g. transforming an image); you could use a block of 16 threads to process a 5x5 image window, where each thread calculates some statistics of a 2x2 slice.
The choice is yours; the best advice is to not always go with the obvious mapping. Try different approaches and choose what works best.
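For the specific question, here is a minimal sketch of the "guard with a range check" approach for an x-by-y matrix stored row-major (the names and the per-element work are illustrative only):

__global__ void process(float *M, int x, int y)   // x rows, y columns
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < x && col < y)           // surplus threads simply do nothing
        M[row * y + col] *= 2.0f;     // placeholder per-element work
}

// Host side: round the grid up so it covers every element.
// dim3 block(32, 8);                            // 256 threads per block
// dim3 grid((y + block.x - 1) / block.x,
//           (x + block.y - 1) / block.y);       // grid.y <= 37500, within the 65535 limit of CC 2.x
// process<<<grid, block>>>(d_M, x, y);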

Reordering of work dimensions may cause a huge performance boost, but why?

I am using OpenCL for stereo image processing on the GPU, and after porting a C++ implementation to OpenCL I was playing around with optimizations. A very simple experiment was to swap the dimensions around.
Consider a simple kernel which is executed for every pixel of the two-dimensional work space (e.g. 640x480). In my case it was a Census transform.
Swapping from:
int globalU = get_global_id(0);
int globalV = get_global_id(1);
to
int globalU = get_global_id(1);
int globalV = get_global_id(0);
while adjusting the NDRange in the same way, gave a performance boost of about 500%. Other experiments in 3D space brought the execution time from 72 ms down to 2 ms, purely by reordering the dimensions.
Can anybody explain to me how this happens?
Is it just an effect of memory pipelines and cache usage?
EDIT: The image has a standard memory layout. That's why I wondered about the effect. I expected the best speed when the iteration order matches the way the image is stored in memory, which is not the case.
After some reading of the AMD APP SDK documentation, I found some interesting details about the memory channels. That could be a reason.
When you access an element in memory, it is first loaded into the CPU's cache. The CPU does not load a single element (say, 1 byte); instead it loads a whole cache line (for example, 64 adjacent bytes). This is because it is usually likely that you will access subsequent elements, so the CPU would not need to access RAM again.
This makes a huge difference, since to access cache memory an electrical signal does not even have to leave the CPU chip, while if the CPU needs to load data from RAM the signal has to travel to a separate chip, and probably more than one signal is required, since it generally needs to specify a row and a column in RAM to access part of it (read "What Every Programmer Should Know About Memory" for more info). In practice a cache access may take only 0.5 ns while a RAM access can cost 100 ns.
So computer algorithms should take this into account. If you traverse all elements of a matrix, you should traverse them so that you access elements located near each other at approximately the same time. So if your matrix has the following layout in memory:
m_0_0, m_0_1, m_0_2, ... m_1_0, m_1_1 (first column, second column, etc.)
you should access elements in order: m_0_0, m_0_1, m_0_2 (by column)
If you used a different access order (say, by row in this case), the CPU would load part of the first column into the cache when you access the first element of the first column, then part of the second column when you access the first element of the second column, and so on. By the time you have traversed the first row and come back for the second element of the first column, the cached values for the first column would no longer be present, since the cache has a limited (and very small) size. Such an algorithm therefore effectively eliminates the benefit of the cache.
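The question itself is OpenCL, but the same effect is easy to demonstrate with a CUDA sketch (names are illustrative). The fast variant maps the fastest-varying thread index to the contiguous image dimension; the slow variant maps it to the strided one:

__global__ void coalesced(const float *img, float *out, int width, int height)
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;   // column: consecutive in memory
    int v = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (u < width && v < height)
        out[v * width + u] = img[v * width + u];     // adjacent threads -> adjacent addresses
}

__global__ void strided(const float *img, float *out, int width, int height)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // row now varies fastest across threads
    int u = blockIdx.y * blockDim.y + threadIdx.y;   // column
    if (u < width && v < height)
        out[v * width + u] = img[v * width + u];     // adjacent threads -> width-strided addresses
}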

CUDA - Understanding parallel execution of threads (warps) and coalesced memory access

I just started to code in CUDA and I'm trying to get my head around the concepts of how threads are executed and memory accessed in order to get the most out of the GPU. I read through the CUDA best practice guide, the book CUDA by Example and several posts here. I also found the reduction example by Mark Harris quite interesting and useful, but despite all the information I got rather confused on the details.
Let's assume we have a large 2D array (N*M) on which we do column-wise operations. I split the array into blocks so that each block has a number of threads that is a multiple of 32 (all threads fit into several warps). The first thread in each block allocates additional memory (a copy of the initial array, but only for the size of its own dimension) and shares the pointer using a __shared__ variable so that all threads of the same block can access the same memory. Since the number of threads is a multiple of 32, so should be the memory in order to be accessed in a single read. However, I need an extra padding around the memory block, a border, so that the width of my array becomes (32*x)+2 columns. The border comes from decomposing the large array, so that I have overlapping areas in which a copy of each block's neighbours is temporarily available.
Coalesced memory access:
Imagine the threads of a block are accessing the local memory block
int x = threadIdx.x;

for (int y = 0; y < height; y++)
{
    double value_centre = array[y*width + x+1]; // remember we have the border so we need an offset of +1
    double value_left   = array[y*width + x  ]; // hence the left element is at x
    double value_right  = array[y*width + x+2]; // and the right element at x+2

    // .. do something
}
Now, my understanding is that since I do have an offset (+1,+2), which is unavoidable, I will have at least two reads per warp and per assignment (except for the left elements), or does it not matter from where I start reading as long as the memory after the 1st thread is perfectly aligned? Note also, if that is not the case then I would have unaligned access to the array for each row after the first one, since the width of my array is (32*x)+2, and hence not 32-byte aligned. A further padding would however solve the problem for each new row.
Question: Is my understanding correct that in the example above only the first row would allow coalesced access, and only for the left element in the array, since that is the only one accessed without any offset?
Thread executed in a warp:
Threads in a warp are executed in parallel if and only if all the instructions are the same (according to the link). If I have a conditional statement / diverging execution, then that particular thread will be executed by itself and not within a warp with the others.
For example if I initialise the array I could do something like
1 int x = threadIdx.x;
2
3 array[x+1] = globalArray[blockIdx.x * blockDim.x + x]; // remember the border and therefore use +1
4
5 if (x == 0 || x == blockDim.x-1) // border
6 {
7 array[x] = DBL_MAX;
8 }
Will the warp be of size 32 and execute in parallel until line 3, then stall for all other threads while only the first and last thread continue to initialise the border? Or will those two be separated from all the other threads right at the beginning, since there is an if statement that all the other threads do not satisfy?
Question: How are threads collected into a single warp? Each thread in a warp needs to share the same instructions. Does this need to hold for the whole function? This is not the case for thread 1 (x=0), since it also initialises the border and is therefore different from the others. To my understanding, thread 1 is executed in a warp of its own, threads (2-33, etc.) in another warp, which then doesn't access the memory in a single read due to misalignment, and then again the final thread in a warp of its own due to the other border. Is that correct?
I wonder what the best practice is: to have memory perfectly aligned for each row (in which case I would run each block with (32*x)-2 threads so that the array with border, ((32*x)-2)+2, is a multiple of 32 for each new row), or to do it the way I demonstrated above, with a multiple of 32 threads per block, and just live with the unaligned memory. I am aware that these sorts of questions are not always straightforward and often depend on the particular case, but sometimes certain things are bad practice and should not become a habit.
When I experimented a little bit, I didn't really notice a difference in execution time, but maybe my examples were just too simple. I tried to get information from the visual profiler, but I haven't really understood all the information it gives me. I got however a warning that my occupancy level is at 17%, which I think must be really low and therefore there is something I do wrong. I didn't manage to find information on how threads are executed in parallel and how efficient my memory access is.
-Edit-
Added and highlighted 2 questions, one about memory access, the other one about how threads are collected to a single warp.
Now, my understanding is that since I do have an offset (+1,+2), which is unavoidable, I will have at least two reads per warp and per assignment (except for the left elements), or does it not matter from where I start reading as long as the memory after the 1st thread is perfectly aligned?
Yes, it does matter "from where you start reading" if you are trying to achieve perfect coalescing. Perfect coalescing means the read activity for a given warp and a given instruction all comes from the same 128-byte aligned cacheline.
Question: Is my understanding correct that in the example above only the first row would allow coalesced access, and only for the left element in the array, since that is the only one accessed without any offset?
Yes. For cc2.0 and higher devices, the cache(s) may mitigate some of the drawbacks of unaligned access.
Question: How are threads collected into a single warp? Each thread in a warp needs to share the same instructions. Does this need to hold for the whole function? This is not the case for thread 1 (x=0), since it also initialises the border and is therefore different from the others. To my understanding, thread 1 is executed in a warp of its own, threads (2-33, etc.) in another warp, which then doesn't access the memory in a single read due to misalignment, and then again the final thread in a warp of its own due to the other border. Is that correct?
The grouping of threads into warps always follows the same rules, and will not vary based on the specifics of the code you write, but is only affected by your launch configuration. When you write code that not all the threads will participate in (such as in your if statement), then the warp still proceeds in lockstep, but the threads that do not participate are idle. When you are filling in borders like this, it's rarely possible to get perfectly aligned or coalesced reads, so don't worry about it. The machine gives you that flexibility.
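If you do want every row aligned without changing the thread count, one option (sketched here with made-up sizes, not something the answer above prescribes) is to let the runtime pad each row via cudaMallocPitch and index rows by the returned pitch instead of the logical width:

#include <cuda_runtime.h>

int main()
{
    int k      = 4;                    // hypothetical: 32*k interior columns per row
    int width  = 32 * k + 2;           // logical row width including the 2 border columns
    int height = 1024;                 // hypothetical image height

    double *d_array;
    size_t pitchBytes;                 // padded row size in bytes, chosen by the runtime
    cudaMallocPitch((void **)&d_array, &pitchBytes, width * sizeof(double), height);

    // Inside the kernel, index rows with the pitch in elements rather than "width",
    // so every row starts on an aligned boundary:
    //   size_t pitchElems = pitchBytes / sizeof(double);
    //   double value_centre = array[y * pitchElems + x + 1];

    cudaFree(d_array);
    return 0;
}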

GPGPU: CUDA kernel configuration for 1D thread indexing - threads, blocks, shared memory, and registers

Suppose I have N tasks, where each task can be performed by a single thread on the GPU. Suppose also that N = the number of threads on the GPU.
Question 1:
Is the following an appropriate way to launch a 1D kernel of maximum size? Will all N threads that exist on the GPU perform the work?
cudaDeviceProp theProps;
cudaGetDeviceProperties(&theProps, 0);   // query device 0 so the limits below are actually filled in

dim3 mygrid(theProps.maxGridSize[0], 1, 1);
dim3 myblocks(theProps.maxThreadsDim[0], 1, 1);

mykernel<<<mygrid, myblocks>>>(...);
Question 2:
What is the property cudaDeviceProp::maxThreadsPerBlock in relation to cudaDeviceProp::maxThreadsDim[0] ? How do they differ? Can cudaDeviceProp::maxThreadsPerBlock be substituted for cudaDeviceProp::maxThreadsDim[0] above?
Question 3:
Suppose that I want to divide the shared memory of a block equally amongst the threads in the block, and that I want the most amount of shared memory available for each thread. Then I should maximize the number of blocks, and minimize the number of threads per block, right?
Question 4:
Just to confirm (after reviewing related questions on SO), in the linear (1D) grid/block scheme above, a global unique thread index is unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x. Right?
It's recommended to ask one question per post. Having all sorts of questions makes it very difficult for anyone to give a complete answer. SO isn't really a tutorial service. You should avail yourself of the existing documentation and webinars, and of course there are many other resources available.
Is the following an appropriate way to launch a 1D kernel of maximum size? Will all N threads that exist on the GPU perform the work?
It's certainly possible: all of the threads launched (say the count is N) will be available to perform work, and it will launch a grid of maximum (1D) size. But why do you want to do that anyway? Most CUDA programming methodologies don't start out with that as a goal. The grid should be sized to the algorithm. If the 1D grid size appears to be a limiter, you can work around it by performing loops in the kernel to handle multiple data elements per thread, or else launch a 2D grid to get around the 1D grid limit. The limit for cc3.x devices has been expanded.
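As a minimal sketch of the "loop in the kernel" approach (the kernel body is just a placeholder), a grid-stride loop lets a grid of modest, algorithm-chosen size cover any N:

__global__ void grid_stride_kernel(float *data, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] += 1.0f;   // placeholder per-element work
}

// Launch with a grid sized to the device/algorithm, not to the grid-size maximum:
// grid_stride_kernel<<<1024, 256>>>(d_data, n);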
What is the property cudaDeviceProp::maxThreadsPerBlock in relation to cudaDeviceProp::maxThreadsDim[0] ? How do they differ? Can cudaDeviceProp::maxThreadsPerBlock be substituted for cudaDeviceProp::maxThreadsDim[0] above?
The first is a limit on the total threads in a multidimensional block (i.e. threads_x*threads_y*threads_z). The second is a limit on the size of the first (x) dimension. For a 1D threadblock they are interchangeable, since the y and z dimensions are 1. For a multidimensional block, the overall limit exists to inform users that a threadblock of, for example, maxThreadsDim[0]*maxThreadsDim[1]*maxThreadsDim[2] threads is not legal.
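A quick way to see the two limits side by side on a given device (device 0 is assumed here):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("maxThreadsPerBlock: %d\n", p.maxThreadsPerBlock);
    printf("maxThreadsDim:      %d x %d x %d\n",
           p.maxThreadsDim[0], p.maxThreadsDim[1], p.maxThreadsDim[2]);
    // e.g. 1024 and 1024 x 1024 x 64 on many devices: a (1024, 1024, 64) block
    // would satisfy the per-dimension limits but not the total-threads limit.
    return 0;
}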
Suppose that I want to divide the shared memory of a block equally amongst the threads in the block, and that I want the most amount of shared memory available for each thread. Then I should maximize the number of blocks, and minimize the number of threads per block, right?
Again, I'm a bit skeptical of the methodology. But yes, the theoretical maximum of possible shared memory bytes per thread would be achieved by a threadblock with the smallest number of threads. However, allowing a threadblock to use all the available shared memory may result in only one threadblock being resident on an SM at a time. This may have negative consequences for occupancy, which may have negative consequences for performance. There are many useful recommendations for choosing threadblock size to maximize performance; I can't summarize them all here. But we want to choose the threadblock size as a multiple of the warp size, we generally want multiple warps per threadblock, and, all other things being equal, we want to enable maximum occupancy (which is related to the number of threadblocks that can be resident on an SM).
Just to confirm (after reviewing related questions on SO), in the linear (1D) grid/block scheme above, a global unique thread index is unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x. Right?
Yes, for a 1-D threadblock and grid structure, this line will give a globally unique thread ID:
unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;