HLSL Get number of threadGroups and numthreads in code - c++

My question concerns compute shaders, HLSL code in particular. DeviceContext.Dispatch(X, Y, Z) spawns X * Y * Z groups, each of which has x * y * z individual threads, as set in the attribute [numthreads(x,y,z)]. The question is: how can I get the total number of thread groups dispatched and the number of threads in a group? Let me explain why I want this: the amount of data I intend to process may vary significantly, so my methods should adapt to the size of the input arrays. Of course I could send the Dispatch arguments in a constant buffer to make them available from HLSL code, but what about the number of threads in a group? I am looking for methods like GetThreadGroupNumber() and GetThreadNumberInGroup(). I appreciate any help.

The number of threads in a group is simply the product of the numthreads dimensions. For example, numthreads(32,8,4) will have 32*8*4 = 1024 threads per group. This can be determined statically at compile time.
The ID for a particular thread group can be determined by adding a uint3 input argument with the SV_GroupID semantic.
The ID for a particular thread within a thread-group can be determined by adding a uint3 input argument with the SV_GroupThreadID semantic, or uint SV_GroupIndex if you prefer a flattened version.
As far as providing information to each thread on the total size of the dispatch, using a constant buffer is your best bet. This is analogous to the graphics pipeline, where the pixel shader doesn't naturally know the viewport dimensions.
It's also worth mentioning that if you do find yourself in a position where each thread needs to know the overall dispatch size, you should consider restructuring your algorithm. In general, it's better to dispatch a variable number of thread groups, each with a fixed amount of work, rather than a fixed number of threads with a variable amount of work. There are of course exceptions, but this tends to provide better utilization of the hardware.
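Putting the points above together, a minimal compute-shader sketch might look like this. The cbuffer name, the groupCount field, and CSMain are assumed names for illustration, not anything standard:

```hlsl
// Dispatch dimensions supplied by the application via a constant buffer.
// The layout (DispatchParams / groupCount) is a hypothetical naming.
cbuffer DispatchParams : register(b0)
{
    uint3 groupCount;   // the X, Y, Z the app passed to Dispatch()
};

#define THREADS_X 32
#define THREADS_Y 8
#define THREADS_Z 4     // threads per group = 32*8*4 = 1024, known at compile time

[numthreads(THREADS_X, THREADS_Y, THREADS_Z)]
void CSMain(uint3 groupId       : SV_GroupID,          // which group this is
            uint3 groupThreadId : SV_GroupThreadID,    // position within the group
            uint  groupIndex    : SV_GroupIndex,       // flattened position within the group
            uint3 dispatchId    : SV_DispatchThreadID) // global thread position
{
    // Total threads along X across the whole dispatch:
    uint totalThreadsX = groupCount.x * THREADS_X;
    // ... bounds-check dispatchId against the real data size before writing.
}
```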

Related

OpenMp: Best way to create an array with size of the number of threads

Specifically, I have to calculate pi in parallel with OpenMP, and I am only allowed to use #pragma omp parallel. So I wanted to create an array with the size of the number of threads, compute the partial sums in parallel, and then add the partial sums together. But it seems to be impossible to get the number of threads before the parallel region. So is the best approach to create a very large array, initialize it with 0.0, and then sum everything up, or is there a better way? I would appreciate every answer. Thank you in advance!
Fortunately, it is not impossible to obtain the number of threads in advance. The OpenMP runtime does not simply launch a random number of threads outside the control of the programmer and the program user. On the contrary, it follows a well-defined mechanism to determine that number, which is described in detail in the OpenMP specification. Specifically, unless you've supplied a higher fixed number of threads with the num_threads clause, the number of threads OpenMP launches is limited by the value of a special internal control variable (ICV for short) called nthreads-var. The way to set this ICV is via the OMP_NUM_THREADS environment variable or via an omp_set_num_threads() call (the latter overrides the former). The value of nthreads-var is accessible by calling omp_get_max_threads(). For other ICVs, see the specification.
All you need to do is call omp_get_max_threads() and use the return value as the size of your array; the number of threads will not exceed that value, provided you don't call omp_set_num_threads() with a larger value afterwards and don't apply a num_threads clause to the parallel construct.

Thread indexing for image rows / columns in CUDA

So I'm working on a CUDA program, and I'm experiencing some issues when it comes to indexing blocks and threads. Basically, I'm trying to implement the Pixel Sort algorithm in CUDA. (With one modification, we are only processing either rows or columns, not both at the same time)
The way I was visualizing it was to simply run N blocks with 1 thread each (N being the number of rows or columns) and have each block process its row / column independently of the others.
So if we want to sort on columns, we launch the kernel like this (there are a couple extra parameters that are only relevant to our specific processing, so I've left them out for simplicity)
pixel_sort<<<cols, 1>>>(d_image, d_imageSort, rows, cols, ...);
Then in the kernel, I access the block index with
int tid = blockIdx.x;
This lets me work with one row's or column's data per block, but it has some issues: it runs slower than our serial implementation of the algorithm for smaller images, and it crashes outright when the image size becomes too large.
An alternative thread scheme I was considering would be to map each of the image's pixels to one thread, however I have a couple of questions on this.
If we were to launch N blocks with M threads, representing N columns and M rows, how do we avoid the 512 (or 1024?) thread-per-block limit? Can we just have each thread process multiple pixels in the column in that case? What would the indexing look like in the kernel?
The algorithm basically requires that we work on the entire column, so each thread cannot just do some work on its own pixel; the threads have to communicate, presumably through shared memory. Would it be a valid strategy to have one "master" thread per block that does the actual sorting calculations, with all of the other threads just participating via shared memory?
Other Notes:
Our image data is read in through OpenCV, and has the RGBA values stored in a uchar4 array
If you have a single thread per block, you very quickly run into occupancy issues. If your goal is a full row sort (for columns, you could transpose the image before sending it to the GPU to take advantage of global memory coalescing), the fastest approach that gets a decent result is probably a radix or merge sort on a per-row basis, essentially following the steps in http://mgarland.org/files/papers/nvr-2008-001.pdf. You could assign k blocks of m threads to each row such that k*m >= image width. You would then be launching k*(image height) blocks in total, and your grid would be of size (k, height, 1).
As for your specific questions:
You cannot get around the 512/1024 thread-per-block limit, you would have to restructure your algorithm.
A "master" thread is generally poor design: it causes stalls and overhead and doesn't take advantage of the many cores. You may sometimes need to use a single thread, say to output or broadcast a result, but mostly you want to avoid it. See the linked paper for sample algorithms that mostly avoid this.
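The (k, height, 1) grid layout described above can be sketched roughly as follows; the kernel name, the choice of m = 256, and the placeholder body are all illustrative, not the actual pixel-sort implementation:

```cuda
// Hypothetical kernel: k blocks of m threads cooperate on each row.
__global__ void pixel_sort_rows(uchar4 *image, int width, int height)
{
    int row = blockIdx.y;                             // one image row per grid row
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // k*m >= width covers the row
    if (row < height && col < width)
    {
        uchar4 pixel = image[row * width + col];
        // ... per-row radix/merge-sort steps would go here ...
    }
}

// Host side: pick m, derive k so that k*m covers the image width.
void launch(uchar4 *d_image, int width, int height)
{
    int m = 256;                   // threads per block (illustrative choice)
    int k = (width + m - 1) / m;   // blocks per row, rounded up
    dim3 grid(k, height, 1);       // k * height blocks in total
    pixel_sort_rows<<<grid, m>>>(d_image, width, height);
}
```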

about organizing threads in cuda

General question: must the number of threads be equal to the number of elements I want to process? For example, if I have a matrix M[a][b], must I allocate exactly a*b threads, or can I allocate more threads than I need (more than a*b)? The thread that would handle element a*b+1 would be out of range, wouldn't it? Or is the solution to add a condition (only proceed if the index is in range [0, a*b))?
Specific question: let M[x][y] be a matrix with x rows and y columns, where 1000 <= x <= 300000 and y <= 100. How can I organize the threads so that the scheme works for any such x and y? I want each thread to handle one element of the matrix. Compute capability is 2.1. Thanks!
General answer: It depends on the problem.
In most cases natural one-to-one mapping of the problem to the grid of threads is fine to start with, but what you want to keep in mind is:
Achieving high occupancy.
Maximizing GPU resources usage and memory throughput.
Working with valid data.
Sometimes this may require using a single thread to process many elements, or many threads to process a single element.
For instance, imagine a series of independent operations A, B, and C that need to be applied to an array of elements. You could run three different kernels, but it might be a better choice to allocate a grid containing three times more threads than there are elements and distinguish the operations by one of the dimensions of the grid (or anything else). On the other hand, you might have a problem that benefits from maximizing shared memory usage (e.g. transforming the image): you could use a block of 16 threads to process a 5x5 image window, where each thread calculates some statistics of a 2x2 slice.
The choice is yours; the best advice is to not always go with the obvious. Try different approaches and choose what works best.
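For the specific question, the usual one-to-one mapping with a range guard looks roughly like this (kernel name, block shape, and the per-element work are made up for illustration). Extra threads beyond x*y simply fail the guard and do nothing, which answers the general question as well:

```cuda
__global__ void process(float *M, int x, int y)   // x rows, y columns
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < x && col < y)           // out-of-range threads just exit
        M[row * y + col] *= 2.0f;     // placeholder per-element work
}

// Host side: round the grid up so it covers any x in [1000, 300000], y <= 100.
// With 16x16 blocks, grid.y <= 300000/16 = 18750, within the 65535 limit of CC 2.1.
dim3 block(16, 16);
dim3 grid((y + block.x - 1) / block.x,
          (x + block.y - 1) / block.y);
process<<<grid, block>>>(d_M, x, y);
```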

GPGPU: CUDA kernel configuration for 1D thread indexing - threads, blocks, shared memory, and registers

Suppose I have N tasks, where each tasks can be performed by a single thread on the GPU. Suppose also that N = number of threads on the GPU.
Question 1:
Is the following an appropriate way to launch a 1D kernel of maximum size? Will all N threads that exist on the GPU perform the work?
cudaDeviceProp theProps;
cudaGetDeviceProperties(&theProps, 0);  // the struct must be filled in first; device 0 assumed
dim3 mygrid(theProps.maxGridSize[0], 1, 1);
dim3 myblocks(theProps.maxThreadsDim[0], 1, 1);
mykernel<<<mygrid, myblocks>>>(...);
Question 2:
What is the property cudaDeviceProp::maxThreadsPerBlock in relation to cudaDeviceProp::maxThreadsDim[0] ? How do they differ? Can cudaDeviceProp::maxThreadsPerBlock be substituted for cudaDeviceProp::maxThreadsDim[0] above?
Question 3:
Suppose that I want to divide the shared memory of a block equally amongst the threads in the block, and that I want the most amount of shared memory available for each thread. Then I should maximize the number of blocks, and minimize the number of threads per block, right?
Question 4:
Just to confirm (after reviewing related questions on SO), in the linear (1D) grid/block scheme above, a global unique thread index is unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x. Right?
It's recommended to ask one question per post. Having all sorts of questions makes it very difficult for anyone to give a complete answer. SO isn't really a tutorial service. You should avail yourself of the existing documentation and webinars, and of course there are many other resources available.
Is the following an appropriate way to launch a 1D kernel of maximum size? Will all N threads that exist on the GPU perform the work?
It's certainly possible; all of the launched threads (call the total N) will be available to perform work, and it will launch a grid of maximum (1D) size. But why do you want to do that anyway? Most CUDA programming methodologies don't start out with that as a goal. The grid should be sized to the algorithm. If the 1D grid size appears to be a limiter, you can work around it by looping in the kernel to handle multiple data elements per thread, or else launch a 2D grid to get around the 1D grid limit. (The limit for cc 3.x devices has been expanded.)
What is the property cudaDeviceProp::maxThreadsPerBlock in relation to cudaDeviceProp::maxThreadsDim[0] ? How do they differ? Can cudaDeviceProp::maxThreadsPerBlock be substituted for cudaDeviceProp::maxThreadsDim[0] above?
The first is a limit on the total threads in a multidimensional block (i.e. threads_x*threads_y*threads_z). The second is a limit on the first dimension (x) size. For a 1D thread block they are interchangeable, since the y and z dimensions are 1. For a multidimensional block, the per-dimension limits exist to inform users that thread blocks of, for example, maxThreadsDim[0] x maxThreadsDim[1] x maxThreadsDim[2] threads are not legal, because the product exceeds maxThreadsPerBlock.
Suppose that I want to divide the shared memory of a block equally amongst the threads in the block, and that I want the most amount of shared memory available for each thread. Then I should maximize the number of blocks, and minimize the number of threads per block, right?
Again, I'm a bit skeptical of the methodology. But yes, the theoretical maximum of shared memory bytes per thread is achieved by the thread block with the smallest number of threads. However, allowing one thread block to use all the available shared memory may mean only one thread block can be resident on an SM at a time, which can hurt occupancy and, in turn, performance. There are many useful recommendations for choosing thread block size to maximize performance, and I can't summarize them all here; but we want to choose a thread block size that is a multiple of the warp size, we generally want multiple warps per thread block, and, all other things being equal, we want to enable maximum occupancy (which is related to the number of thread blocks that can be resident on an SM).
Just to confirm (after reviewing related questions on SO), in the linear (1D) grid/block scheme above, a global unique thread index is unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x. Right?
Yes, for a 1-D threadblock and grid structure, this line will give a globally unique thread ID:
unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
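When N exceeds what a single 1D grid can cover, the loop-in-the-kernel approach mentioned earlier is commonly written as a grid-stride loop built on exactly this index formula (the kernel name and operation are illustrative):

```cuda
__global__ void saxpy(const float *x, float *y, float a, size_t n)
{
    // Each thread starts at its globally unique index and strides by the
    // total number of threads in the grid, so any n is covered regardless
    // of the grid-size limit.
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}
```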

CUDA: more dimensions for a block or just one?

I need to square root each element of a matrix (which is basically a vector of float values once in memory) using CUDA.
Matrix dimensions are not known a priori and may vary significantly (roughly 2 to 20,000 elements).
I was wondering: I might use (as Jonathan suggested here) one block dimension like this:
int thread_id = blockDim.x * block_id + threadIdx.x;
and check for thread_id lower than rows*columns... that's pretty simple and straight.
But is there any particular performance reason why should I use two (or even three) block grid dimensions to perform such a calculation (keeping in mind that I have a matrix afterall) instead of just one?
I'm thinking of coalescing issues, like making sure all threads read values sequentially.
The dimensions only exist for convenience; internally everything is linear, so there is no efficiency advantage either way. Avoiding the computation of the (contrived) linear index shown above might seem a bit faster, but there would be no difference in how the threads coalesce.
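A minimal 1D version of the square-root kernel described in the question (names and the block size are illustrative). Consecutive threads touch consecutive floats, so the accesses coalesce just as well as with a 2D scheme:

```cuda
__global__ void sqrt_all(float *data, int n)   // n = rows * columns
{
    int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    if (thread_id < n)                         // guard the tail of the last block
        data[thread_id] = sqrtf(data[thread_id]);
}

// Host side: launch enough blocks to cover all n elements.
int threads = 256;
int blocks  = (n + threads - 1) / threads;     // ceiling division
sqrt_all<<<blocks, threads>>>(d_data, n);
```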