Occupancy in CUDA is defined as
occupancy = active_warps / maximum_active_warps
What is the difference between a resident CUDA warp and an active one?
From my research on the web it seems that a block is resident (i.e. allocated, along with its register file and shared memory, on an SM) for the entire duration of its execution. Is there a difference from "being active"?
If I have a kernel which uses very few registers and little shared memory... does it mean that I can have maximum_active_warps resident blocks and achieve 100% occupancy, since occupancy just depends on the amount of registers/shared memory used?
What is the difference between a resident CUDA warp and an active one?
In this context presumably nothing.
From my research on the web it seems that a block is resident (i.e. allocated, along with its register file and shared memory, on an SM) for the entire duration of its execution. Is there a difference from "being active"?
Now you have switched from asking about warps to asking about blocks. But again, in this context no, you could consider them to be the same.
If I have a kernel which uses very few registers and little shared memory... does it mean that I can have maximum_active_warps resident blocks and achieve 100% occupancy, since occupancy just depends on the amount of registers/shared memory used?
No, because a warp and a block are not the same thing. As you yourself have quoted, occupancy is defined in terms of warps, not blocks. The maximum number of resident warps per SM is fixed at 48 or 64, depending on your hardware. The maximum number of resident blocks per SM is fixed at 8, 16 or 32, again depending on hardware. These are two independent limits, and both can constrain the effective occupancy a given kernel can achieve.
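Both limits can be queried at runtime rather than hard-coded; a minimal sketch (the maxBlocksPerMultiProcessor field assumes CUDA 11 or newer):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Warp limit: resident threads per SM divided by the warp size.
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Max resident warps per SM:  %d\n", maxWarpsPerSM);
    // Block limit: independent of the warp limit (CUDA 11+ field).
    printf("Max resident blocks per SM: %d\n", prop.maxBlocksPerMultiProcessor);
    return 0;
}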
Related
I am new to CUDA programming. I am currently in the process of running Monte Carlo simulations on a large number of large data samples.
I'm trying to dynamically calculate and maximize the number of blocks to submit to the GPU. The issue I have is that I am unclear on how to calculate the maximum number of blocks I can submit to my GPU at one time.
Here is the output of my GPU when querying it:
-----------------------------------------------
CUDA Device #: 0
Name: NVIDIA GeForce GTX 670
Revision number: 3.0
Warp size: 32
Maximum threads per block: 1024
Maximum Grid size: 2147483647
Multiprocessor Count: 7
-----------------------------------------------
What I am unclear on is that the maximum number of threads per block is clearly defined as 1024, but the grid size is not (at least to me). When I looked around in the documentation and online, the definition is as follows:
int cudaDeviceProp::maxGridSize[3] [inherited]
Maximum size of each dimension of a grid
What I want to know is what the grid size refers to:
The maximum total number of threads that can be submitted to the GPU? (In that case I would calculate the number of blocks as MAX_GRID_SIZE / MAX_THREAD_PER_BLOCK.)
The maximum number of blocks of 1024 threads? (In that case I would simply use MAX_GRID_SIZE.)
The last one seems kind of insane to me, since MAX_GRID_SIZE = 2^31-1 (2147483647), so the maximum number of threads would be (2^31-1)*1024 ≈ 2.2 trillion threads. That is why I tend to think the first option is correct. But I am looking for outside input.
I have found many discussions about the subject of calculating blocks, but almost all of them were specific to one GPU rather than the general way of calculating it or thinking about it.
On Nvidia CUDA the grid size signifies the number of blocks (not the number of threads), which are sent to the GPU in one kernel invocation.
The maximum grid size can be, and is, huge, as the CUDA programming model does not (normally) give any guarantee that blocks run at the same time. This helps to run the same kernels on low-end and high-end hardware of different generations. So the grid is for independent tasks, while the threads in a block can cooperate (especially with shared memory and synchronization barriers).
So a very large grid is more or less the same as an automatic loop around your kernel invocation or within your kernel around your code.
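As a sketch of that equivalence, this is what the explicit loop, a so-called grid-stride loop, looks like inside a kernel (the kernel name and body are just an illustration):

// A grid-stride loop: one kernel covers any n regardless of the grid size.
__global__ void scale(float *x, float a, int n) {
    // Each thread starts at its global index and advances by the total
    // number of threads in the grid until the array is exhausted.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        x[i] = a * x[i];
    }
}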
If you want to optimize the occupancy (parallel efficiency) of your GPU to the maximum, you should calculate how many threads can run at the same time.
The typical maximum is the maximum number of resident threads per SM times the number of SMs. The GTX 670 has 7 SMs (called SMX for that generation), each supporting a maximum of 2048 resident threads. So for maximum occupancy you can run a multiple of 7 x 2048 threads.
There are other limiting factors for those 2048 threads per multiprocessor, mainly the amount of registers and shared memory each of your threads or blocks needs. The GTX 670 has 48 KB of shared memory per SM and 65536 32-bit registers per SM. So if you limit your kernel to 32 registers per thread, all 2048 threads can be resident at once; at 64 registers per thread, only 1024 of them fit.
Sometimes one runs kernels with fewer than the maximum threads per block, e.g. 256 threads per block. The GTX 670 can run up to 16 blocks per SM at the same time, but 8 blocks of 256 threads already reach the 2048-thread limit, so additional blocks per SM gain nothing beyond that.
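A minimal sketch of sizing a launch to that residency limit at runtime (the block size of 256 is an assumption):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int blockSize = 256;  // assumed block size
// Blocks needed to fill every SM with its maximum of resident threads,
// e.g. GTX 670: 7 SMs x (2048 / 256) = 56 blocks per "wave".
int fullWave = prop.multiProcessorCount
             * (prop.maxThreadsPerMultiProcessor / blockSize);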
To optimize your kernel itself, or to get nice graphical and numeric feedback about its efficiency and bottlenecks, use Nvidia's Nsight Compute tool (if there is a version that still supports the 3.0 Kepler generation).
To get full speed, it is typically important to optimize memory accesses (coalescing) and to make sure that the 32 threads within a warp run in lockstep. Additionally, you should try to replace accesses to global memory with accesses to shared memory (being careful about bank conflicts).
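To make the coalescing point concrete, here is a hedged illustration with two contrived kernels, one coalesced and one strided:

// Coalesced: consecutive threads in a warp read consecutive words, so the
// hardware can service a warp with few wide memory transactions.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses 32 floats apart, spreading
// each warp's accesses over many transactions.
__global__ void strided(const float *in, float *out, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
    if (i < n) out[i] = in[i];
}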
If I query the maximum compute shader shared memory size with:
GLint maximum_shared_mem_size;
glGetIntegerv(GL_MAX_COMPUTE_SHARED_MEMORY_SIZE, &maximum_shared_mem_size);
I get 48KB as a result. However, according to this whitepaper:
https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
on page 13, it is stated, that for my GPU (2080TI):
The Turing L1 can be as large as 64 KB in size, combined with a 32 KB per SM shared memory allocation, or it can reduce to 32 KB, allowing 64 KB of allocation to be used for shared memory. Turing’s L2 cache capacity has also been increased.
So, I expect OpenGL to return 64KB for the maximum shared memory size.
Is this a wrong assumption? If so, why?
It looks like 48 KB is the expected result, as documented in the Turing Tuning Guide for CUDA:
Turing allows a single thread block to address the full 64 KB of shared memory. To maintain architectural compatibility, static shared memory allocations remain limited to 48 KB, and an explicit opt-in is also required to enable dynamic allocations above this limit. See the CUDA C Programming Guide for details.
It seems that you can either take the default 48KB or use CUDA to gain control over the carveout configuration.
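For reference, the CUDA-side opt-in mentioned in the tuning guide looks roughly like this (the kernel name and launch parameters are placeholders):

__global__ void myKernel(float *data) {  // hypothetical kernel
    extern __shared__ float smem[];      // dynamically sized shared memory
    // ... work using smem ...
}

// Host side: explicit opt-in is required before launching with > 48 KB.
cudaFuncSetAttribute(myKernel,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     64 * 1024);
myKernel<<<grid, block, 64 * 1024>>>(d_data);  // grid/block/d_data assumed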
I'm working on GPU Tesla M6. According to its datasheet, Tesla M6 has 12 multiprocessors, and each of them holds a maximum of 32 resident blocks. So the total maximum number of blocks resident on the entire device is 384.
Now, I have a data matrix with size (512,1408). I wrote a kernel and set the number of threads per block to 64 (1D block, one data element per thread), so the 1D grid size is 512*1408/64 = 11264 blocks, which is far beyond the number of resident blocks on the GPU. However, the whole program can still run and outputs correct results.
I wonder why the code can execute even though the real number of blocks exceeds the resident limit. Does it mean performance deterioration? Could you explain it to me in detail? Thanks!
A GPU can hold many more blocks than what can be resident according to your calculation.
The GPU loads up as many blocks as it can on SMs, and the remainder wait in a queue. As blocks finish their work on SMs and retire, they open up space for new blocks to be selected from the queue and made "resident". Eventually, the GPU processes all blocks this way.
There isn't anything necessarily wrong with this approach; it is typical for GPU programming. It does not necessarily mean performance deterioration. However, one approach to tuning kernels for maximum performance is to choose the number of blocks based on how many can be "resident". The calculation of how many can be resident, if properly done, is more complex than what you have outlined. It requires occupancy analysis. CUDA provides an occupancy API to do this analysis at runtime.
This approach will also require design of a kernel that can get work done with an arbitrary or fixed size grid, rather than a grid size selected based on the problem size. One typical approach for this is a grid-stride loop.
If you combine a kernel design like grid-stride loop, with a choice of blocks at runtime based on occupancy analysis, then you can get your work done with only the blocks that are "resident" on the GPU; none need be in the queue, waiting. This may or may not have any tangible performance benefits. Only by benchmarking will you know for sure.
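A hedged sketch of that combination, where myKernel stands in for a kernel written with a grid-stride loop:

int blockSize = 256;  // assumed block size
int blocksPerSM = 0;
// Ask the runtime how many blocks of myKernel can be resident per SM,
// given its register/shared memory usage at this block size.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                              blockSize, 0);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// Launch exactly one "wave": every block is resident, none waits in queue.
int gridSize = blocksPerSM * prop.multiProcessorCount;
myKernel<<<gridSize, blockSize>>>(d_data, n);  // body uses a grid-stride loop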
I suggest reading both articles I linked before asking follow-up questions. There are also many questions on the cuda tag discussing the concepts in this answer.
Threads in a thread block can have dependencies on each other. Programming models such as cooperative groups allow for larger groups than a thread block. The number of thread blocks in a grid can be orders of magnitude greater than the number of resident thread blocks (e.g. the minimum is 1 thread block, while GV100 supports 84 SMs x 32 blocks = 2688 resident thread blocks).
The compute work distributor assigns thread blocks to SMs. If the grid is preempted, the state is saved and later restored. When all threads in a thread block complete, the thread block's resources are released (warp slots, registers, shared memory) and the compute work distributor is notified. The compute work distributor will continue to assign thread blocks to SMs until all work in the grid completes.
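As an illustration of why residency matters for such larger groups: a grid-wide barrier is only legal when the whole grid fits on the device at once, which is why cooperative kernels must be launched through cudaLaunchCooperativeKernel with a grid no larger than the resident limit. A sketch (kernel name illustrative):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// A grid-wide barrier is only valid when every block in the grid is
// resident at once, which is why cooperative launches cap the grid size.
__global__ void twoPhase(float *data, int n) {
    cg::grid_group grid = cg::this_grid();
    // ... phase 1: each block writes its share of data ...
    grid.sync();  // requires all blocks to be co-resident to make progress
    // ... phase 2: safely read what other blocks wrote ...
}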
I'm writing a server process that performs calculations on a GPU using CUDA. I want to queue up incoming requests until enough memory is available on the device to run the job, but I'm having a hard time figuring out how much memory I can allocate on the device. I have a pretty good estimate of how much memory a job requires (at least how much will be allocated from cudaMalloc()), but I get device out of memory long before I've allocated the total amount of global memory available.
Is there some kind of formula to compute, from the total global memory, the amount I can allocate? I can play with it until I get an estimate that works empirically, but I'm concerned my customers will deploy different cards at some point and my jerry-rigged numbers won't work very well.
The size of your GPU's DRAM is an upper bound on the amount of memory you can allocate through cudaMalloc, but there's no guarantee that the CUDA runtime can satisfy a request for all of it in a single large allocation, or even a series of small allocations.
The constraints of memory allocation vary depending on the details of the underlying driver model of the operating system. For example, if the GPU in question is the primary display device, then it's possible that the OS has also reserved some portion of the GPU's memory for graphics. Other implicit state the runtime uses (such as the heap) also consumes memory resources. It's also possible that the memory has become fragmented and no contiguous block large enough to satisfy the request exists.
The CUDART API function cudaMemGetInfo reports the free and total amount of memory available. As far as I know, there's no similar API call which can report the size of the largest satisfiable allocation request.
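A minimal sketch of an admission check built on that call (jobEstimate and safetyMargin are placeholders for your own accounting):

size_t freeBytes = 0, totalBytes = 0;
cudaMemGetInfo(&freeBytes, &totalBytes);
// freeBytes is an upper bound, not a promise: fragmentation and driver
// overheads can make even a somewhat smaller single allocation fail.
if (freeBytes > jobEstimate + safetyMargin) {  // both values are yours to pick
    // admit the job; otherwise leave it in the queue
}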
I'm still going mad over these unknown-size matrices, which may vary from 10 to 20,000 in each dimension.
I'm looking at the CUDA sdk and wondering: what if I choose a number of blocks too high?
Something like a grid of 9999 x 9999 blocks in the X and Y dimensions: if my hardware has SMs which can't hold all these blocks, will the kernel have problems, or will performance simply collapse?
I don't know how to dimension in blocks/threads something which may vary so much... I'm thinking of using the MAXIMUM number of blocks my hardware supports and then making the threads inside them work across the whole matrix. Is this the right way?
The thread blocks do not have a one-to-one mapping with the cores. Blocks are scheduled onto cores as they become available, meaning you can request as many as you want (up to a limit, probably). Requesting a huge number of blocks would just slow the system down as it loads and unloads do-nothing thread blocks onto the cores.
You can specify the dimensions of the grid and blocks at run time.
Edit: Here are the limits on the dimensions of the grid and the blocks, from the documentation.
If you choose an excessively large grid, you waste some cycles while the "dead" blocks get retired (typically only of the order of a few tens of microseconds, even for the maximum grid size on a "full size" Fermi or GT200 card). It isn't a huge penalty.
But the grid dimension should always be computable a priori. Usually there is a known relationship between the problem size and a quantifiable unit of data-parallel work (something like one thread per data point, or one block per matrix column) which allows the required grid dimensions to be calculated at runtime.
An alternative strategy would be to use a fixed number of blocks (usually only something like 4-8 per MP on the GPU is needed) and have each block/thread process multiple units of parallel work, so that each block becomes "persistent". If there is a lot of fixed overhead cost in per-thread setup, this can be a good way to amortize those fixed overheads across more work per thread.
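A hedged sketch of that persistent-block pattern, with each block striding over matrix columns (names and the per-element work are illustrative):

// "Persistent" blocks: a small fixed grid where each block strides over
// matrix columns until all columns are consumed.
__global__ void processColumns(float *m, int rows, int cols) {
    for (int col = blockIdx.x; col < cols; col += gridDim.x) {
        for (int row = threadIdx.x; row < rows; row += blockDim.x) {
            m[col * rows + row] *= 2.0f;  // placeholder per-element work
        }
    }
}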