I am new to CUDA programming. I am currently in the process of running Monte Carlo simulations on a large number of large data samples.
I'm trying to dynamically calculate and maximize the number of blocks to submit to the GPU. The issue I have is that I am unclear on how to calculate the maximum number of blocks I can submit to my GPU at one time.
Here is the output of my GPU when querying it:
-----------------------------------------------
CUDA Device #: 0
Name: NVIDIA GeForce GTX 670
Revision number: 3.0
Warp size: 32
Maximum threads per block: 1024
Maximum Grid size: 2147483647
Multiprocessor Count: 7
-----------------------------------------------
What I am unclear about is that the maximum number of threads per block is clearly defined as 1024, but the grid size is not (at least to me). When I looked around in the documentation and online, the definition is as follows:
int cudaDeviceProp::maxGridSize[3] [inherited]
Maximum size of each dimension of a grid
What I want to know is: what is the grid size referring to?
The maximum total number of threads that can be submitted to the GPU?
(in which case I would calculate the number of blocks like so: MAX_GRID_SIZE / MAX_THREAD_PER_BLOCK)
The maximum number of blocks of 1024 threads (in which case I would simply use MAX_GRID_SIZE)
The last one seems kind of insane to me, since MAX_GRID_SIZE = 2^31 - 1 (2147483647), and therefore the maximum number of threads would be (2^31 - 1) * 1024 ≈ 2.2 trillion threads. Which is why I tend to think the first option is correct. But I am looking for outside input.
I have found many discussions about calculating the number of blocks, but almost all of them were specific to one GPU rather than a general way of calculating or thinking about it.
On Nvidia CUDA the grid size signifies the number of blocks (not the number of threads), which are sent to the GPU in one kernel invocation.
The maximum grid size can be and is huge, as the CUDA programming model does not (normally) give any guarantee that blocks run at the same time. This helps to run the same kernels on low-end and high-end hardware of different generations. So the grid is for independent tasks; the threads in a block can cooperate (especially with shared memory and synchronization barriers).
So a very large grid is more or less the same as an automatic loop around your kernel invocation or within your kernel around your code.
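As a concrete illustration (a minimal sketch, not taken from the question's code), the grid dimension in the launch configuration counts blocks, and the total number of launched threads is blocks times threads per block:

#include <cstdio>

// Illustrative element-wise kernel: one thread per data element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // the last block may be only partially used
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;      // assumed sample count, for illustration only
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    const int threadsPerBlock = 256;
    // The grid size is the number of BLOCKS, not threads: round up to cover n.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}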
If you want to optimize the occupancy (parallel efficiency) of your GPU to the maximum, you should calculate how many threads can run at the same time.
The typical maximum is the maximum number of resident threads per SM times the number of SMs. The GTX 670 has 7 SMs (called SMX for that generation), each of which can keep up to 2048 threads resident. So for maximum occupancy you can run a multiple of 7 x 2048 threads.
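That calculation can be done at runtime from the device properties; here is a minimal sketch (device index 0 is an assumption):

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // assumes device 0

    // Upper bound on threads that can be resident on the whole GPU at once.
    int maxConcurrentThreads = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;

    printf("SMs: %d, max resident threads per SM: %d, max concurrent threads: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor,
           maxConcurrentThreads);
    return 0;
}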
There are other factors that can limit you to fewer than 2048 resident threads per multiprocessor, mainly the amount of registers and shared memory each of your threads or blocks needs. The GTX 670 has 48 KB of shared memory per SM and 65536 32-bit registers per SM. So a 1024-thread block fits into the register file as long as each thread uses no more than 64 registers; to keep the full 2048 threads resident, each thread can use at most 32 registers.
Sometimes one runs kernels with fewer threads than the maximum block size, e.g. 256 threads per block. The GTX 670 can run up to 16 blocks per SM at the same time, but you still cannot get more than 2048 resident threads per SM altogether. So nothing is gained in total thread count.
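To see how close a particular kernel is to those limits, you can query its resource usage at runtime; a sketch (myKernel is a placeholder name):

#include <cstdio>

__global__ void myKernel(float *data) { /* placeholder body */ }

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Rough estimate of the thread count per SM allowed by register use alone
    // (real allocation is per warp and has granularity, so this is approximate).
    int threadsByRegs = attr.numRegs > 0
                      ? prop.regsPerMultiprocessor / attr.numRegs
                      : prop.maxThreadsPerMultiProcessor;

    printf("registers per thread: %d, static shared memory per block: %zu bytes\n",
           attr.numRegs, attr.sharedSizeBytes);
    printf("threads per SM as limited by registers: ~%d\n", threadsByRegs);
    return 0;
}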
To optimize your kernel itself, or to get graphical and numeric feedback about its efficiency and bottlenecks, use the Nvidia Nsight Compute tool (if there is a version that still supports the 3.0 Kepler generation).
To get full speed, it is typically important to optimize memory accesses (coalescing) and to make sure that the 32 threads within a warp are running in perfect lockstep. Additionally you should try to replace accesses to global memory with accesses to shared memory (be careful about bank conflicts).
Related
I'm working on a Tesla M6 GPU. According to its datasheet, the Tesla M6 has 12 multiprocessors, each of which can hold a maximum of 32 resident blocks, so the total maximum number of blocks resident on the entire device is 384.
Now, I have a data matrix of size (512, 1408). I wrote a kernel and set the number of threads per block to 64 (1D block, one data element per thread), so the 1D grid size is 512 * 1408 / 64 = 11264 blocks, which is far beyond the number of blocks that can be resident on the GPU. However, the whole program still runs and outputs correct results.
I wonder why the code can execute even though the real number of blocks exceeds the resident limit. Does it mean performance deterioration? Could you explain this to me in detail? Thanks!
A GPU can hold many more blocks than what can be resident according to your calculation.
The GPU loads up as many blocks as it can on SMs, and the remainder wait in a queue. As blocks finish their work on SMs and retire, they open up space for new blocks to be selected from the queue and made "resident". Eventually, the GPU processes all blocks this way.
There isn't anything necessarily wrong with this approach; it is typical for GPU programming. It does not necessarily mean performance deterioration. However, one approach to tuning kernels for maximum performance is to choose the number of blocks based on how many can be "resident". The calculation of how many can be resident, if properly done, is more complex than what you have outlined. It requires occupancy analysis. CUDA provides an occupancy API to do this analysis at runtime.
This approach will also require design of a kernel that can get work done with an arbitrary or fixed size grid, rather than a grid size selected based on the problem size. One typical approach for this is a grid-stride loop.
If you combine a kernel design like grid-stride loop, with a choice of blocks at runtime based on occupancy analysis, then you can get your work done with only the blocks that are "resident" on the GPU; none need be in the queue, waiting. This may or may not have any tangible performance benefits. Only by benchmarking will you know for sure.
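A sketch of what that combination might look like, using the occupancy API mentioned above together with a grid-stride loop (the kernel and block size are illustrative, not the asker's actual code):

#include <cstdio>

// Grid-stride loop: the kernel is correct for ANY grid size, because each
// thread strides over the data by the total number of launched threads.
__global__ void addOne(float *data, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] += 1.0f;
}

int main()
{
    const size_t n = 512 * 1408;           // matrix size from the question
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    const int blockSize = 64;
    int blocksPerSM = 0;
    // Ask the runtime how many blocks of this size can be resident per SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, addOne, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Launch only as many blocks as can be resident: no blocks wait in the queue.
    int gridSize = blocksPerSM * prop.multiProcessorCount;

    addOne<<<gridSize, blockSize>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}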
I suggest reading both articles I linked before asking follow-up questions. There are also many questions on the cuda tag discussing the concepts in this answer.
Threads in a thread block can have dependencies on each other. Programming models such as cooperative groups allow for larger groups than a thread block. The number of thread blocks in a grid can be orders of magnitude greater than the number of resident thread blocks (e.g. the minimum is 1 thread block, while GV100 supports 84 x 32 = 2688 resident thread blocks).
The compute work distributor assigns thread blocks to SMs. If the grid is preempted, the state is saved and later restored. When all threads in a thread block complete, the thread block's resources are released (warp slots, registers, shared memory) and the compute work distributor is notified. The compute work distributor continues to assign thread blocks to SMs until all work in the grid completes.
My GPU is an NVIDIA GeForce GT440, whose compute capability is 2.x. NVIDIA's official CUDA_C_Programming_Guide points out:
Limit 1. Maximum number of threads per block = 1024
Limit 2. Maximum number of resident threads per multiprocessor = 1536
However, one of the OpenGL compute shader implementation limits is
Limit 3. GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS = 1536
My questions are
1. Why is Limit 1 not equal to Limit 2 and Limit 3?
2. Should the real threads/block (invocations/workgroup) be 1024 or 1536?
Why is Limit 1 not equal to Limit 2 and Limit 3?
Because they aren't the same thing. Blocks are a logical construct in CUDA and are limited to a maximum of 1024 threads. But a multiprocessor can run multiple blocks concurrently (up to 8 in the case of your hardware). So an SM can have up to 1536 concurrent threads on your hardware, but not all of those threads can come from a single block.
Should the real threads/block be 1024 or 1536?
1024, for all the reasons outlined above. You can see a complete summary of the capabilities of all supported hardware in the compute capability table of the CUDA C Programming Guide.
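Both limits can also be read directly from the device properties at runtime; a small sketch (device 0 assumed):

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // assumes device 0

    // Limit 1: the per-block limit (a logical limit on a single block).
    // Limit 2: the per-SM residency limit (threads from several blocks combined).
    printf("max threads per block:       %d\n", prop.maxThreadsPerBlock);
    printf("max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}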
The algorithm that I'm implementing has a number of things that need to be done in parallel. My question is: if I'm not going to use shared memory, should I prefer more blocks with fewer threads per block, or more threads per block with fewer blocks, for performance, so that the total number of threads adds up to the number of parallel things I need to do?
I assume the "set number of things" is a small number or you wouldn't be asking this question. Attempting to expose more parallelism might be time well spent.
CUDA GPUs group execution activity and the resultant memory accesses into warps of 32 threads. So at a minimum, you'll want to start by creating at least one warp per threadblock.
You'll then want to create at least as many threadblocks as you have SMs in your GPU. If you have 4 SMs, then your next scaling increment above 32 would be to create 4 threadblocks of 32 threads each.
If you have more than 128 of these parallel "things" in this hypothetical example, then you will probably want to increase both warps per threadblock and the number of threadblocks. You might keep adding threadblocks until you reach some number, perhaps around 16 or so, that would allow your code to scale up on GPUs larger than your hypothetical 4-SM GPU. But there are limits to the number of threadblocks that can be open on a single SM, so pretty quickly after 16 or so threadblocks you'll also want to increase the number of warps per threadblock beyond 1 (i.e. beyond 32 threads).
These strategies for small problems will allow you to take advantage of all the hardware on the GPU as quickly as possible as your problem scales up, while still allowing opportunities for latency hiding if your problem is large enough (eg. more than one warp per threadblock, or more than one threadblock resident per SM).
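A rough sketch of that scaling heuristic as a helper function (the thresholds, the 16-block cut-over in particular, are the ballpark figures suggested above, not hard rules):

// Pick a launch configuration for n independent work items (n > 0 assumed):
// start at one warp per block, grow the block count first, then grow the
// block size once the block count passes roughly 16.
void chooseLaunch(int n, int *blocks, int *threadsPerBlock)
{
    const int warpSize = 32;
    *threadsPerBlock = warpSize;                   // at least one warp per block
    *blocks = (n + warpSize - 1) / warpSize;       // enough threads to cover n

    const int blockTarget = 16;                    // "around 16 or so" from above
    while (*blocks > blockTarget && *threadsPerBlock < 1024) {
        *threadsPerBlock *= 2;                     // add warps per block instead
        *blocks = (n + *threadsPerBlock - 1) / *threadsPerBlock;
    }
}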
Why does performance improve when I run more than 32 threads per block?
My graphics card has 480 CUDA cores (15 SMs * 32 SPs).
Each SM has 1-4 warp schedulers (Tesla = 1, Fermi = 2, Kepler = 4). Each warp scheduler is responsible for executing a subset of the warps allocated to the SM. Each warp scheduler maintains a list of eligible warps. A warp is eligible if it can issue an instruction on the next cycle. A warp is not eligible if it is stalled on a data dependency, waiting to fetch an instruction, or the execution unit for the next instruction is busy. On each cycle, each warp scheduler picks a warp from its list of eligible warps and issues 1 or 2 instructions.
The more active warps per SM, the larger the number of warps each warp scheduler has to pick from on each cycle. In most cases, optimal performance is achieved when there are enough active warps per SM to have 1 eligible warp per warp scheduler per cycle. Increasing occupancy beyond this point does not increase performance and may decrease it.
A typical target for active warps is 50-66% of the maximum warps for the SM. The ratio of warps to maximum warps supported by a launch configuration is called Theoretical Occupancy. The runtime ratio of active warps per cycle to maximum warps per cycle is Achieved Occupancy. For a GTX480 (CC 2.0 device) a good starting point when designing a kernel is 50-66% Theoretical Occupancy. CC 2.0 SM can have a maximum of 48 warps. A 50% occupancy means 24 warps or 768 threads per SM.
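If you want a quick starting point for the block size that maximizes Theoretical Occupancy for a given kernel, the occupancy calculator API in the CUDA runtime can suggest one; a sketch (myKernel is a placeholder):

#include <cstdio>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    int minGridSize = 0, blockSize = 0;
    // Suggests the block size with the highest theoretical occupancy for this
    // kernel, plus the minimum grid size needed to reach that occupancy.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    printf("suggested block size: %d, minimum grid size: %d\n", blockSize, minGridSize);
    return 0;
}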
The CUDA Profiling Activity in Nsight Visual Studio Edition can show the theoretical occupancy, achieved occupancy, active warps per SM, eligible warps per SM, and stall reasons.
The CUDA Visual Profiler, nvprof, and the command line profiler can show theoretical occupancy, active warps, and achieved occupancy.
NOTE: The count of CUDA cores should only be used to compare cards of similar architectures, to calculate theoretical FLOPS, and to potentially compare differences between architectures. Do not use the count when designing algorithms.
Welcome to Stack Overflow. The reason is that CUDA cores are pipelined. On Fermi, the pipeline is around 20 clocks long. This means that to saturate the GPU, you may need up to 20 threads per core.
The primary reason is the memory latency hiding model of CUDA. Most modern CPUs use caches to hide the latency to main memory. This results in a large percentage of chip resources being devoted to cache; most desktop and server processors have several megabytes of cache on the die, which actually accounts for most of the die space. In order to pack on more cores with the same energy usage and heat dissipation characteristics, CUDA-based chips instead devote their chip space to large numbers of CUDA cores (which are mostly just floating-point ALUs). Since there is very little cache, they instead rely on having more threads ready to run while other threads are waiting on memory accesses to return, in order to hide that latency. This gives the cores something productive to work on while some warps are waiting on memory accesses. The more warps per SM, the more chance one of them will be runnable at any given time.
CUDA also has zero-cost thread switching in order to aid in this memory-latency-hiding scheme. A normal CPU incurs a large overhead when switching from executing one thread to the next, due to the need to store all of the register values of the thread it is switching away from onto the stack and then load all of those of the thread it is switching to. CUDA SMs simply have huge register files, so each thread has its own set of physical registers assigned to it for the life of the thread. Since there is no need to store and load register values, an SM can execute threads from one warp on one clock cycle and threads from a different warp on the very next clock cycle.
I'm still going mad over these unknown-size matrices, which may vary from 10 to 20,000 in each dimension.
I'm looking at the CUDA SDK examples and wondering: what if I choose a number of blocks that is too high?
Something like a grid of 9999 x 9999 blocks in the X and Y dimensions: if my hardware has SMs which can't hold all these blocks, will the kernel have problems, or will performance simply collapse?
I don't know how to dimension the blocks/threads for something which may vary so much. I'm thinking of using the MAXIMUM number of blocks my hardware supports and then making the threads inside them work across the whole matrix. Is this the right way?
The thread blocks do not have a one-to-one mapping with the cores. Blocks are scheduled onto multiprocessors as they become available, meaning you can request as many as you want (up to the grid size limit). Requesting a huge number of blocks would just slow the system down as it loads and unloads do-nothing thread blocks onto the multiprocessors.
You can specify the dimensions of the grid and blocks at run time.
Edit: the limits on the dimensions of the grid and the blocks are listed in the compute capability table of the documentation.
If you choose an excessively large grid size, you waste some cycles while the "dead" blocks get retired (typically only on the order of a few tens of microseconds, even for the maximum grid size on a "full size" Fermi or GT200 card). It isn't a huge penalty.
But the grid dimension should always be computable a priori. Usually there is a known relationship to a quantifiable unit of data-parallel work (something like one thread per data point, or one block per matrix column) which allows the required grid dimensions to be calculated at runtime.
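For the matrix case in the question, that runtime calculation might look like the following sketch (the 16 x 16 block shape and the element-wise operation are illustrative assumptions):

#include <cstdio>

__global__ void process(float *m, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)          // guard the partially filled edge blocks
        m[row * cols + col] *= 2.0f;
}

int main()
{
    int rows = 20000, cols = 20000;        // the "up to 20,000 per dimension" case
    float *d_m;
    cudaMalloc(&d_m, (size_t)rows * cols * sizeof(float));  // ~1.6 GB of device memory

    dim3 block(16, 16);                    // 256 threads per block
    dim3 grid((cols + block.x - 1) / block.x,   // grid dimensions derived from the data size
              (rows + block.y - 1) / block.y);

    process<<<grid, block>>>(d_m, rows, cols);
    cudaDeviceSynchronize();
    cudaFree(d_m);
    return 0;
}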
An alternative strategy would be to use a fixed number of blocks (usually something like 4-8 per MP on the GPU is enough) and have each block/thread process multiple units of parallel work, so that each block becomes "persistent". If there are significant fixed setup costs per thread, this can be a good way to amortize them across more work per thread.
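A minimal sketch of that persistent-block strategy (the factor of 8 blocks per multiprocessor is the upper end of the range mentioned above; the kernel is illustrative):

__global__ void processPersistent(float *m, size_t n)
{
    // Fixed-size grid: each thread strides over the whole array, so any
    // per-thread setup cost is paid once and amortized over many elements.
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        m[i] *= 2.0f;
}

void launchPersistent(float *d_m, size_t n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           // assumes device 0
    int blocks = 8 * prop.multiProcessorCount;   // "4-8 per MP" from the answer above
    processPersistent<<<blocks, 256>>>(d_m, n);
}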