CUDA performance improves when running more threads than there are cores - c++

Why does performance improve when I run more than 32 threads per block?
My graphics card has 480 CUDA Cores (15 SM * 32 SP).

Each SM has 1-4 warp schedulers (Tesla = 1, Fermi = 2, Kepler = 4). Each warp scheduler is responsible for executing a subset of the warps allocated to the SM. Each warp scheduler maintains a list of eligible warps. A warp is eligible if it can issue an instruction on the next cycle. A warp is not eligible if it is stalled on a data dependency, waiting to fetch an instruction, or if the execution unit for the next instruction is busy. On each cycle, each warp scheduler picks a warp from the list of eligible warps and issues 1 or 2 instructions.
The more active warps per SM, the larger the pool of warps each warp scheduler can pick from on each cycle. In most cases, optimal performance is achieved when there are sufficient active warps per SM to have 1 eligible warp per warp scheduler per cycle. Increasing occupancy beyond this point does not increase performance and may decrease it.
A typical target for active warps is 50-66% of the maximum warps for the SM. The ratio of warps to maximum warps supported by a launch configuration is called Theoretical Occupancy. The runtime ratio of active warps per cycle to maximum warps per cycle is Achieved Occupancy. For a GTX480 (CC 2.0 device) a good starting point when designing a kernel is 50-66% Theoretical Occupancy. CC 2.0 SM can have a maximum of 48 warps. A 50% occupancy means 24 warps or 768 threads per SM.
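On CUDA 6.5 and later (so after the GTX 480 era, but the same reasoning applies), the runtime can compute the theoretical occupancy of a launch configuration for you. A minimal sketch, assuming a hypothetical kernel myKernel and device 0:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* data) { /* hypothetical kernel body */ }

int main() {
    int blockSize = 256;   // 8 warps per block
    int blocksPerSM = 0;
    // how many blocks of this size can be resident on one SM for this kernel
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}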
The CUDA Profiling Activity in Nsight Visual Studio Edition can show the theoretical occupancy, achieved occupancy, active warps per SM, eligible warps per SM, and stall reasons.
The CUDA Visual Profiler, nvprof, and the command line profiler can show theoretical occupancy, active warps, and achieved occupancy.
NOTE: The count of CUDA cores should only be used to compare cards of similar architectures, to calculate theoretical FLOPS, and to potentially compare differences between architectures. Do not use the count when designing algorithms.

Welcome to Stack Overflow. The reason is that CUDA cores are pipelined. On Fermi, the pipeline is around 20 clocks long. This means that to saturate the GPU, you may need up to 20 threads per core.
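With 480 cores, that back-of-the-envelope estimate works out to up to 480 x 20 = 9,600 threads in flight to keep every pipeline stage busy, far more than one thread per core.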

The primary reason is the memory latency hiding model of CUDA. Most modern CPUs use cache to hide latency to main memory. This results in a large percentage of chip resources being devoted to cache. Most desktop and server processors have several megabytes of cache on the die, which actually accounts for most of the die space. In order to pack on more cores with the same energy usage and heat dissipation characteristics, CUDA-based chips instead devote their chip space to tons of CUDA cores (which are mostly just floating-point ALUs). Since there is very little cache, they instead rely on having more threads ready to run while other threads are waiting on memory accesses to return, in order to hide that latency. This gives the cores something productive to work on while some warps are waiting on memory accesses. The more warps per SM, the more chance one of them will be runnable at any given time.
CUDA also has zero-cost thread switching in order to aid in this memory-latency-hiding scheme. A normal CPU incurs a large overhead to switch from executing one thread to the next, due to the need to store all of the register values for the thread it is switching away from onto the stack and then load all of the ones for the thread it is switching to. CUDA SMs just have tons and tons of registers, so each thread has its own set of physical registers assigned to it for the life of the thread. Since there is no need to store and load register values, an SM can execute threads from one warp on one clock cycle and threads from a different warp on the very next clock cycle.

Related

C++ CUDA Gridsize meaning clarification

I am new to CUDA programming. I am currently doing Monte Carlo simulations on a high number of large data samples.
I'm trying to dynamically calculate the maximum number of blocks to submit to the GPU. The issue I have is that I am unclear on how to calculate the maximum number of blocks I can submit to my GPU at one time.
Here is the output of my GPU when querying it:
-----------------------------------------------
CUDA Device #: 0
Name: NVIDIA GeForce GTX 670
Revision number: 3.0
Warp size: 32
Maximum threads per block: 1024
Maximum Grid size: 2147483647
Multiprocessor Count: 7
-----------------------------------------------
What I am unclear on is that the maximum number of threads per block is clearly defined as 1024, but the grid size is not (at least to me). When I looked around in the documentation and online, the definition is as follows:
int cudaDeviceProp::maxGridSize[3] [inherited]
Maximum size of each dimension of a grid
What I want to know is whether the grid size refers to:
The maximum total number of threads that can be submitted to the GPU?
(therefore I would calculate the number of blocks like so: MAX_GRID_SIZE / MAX_THREAD_PER_BLOCK)
The maximum number of blocks of 1024 threads (therefore I would simply use MAX_GRID_SIZE)?
The last one seems kind of insane to me, since MAX_GRID_SIZE = 2^31-1 (2147483647), so the maximum number of threads would be (2^31-1)*1024 ≈ 2.2 trillion threads. Which is why I tend to think the first option is correct. But I am looking for outside input.
I have found many discussions about calculating the number of blocks, but almost all of them were specific to one GPU rather than the general way of calculating it or thinking about it.
On Nvidia CUDA, the grid size signifies the number of blocks (not the number of threads) that are sent to the GPU in one kernel invocation.
The maximum grid size can be, and is, huge, as the CUDA programming model does not (normally) give any guarantee that blocks run at the same time. This helps run the same kernels on low-end and high-end hardware of different generations. So the grid is for independent tasks, while the threads in a block can cooperate (especially with shared memory and synchronization barriers).
So a very large grid is more or less the same as an automatic loop around your kernel invocation or within your kernel around your code.
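The standard idiom for that in-kernel loop is the grid-stride loop, which decouples the grid size from the problem size. A minimal sketch (the kernel, d_data, and n are placeholders):

__global__ void scale(float* data, int n) {
    // each thread handles elements idx, idx + stride, idx + 2*stride, ...
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = 2.0f * data[i];
}

// launch with a grid sized to the hardware rather than to n,
// e.g. a few blocks per SM on a 7-SM GTX 670:
scale<<<7 * 8, 256>>>(d_data, n);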
If you want to optimize the occupancy (parallel efficiency) of your GPU to the maximum, you should calculate how many threads can run at the same time.
The typical maximum is the maximum number of resident threads per SM times the number of SMs. The GTX 670 has 7 SMs (called SMX for that generation), each of which can hold up to 2048 resident threads on compute capability 3.0 (the 1024 figure you queried is the per-block limit, not the per-SM one). So for maximum occupancy you can run a multiple of 7x2048 threads.
There are other limiting factors for the resident threads per multiprocessor, mainly the amount of registers and shared memory each of your threads or blocks needs. The GTX 670 has 48 KB of shared memory per SM and 65536 32-bit registers per SM. So a 1024-thread block can use at most 64 registers per thread (65536 / 1024), and keeping the full 2048 resident threads per SM needs 32 or fewer registers per thread.
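You can query these limits at runtime instead of hard-coding them. A minimal sketch for device 0 (the printed fields are standard cudaDeviceProp members):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // total threads that can be resident on the whole GPU at once
    int resident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("SMs: %d, max threads/SM: %d, resident total: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor, resident);
    printf("registers/block: %d, shared memory/block: %zu bytes\n",
           prop.regsPerBlock, prop.sharedMemPerBlock);
    return 0;
}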
Sometimes one runs kernels with smaller blocks, e.g. 256 threads per block. The GTX 670 can run up to a maximum of 16 blocks per SM at the same time, but you cannot get more than 2048 resident threads per SM altogether (8 blocks of 256 threads already reach that limit). So nothing is gained.
To optimize your kernel itself, or to get graphical and numeric feedback about the efficiency and bottlenecks of your kernel, use the Nvidia Nsight Compute tool (if there is a version that still supports the 3.0 Kepler generation).
To get full speed, it is typically important to optimize memory accesses (coalescing) and to make sure that the 32 threads within a warp run in perfect lockstep. Additionally, you should try to replace accesses to global memory with accesses to shared memory (being careful about bank conflicts).
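To illustrate coalescing: in the first kernel below, consecutive threads read consecutive addresses, so each warp's 32 loads combine into a few wide transactions; in the second, the stride wastes most of every cache line fetched. A minimal sketch:

// coalesced: thread i reads element i
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// strided: thread i reads element i * 32, roughly one transaction per thread
__global__ void copyStrided(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * 32 < n) out[i] = in[i * 32];
}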

OpenCL - multiple threads on a gpu

After having parallelized a C++ code via OpenMP, I am now considering using the GPU (a Radeon Pro Vega II) to speed up specific parts of my code. Being an OpenCL neophyte, I am currently searching for examples that can show me how to implement a multicore CPU - GPU interaction.
Here is what I want to achieve. Suppose you have a fixed short array, say {1,2,3,4,5}, and that, as an exercise, you want to compute all of the possible "right shifts" of this array, i.e.,
{5,1,2,3,4}
{4,5,1,2,3}
{3,4,5,1,2}
{2,3,4,5,1}
{1,2,3,4,5}
The corresponding OpenCL code is quite straightforward.
Now, suppose that your CPU has many cores, say 56, that each core has a different starting array, and that at any random instant of time each CPU core may ask the GPU to compute the right shifts of its own array. This core, say core 21, should copy its own array into the GPU memory, run the kernel, and wait for the result. My question is: during this operation, could the other CPU cores submit a similar request without waiting for the completion of the task submitted by core 21?
Also, can core 21 perform in parallel another task while waiting for the completion of the GPU task?
Would you feel like suggesting some examples to look at?
Thanks!
The GPU works with a queue of kernel calls and (PCIe) memory transfers. Within one queue, it can work on non-blocking memory transfers and a kernel at the same time, but not on two consecutive kernels. You could create several queues (one per CPU core); then kernels from different queues can be executed in parallel, provided that each kernel only takes up a fraction of the GPU resources. While a queue is being executed on the GPU, its CPU core can perform a different task, and with the command queue.finish() the CPU will wait until the GPU is done.
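The question is about OpenCL, but the queue-per-producer pattern described above maps directly onto CUDA streams, for which examples may be easier to find. A minimal CUDA sketch (names and sizes are placeholders):

#include <cuda_runtime.h>

__global__ void work(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int N = 1 << 20, NQ = 4;      // 4 independent queues
    size_t bytes = N * sizeof(float);
    float *h_in[NQ], *h_out[NQ], *d_in[NQ], *d_out[NQ];
    cudaStream_t q[NQ];
    for (int i = 0; i < NQ; ++i) {
        cudaMallocHost(&h_in[i], bytes);   // pinned host memory, required
        cudaMallocHost(&h_out[i], bytes);  // for copy/compute overlap
        cudaMalloc(&d_in[i], bytes);
        cudaMalloc(&d_out[i], bytes);
        cudaStreamCreate(&q[i]);
    }
    for (int i = 0; i < NQ; ++i) {
        // async copy + kernel per queue; kernels from different queues may
        // overlap if each one leaves enough of the GPU's resources free
        cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, q[i]);
        work<<<(N + 255) / 256, 256, 0, q[i]>>>(d_in[i], d_out[i], N);
        cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, q[i]);
    }
    cudaStreamSynchronize(q[0]);  // wait for one queue, like queue.finish()
    cudaDeviceSynchronize();      // or wait for everything
    return 0;
}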
However, letting multiple CPU cores send tasks to a single GPU is bad practice and will not give you any performance advantage, while making your code over-complicated. Each small PCIe memory transfer has a large latency overhead, and small kernels that do not sufficiently saturate the GPU have bad performance.
The multi-CPU approach is only useful if each CPU sends tasks to its own dedicated GPU, and even then I would only recommend it if the VRAM of a single GPU is not enough or if you want more parallel throughput than a single GPU allows.
A better strategy is to feed the GPU from a single CPU core and - if there is some processing to do on the CPU side - only then parallelize across multiple CPU cores. By combining small data packets into a single large PCIe memory transfer and a single large kernel, you will saturate the hardware and get the best possible performance.
For more details on how the parallelization on the GPU works, see https://stackoverflow.com/a/61652001/9178992

Gflop vendor specification vs practical results

I have a question that I can't figure out.
I have an Nvidia GPU 750M, and from the specification it should have 722.7 GFlop/s (GPU specification), but when I try the test from the CUDA samples, it gives me about 67.64 GFlop/s.
Why such a big difference?
Thanks.
The peak performance can only be achieved when every core is busy executing FMA on every cycle, which is impossible in a real task.
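For reference, the headline number comes from cores x clock x 2, since an FMA counts as two floating-point operations. Assuming the 750M's 384 Kepler cores and the ~941 MHz boost clock implied by the quoted spec: 384 x 0.941 GHz x 2 ≈ 722.7 GFlop/s.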
Apart from the fact that no operation other than FMA is counted as 2 operations, there are stalls.
For a single kernel launch, if you do some sampling in Visual Profiler, you will notice something called a stall. Each operation takes time to finish, and if another operation relies on the result of a previous one, it has to wait. This eventually creates "gaps" where a core is left idle, waiting for a new operation to be ready for execution. Among these, device memory operations have HUGE latencies. If you don't do it right, your code will end up busy waiting on memory operations all the time.
Some tasks can be well optimized. If you test gemm in cuBLAS, it can reach over 80% of the peak FLOPS, on some devices even 90%. Other tasks simply cannot be optimized for FLOPS. For example, if you add one vector to another, the performance will always be limited by the memory bandwidth, and you will never see high FLOPS.

Will CUDA API affect CPU's Ram access performance?

* Update: more testing shows the CPU's RAM slowness is not related to CUDA. It turns out Func2 (CPU) is CPU-intensive but not memory-intensive, so for program1 the pressure on memory is lower, as it is Func2 that occupies the CPU. For program2 (GPU), as Func2 becomes very fast on the GPU, Func1 occupies the CPU and puts a lot of pressure on memory, leading to Func1's slowness. *
Short version: if I run 20 processes concurrently on the same server, I notice the CPU runs much slower when the GPU is involved (vs. pure CPU processes).
Long version:
My server: Win Server 2012, 48 cores (hyperthreaded from 24), 192 GB RAM (the 20 processes will only use ~40 GB), 4 K40 cards
My program1 (CPU Version):
For 30 iterations:
Func1(CPU): 6s (lots of CPU memory access)
Func2(CPU): 60s (lots of CPU memory access)
My program2 (GPU Version, to use CPU cores for Func1, and K40s for Func2):
//1 K40 will hold 5 contexts at the same time, till end of the 30 iterations
cudaSetDevice // very slow, can be a few seconds
cudaMalloc ~1GB // can be another few seconds
For 30 iterations:
Func1(CPU): 6s
Func2(GPU), 1s (60X speedup) //share GPU via named_mutex
If I run 20 copies of program1 (CPU) together, I notice Func1's 6s becomes 12s on average.
For 20 copies of program2 (GPU), Func1 takes ~42s to complete, while Func2 (GPU) is still ~1s (this 1s includes locking the GPU, some cudaMemcpy calls, and the kernel call; I presume it also includes GPU context switching). So it seems the GPU's own performance is not affected much, while the CPU's is (by the GPU).
So I suspect cudaSetDevice/cudaMalloc/cudaMemcpy is affecting the CPU's RAM access? If that's true, parallelization using both a multi-core CPU and a GPU will be affected.
Thanks.
This is almost certainly caused by resource contention.
When, using the standard API, you run 20 processes, you have 20 separate GPU contexts. Every time one of those processes wishes to perform an API call, there must be a context switch to that process's context. Context switching is expensive and has a lot of latency. That will be the source of the slowdown you are seeing. Nothing to do with memory performance.
NVIDIA has released a system called MPS (Multi-Process Server), which reimplements the CUDA API as a service and internally exploits the Hyper-Q facility of modern Tesla cards to push operations onto the wide command queue which Hyper-Q supports. This removes all the context-switching latency. It sounds like you might want to investigate this if performance is important to you and your code requires a large number of processes sharing a single device.

CUDA Warps and Optimal Number of Threads Per Block

From what I understand about Kepler GPUs, and CUDA in general, when a single SMX unit works on a block, it launches warps, which are groups of 32 threads. Now here are my questions:
1) If the SMX unit can work on 64 warps, that means there is a limit of 32x64 = 2048 threads per SMX unit. But Kepler GPUs have 4 warp schedulers, so does this mean that only 4 warps can be worked on simultaneously within a GPU kernel? And if so, does this mean I should really be looking for blocks that have multiples of 128 threads (assuming no divergence in threads) as opposed to the recommended 32? This is, of course, ignoring any divergence, or even cases where something like a global memory access can cause a warp to stall and have the scheduler switch to another.
2) If the above is correct, is the best possible outcome for a single SMX unit to work on 128 threads simultaneously? And for something like a GTX Titan that has 14 SMX units, a total of 128x14 = 1792 threads? I see numbers online that say otherwise: that a Titan can run 14x64 (max warps per SMX) x32 (threads per warp) = 28,672 concurrently. How can that be, if SMX units launch warps and only have 4 warp schedulers? They cannot launch all 2048 threads per SMX at once? Maybe I'm confused about the definition of the maximum number of threads the GPU can launch concurrently, versus what you are allowed to queue?
I appreciate answers and clarification on this.
so does this mean that only 4 warps can be worked on simultaneously within a GPU kernel?
Instructions from up to 4 warps can be scheduled in any given clock cycle on a kepler SMX. However due to pipelines in execution units, in any given clock cycle, instructions may be in various stages of pipeline execution from any and up to all warps currently resident on the SMX.
And if so, does this mean I should really be looking for blocks that have multiples of 128 threads (assuming no divergence in threads) as opposed to the recommended 32?
I'm not sure how you jumped from the previous point to this one. Since instruction mix presumably varies from warp to warp (since different warps are presumably at different points in the instruction stream) and instruction mix varies from one place to another in the instruction stream, I don't see any logical connection between 4 warps schedulable in a given clock cycle and any need to have groups of 4 warps. A given warp may be at a point where its instructions are highly schedulable (perhaps at a sequence of SP FMA, requiring SP cores, which are plentiful), while another 3 warps may be at another point in the instruction stream where their instructions are "harder to schedule" (perhaps requiring SFUs, of which there are fewer). Therefore arbitrarily grouping warps into sets of 4 doesn't make much sense. Note that we don't require divergence for warps to get out of sync with each other. The natural behavior of the scheduler, coupled with the varying availability of execution resources, could cause warps that were initially together to end up at different points in the instruction stream.
For your second question, I think your fundamental knowledge gap is in understanding how a GPU hides latency. Suppose a GPU has a set of 3 instructions to issue across a warp:
LD R0, a[idx]
LD R1, b[idx]
MPY R2, R0, R1
The first instruction is an LD from global memory, and it can be issued and does not stall the warp. The second instruction likewise can be issued. The warp will stall at the 3rd instruction, however, due to latency from global memory. Until R0 and R1 are properly populated, the multiply instruction cannot be dispatched. Latency from main memory prevents it. The GPU deals with this problem by (hopefully) having a ready supply of "other work" it can turn to, namely other warps in an unstalled state (i.e. that have an instruction that can be issued). The best way to facilitate this latency-hiding process is to have many warps available to the SMX. There isn't any granularity to this (such as needing 4 warps). Generally speaking, the more threads/warps/blocks in your grid, the better chance the GPU will have of hiding latency.
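In CUDA C, that instruction sequence corresponds to something like this sketch (the compiler emits the actual loads; the stall happens at the first use of the loaded values):

__global__ void mul(const float* a, const float* b, float* c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float r0 = a[idx];  // LD R0, a[idx] - issues, warp keeps going
    float r1 = b[idx];  // LD R1, b[idx] - issues, warp keeps going
    c[idx] = r0 * r1;   // MPY R2, R0, R1 - warp stalls here until r0 and r1 arrive
}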
So it is true that the GPU cannot "launch" 2048 threads (i.e. issue instructions from 2048 threads) in a single clock cycle. But when a warp stalls, it is put into a waiting queue until the stall condition is lifted, and until then, it is helpful to have other warps "ready to go", for the next clock cycle(s).
GPU latency hiding is a commonly misunderstood topic. There are many available resources to learn about it if you search for them.