From what I understand about Kepler GPUs, and CUDA in general, when a single SMX unit works on a block, it launches warps, which are groups of 32 threads. Now here are my questions:
1) If the SMX unit can work on 64 warps, that means there is a limit of 32x64 = 2048 threads per SMX unit. But Kepler GPUs have 4 warp schedulers, so does this mean that only 4 warps can be worked on simultaneously within a GPU kernel? And if so, does this mean I should really be looking for blocks that have multiples of 128 threads (assuming no divergence in threads) as opposed to the recommended 32? This is, of course, ignoring any divergence, or even cases where something like a global memory access causes a warp to stall and the scheduler switches to another.
2) If the above is correct, is the best possible outcome for a single SMX unit to work on 128 threads simultaneously? And for something like a GTX Titan, which has 14 SMX units, a total of 128x14 = 1792 threads? I see numbers online that say otherwise: that a Titan can run 14 x 64 (max warps per SMX) x 32 (threads per warp) = 28,672 threads concurrently. How can that be if SMX units launch warps and only have 4 warp schedulers? They cannot launch all 2048 threads per SMX at once. Maybe I'm confusing the maximum number of threads the GPU can launch concurrently with the number you are allowed to queue?
I appreciate answers and clarification on this.
so does this mean that only 4 warps can be worked on simultaneously within a GPU kernel?
Instructions from up to 4 warps can be scheduled in any given clock cycle on a Kepler SMX. However, due to the pipelining of the execution units, in any given clock cycle instructions may be in various stages of pipeline execution from any, and up to all, of the warps currently resident on the SMX.
And if so, does this mean I should really be looking for blocks that have multiples of 128 threads (assuming no divergence in threads) as opposed to the recommended 32?
I'm not sure how you jumped from the previous point to this one. Since the instruction mix presumably varies from warp to warp (different warps are presumably at different points in the instruction stream) and the instruction mix varies from one place to another in the instruction stream, I don't see any logical connection between 4 warps being schedulable in a given clock cycle and any need to have groups of 4 warps. A given warp may be at a point where its instructions are highly schedulable (perhaps at a sequence of SP FMA, requiring SP cores, which are plentiful), and another 3 warps may be at another point in the instruction stream where their instructions are "harder to schedule" (perhaps requiring SFUs, of which there are fewer). Therefore arbitrarily grouping warps into sets of 4 doesn't make much sense. Note that we don't require divergence for warps to get out of sync with each other. The natural behavior of the scheduler, coupled with the varying availability of execution resources, could cause warps that started out together to end up at different points in the instruction stream.
For your second question, I think your fundamental knowledge gap is in understanding how a GPU hides latency. Suppose a GPU has a set of 3 instructions to issue across a warp:
LD R0, a[idx]
LD R1, b[idx]
MPY R2, R0, R1
The first instruction is a LD from global memory, and it can be issued and does not stall the warp. The second instruction likewise can be issued. The warp will stall at the 3rd instruction, however, due to latency from global memory. Until R0 and R1 become properly populated, the multiply instruction cannot be dispatched. Latency from main memory prevents it. The GPU deals with this problem by (hopefully) having a ready supply of "other work" it can turn to, namely other warps in an unstalled state (i.e. that have an instruction that can be issued). The best way to facilitate this latency-hiding process is to have many warps available to the SMX. There isn't any granularity to this (such as needing 4 warps). Generally speaking, the more threads/warps/blocks that are in your grid, the better chance the GPU will have of hiding latency.
So it is true that the GPU cannot "launch" 2048 threads (i.e. issue instructions from 2048 threads) in a single clock cycle. But when a warp stalls, it is put into a waiting queue until the stall condition is lifted, and until then, it is helpful to have other warps "ready to go", for the next clock cycle(s).
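As a rough source-level sketch of this (hypothetical names, not code from your question), the following compiles to approximately the LD/LD/MPY sequence above:

__global__ void multiply(const float *a, const float *b, float *c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float r0 = a[idx];    // LD R0, a[idx]  -- issues without stalling the warp
        float r1 = b[idx];    // LD R1, b[idx]  -- issues without stalling the warp
        c[idx] = r0 * r1;     // MPY R2, R0, R1 -- cannot issue until r0 and r1 arrive
    }
}

// Launching with many blocks gives each SMX plenty of resident warps to switch
// to while any one warp is stalled on its loads:
// multiply<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);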
GPU latency hiding is a commonly misunderstood topic. There are many available resources to learn about it if you search for them.
As is known, there is the WARP (in CUDA) and the WaveFront (in OpenCL): http://courses.cs.washington.edu/courses/cse471/13sp/lectures/GPUsStudents.pdf
WARP in CUDA: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture
4.1. SIMT Architecture
...
A warp executes one common instruction at a time, so full efficiency
is realized when all 32 threads of a warp agree on their execution
path. If threads of a warp diverge via a data-dependent conditional
branch, the warp serially executes each branch path taken, disabling
threads that are not on that path, and when all paths complete, the
threads converge back to the same execution path. Branch divergence
occurs only within a warp; different warps execute independently
regardless of whether they are executing common or disjoint code
paths.
The SIMT architecture is akin to SIMD (Single Instruction, Multiple
Data) vector organizations in that a single instruction controls
multiple processing elements. A key difference is that SIMD vector
organizations expose the SIMD width to the software, whereas SIMT
instructions specify the execution and branching behavior of a single
thread.
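To illustrate (a minimal sketch, not taken from the quoted guide), branch divergence within a single CUDA warp looks like this:

__global__ void divergentKernel(int *out)
{
    int lane = threadIdx.x % 32;   // lane index within the warp
    if (lane < 16)
        out[threadIdx.x] = 1;      // one half of the warp takes this path
    else
        out[threadIdx.x] = 2;      // the other half takes this path
    // The warp executes both paths serially, masking off inactive lanes,
    // then the threads reconverge; other warps are unaffected.
}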
WaveFront in OpenCL: https://sites.google.com/site/csc8820/opencl-basics/opencl-terms-explained#TOC-Wavefront
During runtime, the first wavefront is sent to the compute unit to
run, then the second wavefront is sent to the compute unit, and so on.
Work items within one wavefront are executed in parallel and in lock
steps. But different wavefronts are executed sequentially.
I.e. we know that:
threads in a WARP (CUDA) are SIMT threads, which always execute the same instruction at the same time and always stay synchronized - i.e. the threads of a WARP are the same as the lanes of a SIMD unit (on a CPU)
threads in a WaveFront (OpenCL) are threads which always execute in parallel, but not necessarily all perform the exact same instruction, and not necessarily all of them are synchronized
But is there any guarantee that all of the threads in a WaveFront are always synchronized, like the threads in a WARP or the lanes of SIMD?
Conclusion:
WaveFront threads (work-items) are always synchronized - lock step: "wavefront executes a number of work-items in lock step relative to each other."
A WaveFront is mapped onto a SIMD block: "all work-items in the wavefront go to both paths of flow control"
I.e. each WaveFront thread (work-item) is mapped to a SIMD lane
page-20: http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf
Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing
1.1 Terminology
...
Wavefronts and work-groups are two concepts relating to compute
kernels that provide data-parallel granularity. A wavefront
executes a number of work-items in lock step relative to each
other. Sixteen work-items are executed in parallel across the vector
unit, and the whole wavefront is covered over four clock cycles. It
is the lowest level that flow control can affect. This means that if
two work-items inside of a wavefront go divergent paths of flow
control, all work-items in the wavefront go to both paths of flow
control.
This is true for: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf
(page-45) Chapter 2 OpenCL Performance and Optimization for GCN Devices
(page-81) Chapter 3 OpenCL Performance and Optimization for Evergreen and Northern Islands Devices
First, you can query some values:
CL_DEVICE_WAVEFRONT_WIDTH_AMD
CL_DEVICE_SIMD_WIDTH_AMD
CL_DEVICE_WARP_SIZE_NV
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
but only from the host side, as far as I know.
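A minimal host-side sketch of those queries (this assumes the vendor extension headers that define the *_AMD / *_NV tokens are available, and that you already have a valid device and kernel; each vendor query only succeeds on that vendor's device):

#include <CL/cl.h>
#include <CL/cl_ext.h>   // vendor tokens such as CL_DEVICE_WAVEFRONT_WIDTH_AMD

void queryWidths(cl_device_id device, cl_kernel kernel)
{
    cl_uint wavefrontWidth = 0, simdWidth = 0, warpSize = 0;
    size_t preferredMultiple = 0;

    // Vendor extension queries:
    clGetDeviceInfo(device, CL_DEVICE_WAVEFRONT_WIDTH_AMD, sizeof(wavefrontWidth), &wavefrontWidth, NULL);
    clGetDeviceInfo(device, CL_DEVICE_SIMD_WIDTH_AMD, sizeof(simdWidth), &simdWidth, NULL);
    clGetDeviceInfo(device, CL_DEVICE_WARP_SIZE_NV, sizeof(warpSize), &warpSize, NULL);

    // Core OpenCL: queried per kernel, per device.
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferredMultiple), &preferredMultiple, NULL);
}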
Let's assume these queries returned 64, and that your question is about the threads' implicit synchronization.
What if someone chooses a local range of 4?
Since OpenCL abstracts the hardware clockwork from the developer, you can't know the actual SIMD or wavefront width from within a kernel at runtime.
For example, an AMD NCU has 64 shaders, but it has 16-wide SIMD, 8-wide SIMD, 4-wide SIMD, 2-wide SIMD and even two scalar units inside the same compute unit.
4 local threads could be spread over two scalar units and one 2-wide unit, or any other combination of SIMDs. Kernel code can't know this. Even if it could somehow work it out, you can't know which SIMD combination will be used for the next kernel execution (or even the next workgroup) at runtime on a random compute unit (64 shaders).
Or a GCN CU, which has 4x16 SIMDs in it, could allocate 1 thread per SIMD, making all 4 threads totally independent. If they all reside in the same SIMD, you're lucky. There is no guarantee of knowing that "before" kernel execution. Even after you know, the next kernel could be different, since there is no guarantee that the same SIMD allocation will be chosen (background kernels, 3D visualization software, even the OS could be putting bubbles in the pipelines).
There is no way to command, hint, or query that N threads will run in the same SIMD or the same WARP before kernel execution. Then, inside the kernel, there is no command to get a thread's wavefront index the way get_global_id(0) gives the global index. Then, after the kernel, you can't rely on the array results, since the next kernel execution may not use the same SIMDs for exactly the same items. Some items from other wavefronts could even be swapped with an item from the current wavefront, purely as an optimization by the driver or hardware (NVIDIA has added a load balancer lately and could be doing this; AMD's NCU may use something similar in the future).
Even if you guess the right combination of threads on SIMDs for your hardware and driver, it could be totally different on another computer.
If it's from a performance point of view, you could try:
zero branches in kernel code
zero kernels running in the background
the GPU not being used for monitor output
the GPU not being used by some visualization software
just to make sure, with 99% probability, that there are no bubbles in the pipelines, so all threads retire an instruction at the same cycle (or at least synchronize on the latest-retiring one).
Or, add a fence after every instruction, synchronizing on global or local memory, which is very slow. Fences give work-item-level synchronization, barriers give local-group-level synchronization. There are no wavefront synchronization commands.
Then those threads that run within the same SIMD will behave synchronized, but you may not know which threads those are, nor on which SIMDs.
For the 4-thread example, using float16 for all calculations may let the driver use the 16-wide SIMDs of an AMD GCN CU, but then they are not threads anymore, only variables. Still, this should give better synchronization on the data than threads do.
There are more complex situations such as:
4 threads in the same SIMD, but one thread's calculation produces a NaN and triggers an extra normalization (taking 1-2 cycles, maybe). The 3 others have to wait for it to complete, even though their own work has no data-related slowdown.
4 threads in the same wavefront are in a loop and one of them gets stuck forever. Do the other 3 wait forever for the 4th to finish, or does the driver detect it and move it to another free/empty SIMD? Or does the 4th one wait for the other 3 at the same time, because they are not moving either?
4 threads doing atomic operations, one by one.
AMD's HD5000 series GPUs have a SIMD width of 4 (or 5), but the wavefront size is 64.
Wavefronts guarantee lockstep. That's why, on older compilers, you can omit the synchronizations if your local group contains only one wavefront. (You can no longer do this on newer compilers, which will interpret the dependency wrongly and give you wrong code. But on the other hand, newer compilers will omit the synchronizations for you if your local group contains only one wavefront.)
One stream processor is like one core of a CPU. It will repeatedly run one 16-wide vector instruction four times to cover the 64 so-called "threads" in a wavefront. Actually, one wavefront is more of a true thread than what we call a thread on a GPU.
I know that you should generally have at least 32 threads running per block in CUDA, since threads are executed in groups of 32. However, I was wondering if it is considered acceptable practice to have only one block with a bunch of threads (I know there is a limit on the number of threads). I am asking this because I have some problems that require the shared memory of threads and synchronization across every element of the computation. I want to launch my kernel like
computeSomething<<< 1, 256 >>>(...)
and just use the threads to do the computation.
Is it efficient to just have one block, or would I be better off just doing the computation on the CPU?
If you care about performance, it's a bad idea.
The principal reason is that a given threadblock can only occupy the resources of a single SM on a GPU. Since most GPUs have 2 or more SMs, this means you're leaving somewhere between 50% and over 90% of the GPU's performance untouched.
For performance, both of these kernel configurations are bad:
kernel<<<1, N>>>(...);
and
kernel<<<N, 1>>>(...);
The first is the case you're asking about. The second is the case of a single thread per threadblock; this leaves about 97% of the GPU horsepower untouched.
In addition to the above considerations, GPUs are latency hiding machines and like to have a lot of threads, warps, and threadblocks available, to select work from, to hide latency. Having lots of available threads helps the GPU to hide latency, which generally will result in higher efficiency (work accomplished per unit time.)
It's impossible to tell if it would be faster on the CPU. You would have to benchmark and compare. If all of the data is already on the GPU, and you would have to move it back to the CPU to do the work, and then move the results back to the GPU, then it might still be faster to use the GPU in a relatively inefficient way, in order to avoid the overhead of moving data around.
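As a rough sketch of the difference (the kernel body and signature here are made up; substitute your own):

__global__ void computeSomething(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                 // stand-in for the real per-element work
}

// Occupies only a single SM, no matter how many the GPU has:
// computeSomething<<<1, 256>>>(d_data, 256);

// Covers the same kind of work with many 256-thread blocks, so it can spread
// over every SM on the device:
// computeSomething<<<(n + 255) / 256, 256>>>(d_data, n);

If the algorithm genuinely needs synchronization across every element, the usual workaround is to split it into multiple kernel launches, since a kernel launch boundary acts as a device-wide synchronization point.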
The algorithm that I'm implementing has a number of things that need to be done in parallel. My question is: if I'm not going to use shared memory, should I prefer more blocks with fewer threads per block, or more threads per block with fewer blocks, for performance, so that the total number of threads adds up to the number of parallel things I need to do?
I assume the "set number of things" is a small number or you wouldn't be asking this question. Attempting to expose more parallelism might be time well spent.
CUDA GPUs group execution activity and the resultant memory accesses into warps of 32 threads. So at a minimum, you'll want to start by creating at least one warp per threadblock.
You'll then want to create at least as many threadblocks as you have SMs in your GPU. If you have 4 SMs, then your next scaling increment above 32 would be to create 4 threadblocks of 32 threads each.
If you have more than 128 "number of things" in this hypothetical example, then you will probably want to increase both warps per threadblock as well as threadblocks. You might start by adding threadblocks until you get to some number, perhaps around 16 or so, that would allow your code to scale up on GPUs larger than your hypothetical 4-SM GPU. But there are limits to the number of threadblocks that can be open on a single SM, so pretty quickly after 16 or so threadblocks you'll also want to increase the number of warps per threadblock beyond 1 (i.e. beyond 32 threads).
These strategies for small problems will allow you to take advantage of all the hardware on the GPU as quickly as possible as your problem scales up, while still allowing opportunities for latency hiding if your problem is large enough (eg. more than one warp per threadblock, or more than one threadblock resident per SM).
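Here is a hedged sketch of that scaling heuristic (doSomething and numThings are hypothetical placeholders, and 16 is just the rough block-count threshold mentioned above):

__global__ void doSomething(int *out, int numThings)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numThings)
        out[i] = i;                     // stand-in for the real per-item work
}

void launch(int *d_out, int numThings)
{
    int threadsPerBlock = 32;           // start with one warp per block
    int blocks = (numThings + threadsPerBlock - 1) / threadsPerBlock;
    // Once we have "enough" blocks (~16 here), grow the block size instead.
    while (blocks > 16 && threadsPerBlock < 1024) {
        threadsPerBlock *= 2;
        blocks = (numThings + threadsPerBlock - 1) / threadsPerBlock;
    }
    doSomething<<<blocks, threadsPerBlock>>>(d_out, numThings);
}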
Hi, a few questions regarding CUDA stream processing for multiple kernels.
Assume s streams and a kernel on a compute capability 3.5 Kepler device, where s <= 32.
The kernel uses a dev_input array of size n and a dev_output array of size s*n.
The kernel reads data from the input array, stores the value in a register, manipulates it, and writes its result back to dev_output at position j*n + tid, where j is the index of the stream/launch.
We aim to run the same kernel s times, using one of the s streams each time, similar to the simpleHyperQ example. Can you comment on if and how any of the following affects concurrency, please? (A minimal sketch of the launch pattern follows the list below.)
dev_input and dev_output are not pinned;
dev_input as it is vs. a dev_input of size s*n, where each kernel reads unique data (no read conflicts)
kernels read data from constant memory
10 KB of shared memory is allocated per block.
kernel uses 60 registers
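For reference, a minimal sketch of the launch pattern described above (the kernel body here is just a placeholder, not my real kernel; error checking omitted):

__global__ void kernel(const float *dev_input, float *dev_output, int n, int j)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float v = dev_input[tid];          // read input into a register
        v = v * 2.0f + 1.0f;               // placeholder manipulation
        dev_output[j * n + tid] = v;       // each launch writes its own slice
    }
}

// Host side: one launch per stream
// cudaStream_t streams[s];
// for (int j = 0; j < s; ++j) {
//     cudaStreamCreate(&streams[j]);
//     kernel<<<(n + 255) / 256, 256, 0, streams[j]>>>(dev_input, dev_output, n, j);
// }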
Any good comments will be appreciated...!!!
cheers,
Thanasio
Robert,
thanks a lot for your detailed answer. It has been very helpful. I edited point 4; it is 10 KB per block. So in my situation, I launch grids of 61 blocks and 256 threads. The kernels are rather computationally bound. I launch 8 streams of the same kernel, profile them, and then I see a very good overlap between the first two, and then it gets worse and worse. The kernel execution time is around 6 ms. After the first two streams execute almost perfectly concurrently, the rest have a 3 ms distance between them. Regarding point 5, I use a K20, which allows up to 255 registers per thread, so I would not expect drawbacks from there. I really cannot understand why I do not achieve concurrency equivalent to what is specified for GK110s.
Please take a look at the following link. There is an image called kF.png. It shows the profiler output for the streams.
https://devtalk.nvidia.com/default/topic/531740/cuda-programming-and-performance/concurrent-streams-and-hyperq-for-k20/
Concurrency amongst kernels depends upon a number of factors, but one that many people overlook is simply the size of the kernel (i.e. number of blocks in the grid.) Kernels that are of a size that can effectively utilize the GPU by themselves will not generally run concurrently to a large degree, and there would be little throughput advantage even if they did. The work distributor inside the GPU will generally begin distributing blocks as soon as a kernel is launched, so if one kernel is launched before another, and both have a large number of blocks, then the first kernel will generally occupy the GPU until it is nearly complete, at which point blocks of the second kernel will then get scheduled and executed, perhaps with a small amount of "concurrent overlap".
The main point is that kernels that have enough blocks to "fill up the GPU" will prevent other kernels from actually executing, and apart from scheduling, this isn't any different on a compute 3.5 device. In addition, rather than just specifying a few parameters for the kernel as a whole, specifying launch parameters and statistics (such as register usage, shared mem usage, etc.) at the block level is helpful for providing crisp answers. The benefits of the compute 3.5 architecture in this area will still mainly come from "small" kernels of "few" blocks attempting to execute together. Compute 3.5 has some advantages there.
You should also review the answer to this question.
When global memory used by the kernel is not pinned, it affects the speed of data transfer, and also the ability to overlap copy and compute but does not affect the ability of two kernels to execute concurrently. Nevertheless, the limitation on copy and compute overlap may skew the behavior of your application.
There shouldn't be "read conflicts", I'm not sure what you mean by that. Two independent threads/blocks/grids are allowed to read the same location in global memory. Generally this will get sorted out at the L2 cache level. As long as we are talking about just reads there should be no conflict, and no particular effect on concurrency.
Constant memory is a limited resource, shared amongst all kernels executing on the device (try running deviceQuery). If you have not exceeded the total device limit, then the only issue will be one of utilization of the constant cache, and things like cache thrashing. Apart from this secondary relationship, there is no direct effect on concurrency.
It would be more instructive to identify the amount of shared memory per block rather than per kernel. This will directly affect how many blocks can be scheduled on a SM. But answering this question would be much crisper also if you specified the launch configuration of each kernel, as well as the relative timing of the launch invocations. If shared memory happened to be the limiting factor in scheduling, then you can divide the total available shared memory per SM by the amount used by each kernel, to get an idea of the possible concurrency based on this. My own opinion is that number of blocks in each grid is likely to be a bigger issue, unless you have kernels that use 10k per grid but only have a few blocks in the whole grid.
My comments here would be nearly the same as my response to 4. Take a look at deviceQuery for your device, and if registers became a limiting factor in scheduling blocks on each SM, then you could divide available registers per SM by the register usage per kernel (again, it makes a lot more sense to talk about register usage per block and the number of blocks in the kernel) to discover what the limit might be.
Again, if you have reasonable sized kernels (hundreds or thousands of blocks, or more) then the scheduling of blocks by the work distributor is most likely going to be the dominant factor in the amount of concurrency between kernels.
EDIT: In response to the new information posted in the question, I've looked at kF.png.
First let's analyze from a blocks per SM perspective. CC 3.5 allows 16 "open" or currently scheduled blocks per SM. If you are launching 2 kernels of 61 blocks each, that may well be enough to fill the "ready-to-go" queue on the CC 3.5 device. Stated another way, the GPU can handle 2 of these kernels at a time. As the blocks of one of those kernels "drains" then another kernel is scheduled by the work distributor. The blocks of the first kernel "drain" sufficiently in about half the total time, so that the next kernel gets scheduled about halfway through the completion of the first 2 kernels, so at any given point (draw a vertical line on the timeline) you have either 2 or 3 kernels executing simultaneously. (The 3rd kernel launched overlaps the first 2 by about 50% according to the graph, I don't agree with your statement that there is a 3ms distance between each successive kernel launch). If we say that at peak we have 3 kernels scheduled (there are plenty of vertical lines that will intersect 3 kernel timelines) and each kernel has ~60 blocks, then that is about 180 blocks. Your K20 has 13 SMs and each SM can have at most 16 blocks scheduled on it. This means at peak you have about 180 blocks scheduled (perhaps) vs. a theoretical peak of 16*13 = 208. So you're pretty close to max here, and there's not much more that you could possibly get. But maybe you think you're only getting 120/208, I don't know.
Now let's take a look from a shared memory perspective. A key question is what is the setting of your L1/shared split? I believe it defaults to 48KB of shared memory per SM, but if you've changed this setting that will be pretty important. Regardless, according to your statement each block scheduled will use 10KB of shared memory. This means we would max out around 4 blocks scheduled per SM, or 4*13 total blocks = 52 blocks max that can be scheduled at any given time. You're clearly exceeding this number, so probably I don't have enough information about the shared memory usage by your kernels. If you're really using 10kb/block, this would more or less preclude you from having more than one kernel's worth of threadblocks executing at a time. There could still be some overlap, and I believe this is likely to be the actual limiting factor in your application. The first kernel of 60 blocks gets scheduled. After a few blocks drain (or perhaps because the 2 kernels were launched close enough together) the second kernel begins to get scheduled, so nearly simultaneously. Then we have to wait a while for about a kernel's worth of blocks to drain before the 3rd kernel can get scheduled, this may well be at the 50% point as indicated in the timeline.
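As a sketch of that arithmetic using the CUDA runtime device properties (the 10 KB figure is taken from your statement, and a 48 KB shared split is assumed):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
size_t smemPerBlock = 10 * 1024;                                            // 10 KB per block, per your statement
int blocksPerSM   = (int)(prop.sharedMemPerMultiprocessor / smemPerBlock);  // ~4 with a 48 KB split
int maxResident   = blocksPerSM * prop.multiProcessorCount;                 // ~52 on a 13-SM K20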
Anyway I think the analyses 1 and 2 above clearly suggest you're getting most of the capability out of the device, based on the limitations inherent in your kernel structure. (We could do a similar analysis based on registers to discover if that is a significant limiting factor.) Regarding this statement: "I really cannot understand why I do not achieve concurrency equivalent to what is specified for GK110s," I hope you see that the concurrency spec (e.g. 32 kernels) is a maximum spec, and in most cases you are going to run into some other kind of machine limit before you hit the limit on the maximum number of kernels that can execute simultaneously.
EDIT: regarding documentation and resources, the answer I linked to above from Greg Smith provides some resource links. Here are a few more:
The CUDA C Programming Guide has a section on Asynchronous Concurrent Execution.
GPU Concurrency and Streams presentation by Dr. Steve Rennich at NVIDIA is on the NVIDIA webinar page
My experience with HyperQ so far is a 2-3x parallelization of my (compute capability 3.5) kernels, as the kernels are usually larger for slightly more complex calculations. With small kernels it's a different story, but usually the kernels are more complicated.
This is also noted by NVIDIA in the CUDA 5.0 documentation: more complex kernels will reduce the amount of parallelization.
But still, the GK110 has a great advantage simply in allowing this.
Why does performance improve when I run more than 32 threads per block?
My graphics card has 480 CUDA cores (15 SMs * 32 SPs).
Each SM has 1-4 warp schedulers (Tesla = 1, Fermi = 2, Kepler = 4). Each warp scheduler is responsible for executing a subset of the warps allocated to the SM. Each warp scheduler maintains a list of eligible warps. A warp is eligible if it can issue an instruction on the next cycle. A warp is not eligible if it is stalled on a data dependency, waiting to fetch an instruction, or if the execution unit for the next instruction is busy. On each cycle, each warp scheduler will pick a warp from the list of eligible warps and issue 1 or 2 instructions.
The more active warps per SM, the larger the number of warps each warp scheduler will have to pick from on each cycle. In most cases, optimal performance is achieved when there are sufficient active warps per SM to have 1 eligible warp per warp scheduler per cycle. Increasing occupancy beyond this point does not increase performance and may decrease performance.
A typical target for active warps is 50-66% of the maximum warps for the SM. The ratio of warps to maximum warps supported by a launch configuration is called Theoretical Occupancy. The runtime ratio of active warps per cycle to maximum warps per cycle is Achieved Occupancy. For a GTX480 (a CC 2.0 device), a good starting point when designing a kernel is 50-66% Theoretical Occupancy. A CC 2.0 SM can have a maximum of 48 warps; 50% occupancy means 24 warps, or 768 threads, per SM.
The CUDA Profiling Activity in Nsight Visual Studio Edition can show the theoretical occupancy, achieved occupancy, active warps per SM, eligible warps per SM, and stall reasons.
The CUDA Visual Profiler, nvprof, and the command line profiler can show theoretical occupancy, active warps, and achieved occupancy.
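If you prefer to compute the theoretical numbers programmatically rather than reading them from the profiler, here is a hedged sketch using the CUDA occupancy API (added in later CUDA releases; myKernel is a hypothetical stand-in for your kernel):

#include <cstdio>

__global__ void myKernel(float *data) { data[threadIdx.x] *= 2.0f; }   // hypothetical kernel

void reportOccupancy()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;
    int maxBlocksPerSM = 0;
    // Maximum resident blocks of myKernel per SM for this block size:
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel, blockSize, 0);

    int activeWarps = maxBlocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;   // 48 on a CC 2.0 SM
    printf("Theoretical occupancy: %.0f%%\n", 100.0f * activeWarps / maxWarps);
}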
NOTE: The count of CUDA cores should only be used to compare cards of similar architectures, to calculate theoretical FLOPS, and to potentially compare differences between architectures. Do not use the count when designing algorithms.
Welcome to Stack Overflow. The reason is that CUDA cores are pipelined. On Fermi, the pipeline is around 20 clocks long. This means that to saturate the GPU, you may need up to 20 threads per core.
The primary reason is the memory latency hiding model of CUDA. Most modern CPUs use cache to hide the latency to main memory. This results in a large percentage of chip resources being devoted to cache. Most desktop and server processors have several megabytes of cache on the die, which actually accounts for most of the die space. In order to pack on more cores with the same energy usage and heat dissipation characteristics, CUDA-based chips instead devote their chip space to tons of CUDA cores (which are mostly just floating-point ALUs). Since there is very little cache, they instead rely on having more threads ready to run while other threads are waiting on memory accesses to return, in order to hide that latency. This gives the cores something productive to work on while some warps are waiting on memory accesses. The more warps per SM, the more chance one of them will be runnable at any given time.
CUDA also has zero-cost thread switching in order to aid in this memory-latency-hiding scheme. A normal CPU incurs a large overhead to switch from executing one thread to the next, due to the need to store all of the register values for the thread it is switching away from onto the stack and then load all of the ones for the thread it is switching to. CUDA SMs simply have tons and tons of registers, so each thread has its own set of physical registers assigned to it for the life of the thread. Since there is no need to store and load register values, each SM can execute threads from one warp on one clock cycle and threads from a different warp on the very next clock cycle.