OpenCL: launching concurrent kernels

As far as I understand, executing kernels concurrently (in my case the same kernel with different I/O data) requires launching them on separate compute units (streaming multiprocessors, SMs), each with its own workgroups.
For example, the GTX 960M has 5 SMs (compute units in OpenCL). If I call clEnqueueNDRangeKernel 5 times, asynchronously and out of order, each time with its own 16x16 (2D) workgroup, will all 5 compute units execute them concurrently? The reported local memory is 64 KB. Is that shared by all compute units, or does each one have its own 64 KB?

Each CU has either 4 (Maxwell/Pascal) or 2 (Turing/Ampere, AMD) Warps. A Warp is a group of 32 CUDA cores / stream processors in hardware.
All threads running in one Warp have to execute exactly the same instruction at the same time. Within a Warp, threads cannot follow different branches simultaneously; divergent branches are executed one after the other. Two Warps within a CU can handle different branches, but not different kernels at the same time.
If you execute two kernels in different queues in parallel on your 960m with 5 CUs, Kernel 1 can have for example 3 CUs and kernel 2 can have the remaining 2. But a CU cannot be split to run multiple kernels at the same time.
In OpenCL you can set the workgroup size to some multiple of the Warp size (32). Either 4 (workgroup size 32), 2 (workgroup size 64) or 1 (workgroup size 128 or greater) OpenCL workgroups can be executed at one moment on one Maxwell CU.
The amount of local memory, in your case 64 KB, is per CU. So if you have a large workgroup of, for example, 256 threads, each thread has less local memory available than with a workgroup size of 64, because all threads in the workgroup share the local memory of the one CU they run on.
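To put numbers on the sharing described above, here is a minimal C++ sketch. It is plain arithmetic, not OpenCL API code: the helper name `local_bytes_per_work_item` is made up for illustration, and it assumes a single workgroup has the whole local memory of one CU to itself.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: the share of a CU's local memory available to each
// work-item when one workgroup occupies the whole local memory of that CU.
std::size_t local_bytes_per_work_item(std::size_t local_mem_per_cu,
                                      std::size_t workgroup_size) {
    assert(workgroup_size > 0);
    return local_mem_per_cu / workgroup_size;
}
```

With the 64 KB from the question, `local_bytes_per_work_item(64 * 1024, 256)` gives 256 bytes per work-item, while a workgroup of 64 leaves 1024 bytes per work-item.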

Related

C++ CUDA Gridsize meaning clarification

I am new to CUDA programming. I am currently running Monte Carlo simulations on a large number of large data samples.
I'm trying to dynamically calculate and maximize the number of blocks to submit to the GPU. The issue I have is that I am unclear on how to calculate the maximum number of blocks I can submit to my GPU at one time.
Here is the output of my GPU when querying it:
-----------------------------------------------
CUDA Device #: 0
Name: NVIDIA GeForce GTX 670
Revision number: 3.0
Warp size: 32
Maximum threads per block: 1024
Maximum Grid size: 2147483647
Multiprocessor Count: 7
-----------------------------------------------
What I am unclear on is that the maximum number of threads per block is clearly defined as 1024, but the grid size is not (at least to me). When I looked around in the documentation and online, the definition is as follows:
int cudaDeviceProp::maxGridSize[3] [inherited]
Maximum size of each dimension of a grid
What I want to know is what the grid size refers to:
1) The maximum total number of threads that can be submitted to the GPU? (In that case I would calculate the number of blocks as MAX_GRID_SIZE / MAX_THREAD_PER_BLOCK.)
2) The maximum number of blocks of 1024 threads? (In that case I would simply use MAX_GRID_SIZE.)
The second one seems kind of insane to me, since MAX_GRID_SIZE = 2^31-1 (2147483647), so the maximum number of threads would be (2^31-1)*1024 = ~2.2 trillion threads. That is why I tend to think the first option is correct, but I am looking for outside input.
I have found many discussions about calculating the number of blocks, but almost all of them were specific to one GPU and did not cover the general way of calculating it or thinking about it.
On Nvidia CUDA the grid size signifies the number of blocks (not the number of threads), which are sent to the GPU in one kernel invocation.
The maximum grid size can be, and is, huge, because the CUDA programming model does not (normally) give any guarantee that blocks run at the same time. This helps to run the same kernels on low-end and high-end hardware of different generations. So the grid is for independent tasks, whereas the threads within a block can cooperate (especially through shared memory and synchronization barriers).
So a very large grid is more or less the same as an automatic loop around your kernel invocation or within your kernel around your code.
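The "automatic loop" view can be sketched on the CPU. The following C++ is only a stand-in for a kernel launch, not CUDA code; `blocks_needed` and `double_all` are names made up for this illustration. The outer loop plays the role of the grid, the inner loop the role of one block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed so that blocks * threads_per_block >= n.
std::size_t blocks_needed(std::size_t n, std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// CPU stand-in for a kernel launch: the two loops play the role of the grid
// and the block, and every "thread" handles at most one element.
std::vector<int> double_all(const std::vector<int>& in,
                            std::size_t threads_per_block) {
    std::vector<int> out(in.size());
    const std::size_t grid = blocks_needed(in.size(), threads_per_block);
    for (std::size_t block = 0; block < grid; ++block) {
        for (std::size_t thread = 0; thread < threads_per_block; ++thread) {
            const std::size_t i = block * threads_per_block + thread;
            if (i < in.size()) {  // guard for the partially filled last block
                out[i] = 2 * in[i];
            }
        }
    }
    return out;
}
```

For 1000 elements and 256 threads per block, `blocks_needed` returns 4. On a real GPU the iterations of the outer loop run in no guaranteed order and only partly at the same time, which is why the blocks have to be independent.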
If you want to maximize the occupancy (parallel efficiency) of your GPU, you should calculate how many threads can run at the same time.
The typical maximum is the maximum number of resident threads per SM times the number of SMs. The GTX 670 has 7 SMs (called SMX for that generation). On compute capability 3.0, each SMX can hold up to 2048 resident threads (64 warps), while a single block is limited to 1024 threads. So for maximum occupancy you can run a multiple of 7x2048 threads.
There are other limiting factors for the number of threads per multiprocessor, mainly the number of registers and the amount of shared memory that each of your threads or blocks needs. The GTX 670 has 48 KB of shared memory per SM and 65536 32-bit registers per SM. So if your kernel uses 64 registers per thread, 1024 threads fit on an SM at once; to reach the full 2048 threads, you need 32 registers per thread or fewer.
Sometimes one runs kernels with less than the maximum block size, e.g. 256 threads per block. The GTX 670 can run up to 16 blocks per SM at the same time, but never more than 2048 threads per SM altogether, so 8 such blocks already fill an SM and the remaining block slots gain nothing.
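The interplay of these limits can be sketched as a small calculation. This is plain C++ arithmetic rather than a query of the CUDA runtime; `SmLimits` and `resident_blocks_per_sm` are hypothetical names, and the limits in the usage note (2048 resident threads, 16 resident blocks, 65536 registers per SM) are my assumptions for a compute capability 3.0 device. Shared memory and register allocation granularity are ignored.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits; the values used in the usage note are assumptions for a
// compute capability 3.0 device, not queried from hardware.
struct SmLimits {
    int max_threads;  // resident threads per SM
    int max_blocks;   // resident blocks per SM
    int registers;    // 32-bit registers per SM
};

// Blocks of `threads_per_block` threads that fit on one SM at once,
// considering only the thread, block and register limits.
int resident_blocks_per_sm(const SmLimits& sm, int threads_per_block,
                           int registers_per_thread) {
    assert(threads_per_block > 0 && registers_per_thread > 0);
    const int by_threads = sm.max_threads / threads_per_block;
    const int by_registers =
        sm.registers / (threads_per_block * registers_per_thread);
    return std::min({sm.max_blocks, by_threads, by_registers});
}
```

With `SmLimits{2048, 16, 65536}`, blocks of 256 threads at 64 registers per thread give 4 resident blocks (1024 threads), limited by registers. At 32 registers per thread the same blocks give 8 resident blocks (2048 threads), limited by the thread count, well below the 16-block limit.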
To optimize your kernel itself, or to get nice graphical and numeric feedback about its efficiency and bottlenecks, use NVIDIA's Nsight Compute tool (if there is a version that still supports the compute capability 3.0 Kepler generation).
To get full speed, it is typically important to optimize memory accesses (coalescing) and to make sure that the 32 threads within a warp are running in perfect lockstep. Additionally you should try to replace accesses to global memory with accesses to shared memory (be careful about bank conflicts).

Concurrency of cuFFT streams

I am using cuFFT combined with the CUDA stream feature. The problem I have is that I can't seem to make the cuFFT kernels run with full concurrency. The following are the results I got from nvvp. Each stream runs a 2D batched FFT kernel on 128 images of size 128x128. I set up 3 streams to run 3 independent batched FFT plans.
As can be seen from the figure, some memory copies (yellow bars) ran concurrently with some kernel computations (purple, brown and pink bars). But the kernels did not run concurrently at all; as you can see, each kernel strictly followed the previous one. The following is the code I used for the memory copies to the device and the kernel launches.
for (unsigned int j = 0; j < NUM_IMAGES; j++) {
    gpuErrchk( cudaMemcpyAsync( dev_pointers_in[j],
                                image_vector[j],
                                NX*NY*NZ*sizeof(SimPixelType),
                                cudaMemcpyHostToDevice,
                                streams_fft[j] ) );
    gpuErrchk( cudaMemcpyAsync( dev_pointers_out[j],
                                out,
                                NX*NY*NZ*sizeof(cufftDoubleComplex),
                                cudaMemcpyHostToDevice,
                                streams_fft[j] ) );
    cufftExecD2Z( planr2c[j],
                  (SimPixelType*)dev_pointers_in[j],
                  (cufftDoubleComplex*)dev_pointers_out[j] );
}
Then I changed my code so that all memory copies finish first (synchronize) and all kernels are then sent to their streams at once, and I got the following profiling result:
This confirmed that the kernels were not running concurrently.
I looked at one link which explains in detail how to set things up for full concurrency by either passing the "--default-stream per-thread" command line argument to nvcc or adding #define CUDA_API_PER_THREAD_DEFAULT_STREAM before you #include or in your code. It is a feature introduced in CUDA 7. I ran the sample code from the above link on my MacBook Pro Retina 15" with a GeForce GT750M (the same machine as used in the above link), and I was able to get concurrent kernel runs. But I was not able to get my cuFFT kernels to run in parallel.
Then I found this link, where someone says that a cuFFT kernel will occupy the whole GPU, so no two cuFFT kernels can run in parallel. Then I was stuck, since I didn't find any formal documentation addressing whether cuFFT allows concurrent kernels. Is this true? Is there a way to get around it?
I assume you called cufftSetStream() prior to the code you have shown, appropriate for each planr2c[j], so that each plan is associated with a separate stream. I don't see it in the code you posted. If you actually want cufft kernels to overlap with other cufft kernels, it's necessary for those kernels to be launched to separate streams. So the cufft exec call for image 0 would have to be launched into a different stream than the cufft exec call for image 1, for example.
In order for any two CUDA operations to have the possibility to overlap, they must be launched into different streams.
Having said that, concurrent memory copies with kernel execution, but not concurrent kernels, is about what I would expect for reasonably sized FFTs.
A 128x128 FFT to a first order approximation will spin up ~15,000 threads, so if my thread blocks are ~500 threads each, that would be 30 threadblocks, which will keep a GPU fairly occupied, leaving not much "room" for additional kernels. (You can actually discover the total blocks and threads for a kernel in the profiler itself.) Your GT750m probably has 2 Kepler SMs with a maximum of 16 blocks per SM so a max instantaneous capacity of 32 blocks. And this capacity number could be reduced for a specific kernel due to shared memory usage, register usage, or other factors.
The instantaneous capacity of whatever GPU you are running on (max blocks per SM * number of SMs) will determine the potential for overlap (concurrency) of kernels. If you exceed that capacity with a single kernel launch, then that will "fill" the GPU, preventing kernel concurrency for some time period.
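The capacity argument above can be written out as a short calculation. This is a plain C++ sketch with made-up helper names (`instantaneous_block_capacity`, `free_block_slots`); it uses only the block-count limit and ignores the register and shared memory limits mentioned earlier, which can reduce the real capacity further.

```cpp
#include <cassert>

// Upper bound on the number of blocks resident on the whole GPU at one
// instant, based only on the per-SM block limit.
int instantaneous_block_capacity(int num_sms, int max_blocks_per_sm) {
    assert(num_sms > 0 && max_blocks_per_sm > 0);
    return num_sms * max_blocks_per_sm;
}

// Block slots left for a second kernel while the first one is resident
// (zero if the first kernel already fills the GPU).
int free_block_slots(int capacity, int blocks_of_first_kernel) {
    return blocks_of_first_kernel >= capacity
               ? 0
               : capacity - blocks_of_first_kernel;
}
```

For the estimate above (2 SMs with 16 blocks each, one FFT kernel of about 30 blocks), the capacity is 32 blocks and only 2 slots remain for any other kernel.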
It should be theoretically possible for CUFFT kernels to run concurrently. But just like any kernel concurrency scenario, CUFFT or otherwise, the resource usage of those kernels would have to be pretty low to actually witness concurrency. Typically when you have low resource usage, it implies kernels with a relatively small number of threads/threadblocks. These kernels don't usually take long to execute, making it even more difficult to actually witness concurrency (because launch latency and other latency factors may get in the way). The easiest way to witness concurrent kernels is to have kernels with unusually low resource requirements combined with unusually long run times. This is generally not the typical scenario, for CUFFT kernels or any other kernels.
Overlap of copy and compute is still a useful feature of streams with CUFFT. And the concurrency idea, without a basis of understanding of the machine capacity and resource constraints, is somewhat unreasonable in itself. For example, if kernel concurrency were arbitrarily achievable ("I should be able to make any 2 kernels run concurrently"), without consideration of capacity or resource specifics, then after you get two kernels running concurrently, the next logical step would be to go to 4, 8, 16 kernels concurrently. But the reality is that the machine can't handle that much work simultaneously. Once you've exposed enough parallelism (loosely translated as "enough threads") in a single kernel launch, exposing additional work parallelism via additional kernel launches normally cannot make the machine run any faster, or process the work quicker.

CUDA Warps and Optimal Number of Threads Per Block

What I understand about Kepler GPUs, and CUDA in general, is that when a single SMX unit works on a block, it launches warps, which are groups of 32 threads. Now here are my questions:
1) If the SMX unit can work on 64 warps, that means there is a limit of 32x64 = 2048 threads per SMX unit. But Kepler GPUs have 4 warp schedulers, so does this mean that only 4 warps can be worked on simultaneously within a GPU kernel? And if so, does this mean I should really be looking for blocks that have multiples of 128 threads (assuming no divergence in threads) as opposed to the recommended 32? This is of course, ignoring any divergence or even cases where something like a global memory access can cause a warp to stall and have the scheduler switch to another.
2) If the above is correct, is the best possible outcome for a single SMX unit to work on 128 threads simultaneously? And for something like a GTX Titan that has 14 SMX units, a total of 128x14 = 1792 threads? I see numbers online that say otherwise: that a Titan can run 14 x 64 (max warps per SMX) x 32 (threads per warp) = 28,672 threads concurrently. How can that be if SMX units launch warps and only have 4 warp schedulers? Can they not launch all 2048 threads per SMX at once? Maybe I'm confusing the maximum number of threads the GPU can launch concurrently with what you are allowed to queue?
I appreciate answers and clarification on this.
"so does this mean that only 4 warps can be worked on simultaneously within a GPU kernel?"
Instructions from up to 4 warps can be scheduled in any given clock cycle on a Kepler SMX. However, due to pipelines in execution units, in any given clock cycle, instructions may be in various stages of pipeline execution from any and up to all warps currently resident on the SMX.
"And if so, does this mean I should really be looking for blocks that have multiples of 128 threads (assuming no divergence in threads) as opposed to the recommended 32?"
I'm not sure how you jumped from the previous point to this one. Since instruction mix presumably varies from warp to warp (since different warps are presumably at different points in the instruction stream) and instruction mix varies from one place to another in the instruction stream, I don't see any logical connection between 4 warps schedulable in a given clock cycle, and any need to have groups of 4 warps. A given warp may be at a point where its instructions are highly schedulable (perhaps at a sequence of SP FMA, requiring SP cores, which are plentiful), and another 3 warps may be at another point in the instruction stream where their instructions are "harder to schedule" (perhaps requiring SFUs, which there are fewer of). Therefore arbitrarily grouping warps into sets of 4 doesn't make much sense. Note that we don't require divergence for warps to get out of sync with each other. The natural behavior of the scheduler, coupled with the varying availability of execution resources, could cause warps that were initially together to end up at different points in the instruction stream.
For your second question, I think your fundamental knowledge gap is in understanding how a GPU hides latency. Suppose a GPU has a set of 3 instructions to issue across a warp:
LD R0, a[idx]
LD R1, b[idx]
MPY R2, R0, R1
The first instruction is a LD from global memory, and it can be issued and does not stall the warp. The second instruction likewise can be issued. The warp will stall at the 3rd instruction, however, due to latency from global memory. Until R0 and R1 become properly populated, the multiply instruction cannot be dispatched. Latency from main memory prevents it. The GPU deals with this problem by (hopefully) having a ready supply of "other work" it can turn to, namely other warps in an unstalled state (i.e. that have an instruction that can be issued). The best way to facilitate this latency-hiding process is to have many warps available to the SMX. There isn't any granularity to this (such as needing 4 warps). Generally speaking, the more threads/warps/blocks that are in your grid, the better chance the GPU will have of hiding latency.
So it is true that the GPU cannot "launch" 2048 threads (i.e. issue instructions from 2048 threads) in a single clock cycle. But when a warp stalls, it is put into a waiting queue until the stall condition is lifted, and until then, it is helpful to have other warps "ready to go", for the next clock cycle(s).
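The two numbers from the question measure different things, which a small calculation makes explicit. This is a plain C++ sketch with made-up function names; the figures in the usage note (14 SMX units, 64 resident warps per SMX, 4 warp schedulers per SMX) are the ones quoted in the question.

```cpp
#include <cassert>

const int kWarpSize = 32;

// Threads that can be resident (have their state held on chip) at once.
int max_resident_threads(int num_sms, int max_warps_per_sm) {
    assert(num_sms > 0 && max_warps_per_sm > 0);
    return num_sms * max_warps_per_sm * kWarpSize;
}

// Threads whose warps can be picked by the warp schedulers in one clock
// cycle, counting one warp per scheduler.
int threads_schedulable_per_cycle(int num_sms, int schedulers_per_sm) {
    assert(num_sms > 0 && schedulers_per_sm > 0);
    return num_sms * schedulers_per_sm * kWarpSize;
}
```

For the GTX Titan, `max_resident_threads(14, 64)` is 28672 and `threads_schedulable_per_cycle(14, 4)` is 1792. The gap between the two is the pool of warps the schedulers can switch to while other warps are stalled.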
GPU latency hiding is a commonly misunderstood topic. There are many available resources to learn about it if you search for them.

CUDA block or thread preference

The algorithm that I'm implementing has a number of things that need to be done in parallel. My question is: if I'm not going to use shared memory, should I prefer more blocks with fewer threads per block, or more threads per block with fewer blocks, for performance, given that the total number of threads adds up to the number of parallel things I need to do?
I assume the "set number of things" is a small number or you wouldn't be asking this question. Attempting to expose more parallelism might be time well spent.
CUDA GPUs group execution activity and the resultant memory accesses into warps of 32 threads. So at a minimum, you'll want to start by creating at least one warp per threadblock.
You'll then want to create at least as many threadblocks as you have SMs in your GPU. If you have 4 SMs, then your next scaling increment above 32 would be to create 4 threadblocks of 32 threads each.
If you have more than 128 "number of things" in this hypothetical example, then you will probably want to increase both warps per threadblock as well as threadblocks. You might start with threadblocks until you get to some number, perhaps around 16 or so, that would allow your code to scale up on GPUs larger than your hypothetical 4-SM GPU. But there are limits to the number of threadblocks that can be open on a single SM, so pretty quickly after 16 or so threadblocks you'll also want to increase the number of warps per threadblock beyond 1 (i.e. beyond 32 threads).
These strategies for small problems will allow you to take advantage of all the hardware on the GPU as quickly as possible as your problem scales up, while still allowing opportunities for latency hiding if your problem is large enough (e.g. more than one warp per threadblock, or more than one threadblock resident per SM).
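One way to read the scaling strategy above is as a small rule for picking a launch configuration. The following C++ is only a sketch of that reading: `LaunchConfig` and `choose_launch` are hypothetical names, the threshold of 16 blocks comes from the "16 or so" above, and the 1024-thread cap per block is an assumption. It does not check the number of SMs.

```cpp
#include <cassert>

struct LaunchConfig {
    int blocks;
    int threads_per_block;
};

// Sketch of the scaling strategy: keep one warp per block while the block
// count is small, then grow the block size once more than `block_target`
// blocks would be needed. `block_target` (16) and the 1024-thread cap are
// illustrative assumptions.
LaunchConfig choose_launch(int n_things, int block_target = 16) {
    assert(n_things > 0 && block_target > 0);
    const int warp_size = 32;
    const int max_threads_per_block = 1024;
    int threads = warp_size;
    int blocks = (n_things + threads - 1) / threads;
    while (blocks > block_target && threads < max_threads_per_block) {
        threads += warp_size;  // add one warp per block
        blocks = (n_things + threads - 1) / threads;
    }
    return LaunchConfig{blocks, threads};
}
```

For 128 things this yields 4 blocks of 32 threads, matching the 4-SM example. For 1024 things it yields 16 blocks of 64 threads, and for very large inputs the block size saturates at 1024 while the block count keeps growing.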

CUDA performance improves when running more threads than there are cores

Why does performance improve when I run more than 32 threads per block?
My graphics card has 480 CUDA cores (15 SM * 32 SP).
Each SM has 1-4 warp schedulers (Tesla = 1, Fermi = 2, Kepler = 4). Each warp scheduler is responsible for executing a subset of the warps allocated to the SM. Each warp scheduler maintains a list of eligible warps. A warp is eligible if it can issue an instruction on the next cycle. A warp is not eligible if it is stalled on a data dependency, waiting to fetch an instruction, or the execution unit for the next instruction is busy. On each cycle each warp scheduler will pick a warp from the list of eligible warps and issue 1 or 2 instructions.
The more active warps per SM, the larger the number of warps each warp scheduler will have to pick from on each cycle. In most cases, optimal performance is achieved when there are sufficient active warps per SM to have 1 eligible warp per warp scheduler per cycle. Increasing occupancy beyond this point does not increase performance and may decrease performance.
A typical target for active warps is 50-66% of the maximum warps for the SM. The ratio of warps to maximum warps supported by a launch configuration is called Theoretical Occupancy. The runtime ratio of active warps per cycle to maximum warps per cycle is Achieved Occupancy. For a GTX480 (CC 2.0 device) a good starting point when designing a kernel is 50-66% Theoretical Occupancy. CC 2.0 SM can have a maximum of 48 warps. A 50% occupancy means 24 warps or 768 threads per SM.
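The occupancy figures above follow from simple arithmetic, sketched here in plain C++. The function name `theoretical_occupancy_percent` is made up for illustration, the result is truncated to a whole percent, and the calculation assumes the given number of blocks actually fits on the SM.

```cpp
#include <cassert>

// Theoretical occupancy in percent: active warps relative to the SM maximum.
int theoretical_occupancy_percent(int blocks_per_sm, int threads_per_block,
                                  int max_warps_per_sm) {
    assert(threads_per_block > 0 && max_warps_per_sm > 0);
    const int warp_size = 32;
    // A partially filled warp still occupies a whole warp slot.
    const int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    const int active_warps = blocks_per_sm * warps_per_block;
    return 100 * active_warps / max_warps_per_sm;
}
```

On a CC 2.0 SM with 48 warps, 3 resident blocks of 256 threads are 24 warps (768 threads), so `theoretical_occupancy_percent(3, 256, 48)` gives 50. Four such blocks give 32 warps, which is the upper end of the 50-66% target range.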
The CUDA Profiling Activity in Nsight Visual Studio Edition can show the theoretical occupancy, achieved occupancy, active warps per SM, eligible warps per SM, and stall reasons.
The CUDA Visual Profiler, nvprof, and the command line profiler can show theoretical occupancy, active warps, and achieved occupancy.
NOTE: The count of CUDA cores should only be used to compare cards of similar architectures, to calculate theoretical FLOPS, and to potentially compare differences between architectures. Do not use the count when designing algorithms.
Welcome to Stack Overflow. The reason is that CUDA cores are pipelined. On Fermi, the pipeline is around 20 clocks long. This means that to saturate the GPU, you may need up to 20 threads per core.
The primary reason is the memory latency hiding model of CUDA. Most modern CPUs use cache to hide the latency to main memory. This results in a large percentage of chip resources being devoted to cache. Most desktop and server processors have several megabytes of cache on the die, which actually accounts for most of the die space. In order to pack on more cores with the same energy usage and heat dissipation characteristics, CUDA-based chips instead devote their chip space to throwing on tons of CUDA cores (which are mostly just floating-point ALUs). Since there is very little cache, they instead rely on having more threads ready to run while other threads are waiting on memory accesses to return in order to hide that latency. This gives the cores something productive to work on while some warps are waiting on memory accesses. The more warps per SM, the greater the chance that one of them will be runnable at any given time.
CUDA also has zero-cost thread switching in order to aid in this memory-latency-hiding scheme. A normal CPU incurs a large overhead when switching from one thread to the next, because it has to store all of the register values of the thread it is switching away from onto the stack and then load all of the values for the thread it is switching to. CUDA SMs just have tons and tons of registers, so each thread has its own set of physical registers assigned to it for the life of the thread. Since there is no need to store and load register values, each SM can execute threads from one warp on one clock cycle and threads from a different warp on the very next clock cycle.