CUDA - what if I choose too many blocks?

I'm still going mad over these unknown-size matrices, which may vary from 10 to 20,000 in each dimension.
I'm looking at the CUDA SDK and wondering: what if I choose a number of blocks that is too high?
Something like a grid of 9999 x 9999 blocks in the X and Y dimensions: if my hardware's SMs can't hold all of these blocks, will the kernel have problems, or will performance simply collapse?
I don't know how to dimension the blocks/threads for something that may vary so much. I'm thinking of using the MAXIMUM number of blocks my hardware supports and then making the threads inside them work across the whole matrix. Is this the right way?

Thread blocks do not have a one-to-one mapping with the cores. Blocks are scheduled onto SMs as they become available, which means you can request as many as you want (up to the grid-dimension limits). Requesting a huge number of blocks would just slow the system down as the hardware loads and unloads do-nothing thread blocks onto the SMs.
You can specify the dimensions of the grid and blocks at run time.
Edit: Here are the limits on the dimensions of the grid and the blocks, from the documentation.

If you choose an excessively large grid, you waste some cycles while the "dead" blocks get retired (typically only on the order of a few tens of microseconds, even for the maximum grid size on a "full size" Fermi or GT200 card). It isn't a huge penalty.
But the grid dimensions should always be computable a priori. Usually there is a known relationship between a quantifiable unit of data-parallel work and a thread or block - something like one thread per data point, or one block per matrix column - which allows the required grid dimensions to be calculated at runtime.
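As a minimal sketch of that idea (the kernel and block shape here are illustrative, not from the answer), the grid can be derived from the matrix size with a ceiling division:

```
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per matrix element.
__global__ void scale(float *data, int rows, int cols, float alpha)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)              // guard the padded edge threads
        data[row * cols + col] *= alpha;
}

void launch_scale(float *d_data, int rows, int cols)
{
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x,  // ceiling division: just enough
              (rows + block.y - 1) / block.y); // blocks to cover every element
    scale<<<grid, block>>>(d_data, rows, cols, 2.0f);
}
```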
An alternative strategy is to use a fixed number of blocks (usually something like 4-8 per multiprocessor is enough) and have each block/thread process multiple units of parallel work, so that each block becomes "persistent". If there is a lot of fixed per-thread setup overhead, this can be a good way to amortize that fixed cost across more work per thread.
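A sketch of that persistent-block approach might look like the following (the 8-blocks-per-SM factor and the kernel body are assumptions for illustration):

```
#include <cuda_runtime.h>

// Each thread strides across the whole array, so a small fixed grid
// handles any problem size ("persistent" blocks).
__global__ void scale_persistent(float *data, size_t n, float alpha)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= alpha;
}

void launch_persistent(float *d_data, size_t n)
{
    int device = 0, numSMs = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    scale_persistent<<<numSMs * 8, 256>>>(d_data, n, 2.0f);  // ~8 blocks per SM
}
```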

Related

How to enqueue as many kernels as there are threads available on OpenCL 1.2

I'm trying to do some calculations that start off with roughly 10-20 objects, but the calculations on those objects create another 20-40, and so on - slightly recursive, though not forever; eventually the amount of work reaches zero. I have considered using a different tool, but it's a bit too late for that. It's an odd request, which is probably why no search results came up.
In short, I'm wondering how to set the global work size to as many threads as are available. For example, if the GPU can run X threads in parallel, the global work size would be set to X.
Edit: it would also work if I could launch more kernels from within the GPU, but that doesn't look possible in version 1.2.
There is no practical limit on global work size (above 2^32 work-items you have to use 64-bit ulong indexing to avoid integer overflow), and the hard limit of 2^64 work-items is so large that you can never come close to it.
If you need a billion threads, then set the global work size to a billion threads. The GPU scheduler and hardware will handle that just fine, even if the GPU only has a few thousand physical cores. In fact, you should always launch many more threads than there are cores on the GPU; otherwise the hardware won't be fully saturated and you lose performance.
The only issue could be running out of GPU memory.
Launching kernels from within kernels is only possible in OpenCL 2.0-2.2, on AMD or Intel GPUs.
It sounds like each iteration depends on the result of the previous one. In that case, your limiting factor is not the number of available threads. You cannot cause some work-items to "wait" for others submitted by the same kernel enqueueing API call (except to a limited extent within a work group).
If you have an OpenCL 2.0+ implementation at your disposal, you can queue subsequent iterations dynamically from within the kernel. If not, and you have established that your bottleneck is checking whether another iteration is required and the subsequent kernel submission, you could try the following:
Assuming a work-item can trivially determine how many threads are actually needed for an iteration based on the output of the previous iteration, you could speculatively enqueue multiple batches of the kernel, each of which depends on the completion event of the previous batch. Inside the kernel, you can exit early if the thread ID is greater than or equal to the number of threads required in that iteration.
This only works if you either have a hard upper bound or can make a reasonable guess that will yield sensible results (with acceptable perf characteristics if the guess is wrong) for:
The maximum number of iterations.
The number of work-items required on each iteration.
Submitting, say UINT32_MAX work items for each iteration will likely not make any sense in terms of performance, as the number of work-items that fail the check for whether they are needed will dominate.
You can work around incorrect guesses for the latter number by surrounding the calculation with a loop, so that work item N will calculate both item N and M+N if the number of items on an iteration exceeds M, where M is the enqueued work size for that iteration.
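The kernel-side shape of that early-exit-plus-stride pattern might look something like the sketch below. It is written in CUDA-style C++ for brevity; in OpenCL C the index and the launched size come from get_global_id(0) and get_global_size(0), and `needed` would typically be read from a buffer written by the previous iteration. All names and the calculation are placeholders.

```
// Hypothetical per-iteration kernel: `needed` is how many work-items this
// iteration really requires; `launched` copies were actually enqueued.
__global__ void iteration_step(const float *in, float *out, unsigned int needed)
{
    unsigned int launched = gridDim.x * blockDim.x;
    // Stride loop: if the guess was too small (launched < needed),
    // item N also handles N + launched, N + 2*launched, ...
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < needed;                        // early exit when i >= needed
         i += launched)
    {
        out[i] = in[i] * 0.5f;              // placeholder for the real work
    }
}
```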
Incorrect guesses for the number of iterations would need to be detected on the host, and more iterations enqueued.
So it becomes a case of performing a large number of runs with different guesses and gathering statistics on how good the guesses are and what overall performance they yielded.
I can't say whether this will yield acceptable performance in general - it really depends on the calculations you are performing and whether they are a good fit for GPU-style parallelism, and whether the overhead of the early-out for a potentially large number of work items becomes a problem.

Why can my GPU program execute, although the number of blocks exceeds the number of resident blocks?

I'm working on a Tesla M6 GPU. According to its datasheet, the Tesla M6 has 12 multiprocessors, each of which holds a maximum of 32 resident blocks, so the maximum total number of blocks resident on the entire device is 384.
Now I have a data matrix of size (512, 1408). I wrote a kernel and set the number of threads per block to 64 (1D block, one data element per thread), so the 1D grid size is 512*1408/64 = 11264 blocks, which is far beyond the number of resident blocks on the GPU. However, the whole program still runs and outputs correct results.
I wonder why the code can execute even though the actual number of blocks exceeds the resident limit. Does it mean performance deterioration? Could you explain it to me in detail? Thanks!
A GPU can hold many more blocks than what can be resident according to your calculation.
The GPU loads up as many blocks as it can on SMs, and the remainder wait in a queue. As blocks finish their work on SMs and retire, they open up space for new blocks to be selected from the queue and made "resident". Eventually, the GPU processes all blocks this way.
There isn't anything necessarily wrong with this approach; it is typical for GPU programming. It does not necessarily mean performance deterioration. However, one approach to tuning kernels for maximum performance is to choose the number of blocks based on how many can be "resident". The calculation of how many can be resident, if properly done, is more complex than what you have outlined. It requires occupancy analysis. CUDA provides an occupancy API to do this analysis at runtime.
This approach will also require design of a kernel that can get work done with an arbitrary or fixed size grid, rather than a grid size selected based on the problem size. One typical approach for this is a grid-stride loop.
If you combine a kernel design like grid-stride loop, with a choice of blocks at runtime based on occupancy analysis, then you can get your work done with only the blocks that are "resident" on the GPU; none need be in the queue, waiting. This may or may not have any tangible performance benefits. Only by benchmarking will you know for sure.
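As a sketch of what that combination might look like (the kernel, block size, and buffer names are illustrative), using the runtime occupancy API together with a grid-stride loop:

```
#include <cuda_runtime.h>

// Grid-stride kernel: produces correct results with any grid size.
__global__ void process(const float *in, float *out, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i] * in[i];
}

void launch(const float *d_in, float *d_out, size_t n)
{
    int device = 0, numSMs = 0, blocksPerSM = 0;
    const int threads = 256;                         // assumed block size
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    // How many blocks of `process` can be resident on one SM at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, process, threads, 0);
    // Launch only as many blocks as can be resident: no blocks wait in the queue.
    process<<<numSMs * blocksPerSM, threads>>>(d_in, d_out, n);
}
```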
I suggest reading both articles I linked before asking follow-up questions. There are also many questions on the cuda tag discussing the concepts in this answer.
Threads in a thread block can have dependencies on each other. Programming models such as cooperative groups allow for larger groups than a thread block. The number of thread blocks in a grid can be orders of magnitude greater than the number of resident thread blocks (e.g. the minimum is 1 thread block; GV100 supports 84 x 32 = 2688 resident thread blocks).
The compute work distributor assigns thread blocks to SMs. If the grid is preempted, the state is saved and later restored. When all threads in a thread block complete, the thread block's resources are released (warp slots, registers, shared memory) and the compute work distributor is notified. The compute work distributor continues to assign thread blocks to SMs until all work in the grid completes.

OpenCL: how lightweight are GPU threads?

I keep reading that GPU threads are lightweight and you can throw many tasks at them to complete in parallel....but how lightweight are they, exactly?
Let's say I have a million-member float3 array, and I want to calculate the length of each float3 value.
Does it make sense to send essentially 1 million tasks to the GPU (so the kernel calculates a single float3 length of the global array and returns)? Or something more like 1000 tasks, and each kernel execution loops through 1000 members of the array? If there is a benefit to grouping tasks like that, is there a way to calculate the optimal size of each grouping?
If we're talking about GPUs only, the answer is - very lightweight.
Does it make sense to send essentially 1 million tasks to the GPU
You're not "sending a million tasks" to the GPU. You're sending a single request, which is a few dozen bytes, which essentially says "please launch a million copies of this code with the grid coordinates i give you here". Those "copies" are created on the fly by hardware inside the GPU, and yes it's very efficient.
1000 tasks, and each kernel execution loops through 1000 members of the array
On a GPU, you almost certainly don't want to do this. A modern high-end GPU has easily 4000+ processing units, so you need at minimum that amount of concurrency. But usually much higher. There is a scheduler which picks one hardware thread to run on each of those processing units, and usually there are several dozen hardware threads per processing unit. So it's not unusual to see a GPU with 100K+ hardware threads. This is required to hide memory latencies.
So if you launch a kernel with a 1000x1 grid size, easily 3/4 of your GPU could be unused, and the used part will spend 90% of its time waiting for memory. Go ahead and try it out. The GPU has been designed to handle ridiculous amounts of threads - don't be afraid to use them.
Now, if you're talking about a CPU, that's a slightly different matter. CPUs obviously don't have thousands of hardware threads. Here, it depends on the OpenCL implementation - but I think most reasonable CPU OpenCL implementations today will handle this for you, by processing the work in loops, on just enough hardware threads for your CPU.
TL;DR: use the "1 million tasks" solution, and perhaps try tuning the local work size.
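For reference, a one-thread-per-element sketch of the length calculation (written here in CUDA-style C++ for illustration; the OpenCL C version is nearly identical, with get_global_id(0) as the index and the built-in length() function):

```
#include <cuda_runtime.h>
#include <math.h>

// One thread per float3: compute its Euclidean length.
__global__ void lengths(const float3 *v, float *out, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sqrtf(v[i].x * v[i].x + v[i].y * v[i].y + v[i].z * v[i].z);
}

void launch_lengths(const float3 *d_v, float *d_out, size_t n)
{
    const int threads = 256;                        // local work size to tune
    const int blocks  = (int)((n + threads - 1) / threads);
    lengths<<<blocks, threads>>>(d_v, d_out, n);    // ~1 million threads is fine
}
```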

CUDA block or thread preference

The algorithm that I'm implementing has a number of things that need to be done in parallel. My question is: if I'm not going to use shared memory, should I prefer more blocks with fewer threads per block, or more threads per block with fewer blocks, for performance, so that the total number of threads adds up to the number of parallel things I need to do?
I assume the "set number of things" is a small number or you wouldn't be asking this question. Attempting to expose more parallelism might be time well spent.
CUDA GPUs group execution activity and the resultant memory accesses into warps of 32 threads. So at a minimum, you'll want to start by creating at least one warp per threadblock.
You'll then want to create at least as many threadblocks as you have SMs in your GPU. If you have 4 SMs, then your next scaling increment above 32 would be to create 4 threadblocks of 32 threads each.
If you have more than 128 "number of things" in this hypothetical example, then you will probably want to increase both warps per threadblock as well as threadblocks. You might start with threadblocks until you get to some number, perhaps around 16 or so, that would allow your code to scale up on GPUs larger than your hypothetical 4-SM GPU. But there are limits to the number of threadblocks that can be open on a single SM, so pretty quickly after 16 or so threadblocks you'll also want to increase the number of warps per threadblock beyond 1 (i.e. beyond 32 threads).
These strategies for small problems will allow you to take advantage of all the hardware on the GPU as quickly as possible as your problem scales up, while still allowing opportunities for latency hiding if your problem is large enough (eg. more than one warp per threadblock, or more than one threadblock resident per SM).
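One possible reading of that heuristic as code (the 16-blocks-per-SM cap and the 4-warp block size are illustrative values, not rules from the answer):

```
#include <cuda_runtime.h>
#include <algorithm>

// Pick a launch configuration for `nThings` independent work items:
// start with one warp per block, grow the block count first, then the block size.
void choose_config(int nThings, int numSMs, dim3 &grid, dim3 &block)
{
    const int warpSize  = 32;
    const int maxBlocks = 16 * numSMs;                   // illustrative cap
    int blocksNeeded    = (nThings + warpSize - 1) / warpSize;

    if (blocksNeeded <= maxBlocks) {
        block = dim3(warpSize);                          // one warp per block
        grid  = dim3(std::max(blocksNeeded, 1));
    } else {
        block = dim3(4 * warpSize);                      // grow warps per block
        grid  = dim3((nThings + 4 * warpSize - 1) / (4 * warpSize));
    }
}
```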

CUDA stream processing for multiple kernels - disambiguation

Hi, a few questions regarding CUDA stream processing for multiple kernels.
Assume s streams and a kernel on a compute capability 3.5 Kepler device, where s <= 32.
The kernel uses a dev_input array of size n and a dev_output array of size s*n.
The kernel reads data from the input array, stores the value in a register, manipulates it, and writes its result back to dev_output at position s*n + tid.
We aim to run the same kernel s times, using one of the s streams each time, similar to the simpleHyperQ example. Can you comment on whether and how any of the following affects concurrency, please?
1. dev_input and dev_output are not pinned;
2. dev_input as it is, vs. dev_input of size s*n, where each kernel reads unique data (no read conflicts);
3. kernels read data from constant memory;
4. 10 KB of shared memory is allocated per block;
5. the kernel uses 60 registers.
Any good comments will be appreciated...!!!
cheers,
Thanasio
Robert,
Thanks a lot for your detailed answer; it has been very helpful. I edited item 4: it is 10 KB per block. So in my situation, I launch grids of 61 blocks and 256 threads. The kernels are rather compute-bound. I launch 8 streams of the same kernel, profile them, and then I see a very good overlap between the first two, and then it gets worse and worse. The kernel execution time is around 6 ms. After the first two streams execute almost perfectly concurrently, the rest have a 3 ms gap between them. Regarding item 5, I use a K20, which has a 255-register-per-thread limit, so I would not expect drawbacks from there. I really cannot understand why I do not achieve concurrency equivalent to what is specified for the GK110.
Please take a look at the following link. There is an image called kF.png; it shows the profiler output for the streams.
https://devtalk.nvidia.com/default/topic/531740/cuda-programming-and-performance/concurrent-streams-and-hyperq-for-k20/
Concurrency amongst kernels depends upon a number of factors, but one that many people overlook is simply the size of the kernel (i.e. number of blocks in the grid.) Kernels that are of a size that can effectively utilize the GPU by themselves will not generally run concurrently to a large degree, and there would be little throughput advantage even if they did. The work distributor inside the GPU will generally begin distributing blocks as soon as a kernel is launched, so if one kernel is launched before another, and both have a large number of blocks, then the first kernel will generally occupy the GPU until it is nearly complete, at which point blocks of the second kernel will then get scheduled and executed, perhaps with a small amount of "concurrent overlap".
The main point is that kernels that have enough blocks to "fill up the GPU" will prevent other kernels from actually executing, and apart from scheduling, this isn't any different on a compute 3.5 device. In addition, rather than just specifying a few parameters for the kernel as a whole, specifying launch parameters and statistics (such as register usage, shared mem usage, etc.) at the block level is helpful for providing crisp answers. The benefits of the compute 3.5 architecture in this area will still mainly come from "small" kernels of "few" blocks attempting to execute together. Compute 3.5 has some advantages there.
You should also review the answer to this question.
When global memory used by the kernel is not pinned, it affects the speed of data transfer, and also the ability to overlap copy and compute but does not affect the ability of two kernels to execute concurrently. Nevertheless, the limitation on copy and compute overlap may skew the behavior of your application.
There shouldn't be "read conflicts", I'm not sure what you mean by that. Two independent threads/blocks/grids are allowed to read the same location in global memory. Generally this will get sorted out at the L2 cache level. As long as we are talking about just reads there should be no conflict, and no particular effect on concurrency.
Constant memory is a limited resource, shared amongst all kernels executing on the device (try running deviceQuery). If you have not exceeded the total device limit, then the only issue will be one of utilization of the constant cache, and things like cache thrashing. Apart from this secondary relationship, there is no direct effect on concurrency.
It would be more instructive to identify the amount of shared memory per block rather than per kernel. This will directly affect how many blocks can be scheduled on a SM. But answering this question would be much crisper also if you specified the launch configuration of each kernel, as well as the relative timing of the launch invocations. If shared memory happened to be the limiting factor in scheduling, then you can divide the total available shared memory per SM by the amount used by each kernel, to get an idea of the possible concurrency based on this. My own opinion is that number of blocks in each grid is likely to be a bigger issue, unless you have kernels that use 10k per grid but only have a few blocks in the whole grid.
My comments here would be nearly the same as my response to 4. Take a look at deviceQuery for your device, and if registers became a limiting factor in scheduling blocks on each SM, then you could divide available registers per SM by the register usage per kernel (again, it makes a lot more sense to talk about register usage per block and the number of blocks in the kernel) to discover what the limit might be.
Again, if you have reasonable sized kernels (hundreds or thousands of blocks, or more) then the scheduling of blocks by the work distributor is most likely going to be the dominant factor in the amount of concurrency between kernels.
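For reference, a minimal sketch of the launch pattern described in the question - the same kernel launched once per stream, with each launch writing to its own n-sized slice of dev_output (the kernel body and sizes are placeholders, and the slice indexing here is one interpretation of the question's s*n + tid):

```
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernel: read one input element, do some work, write it to this
// launch's slice of the output (slice index passed per launch).
__global__ void work(const float *dev_input, float *dev_output, int n, int slice)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        dev_output[slice * n + tid] = dev_input[tid] * 2.0f;  // stand-in math
}

void launch_in_streams(const float *d_in, float *d_out, int n, int s)
{
    std::vector<cudaStream_t> streams(s);
    for (int i = 0; i < s; ++i)
        cudaStreamCreate(&streams[i]);

    // One launch per stream; whether these actually overlap depends on the
    // per-block resource usage and grid sizes discussed above.
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    for (int i = 0; i < s; ++i)
        work<<<blocks, threads, 0, streams[i]>>>(d_in, d_out, n, i);

    for (int i = 0; i < s; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```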
EDIT: in response to the new information posted in the question, I've looked at kF.png.
First let's analyze from a blocks per SM perspective. CC 3.5 allows 16 "open" or currently scheduled blocks per SM. If you are launching 2 kernels of 61 blocks each, that may well be enough to fill the "ready-to-go" queue on the CC 3.5 device. Stated another way, the GPU can handle 2 of these kernels at a time. As the blocks of one of those kernels "drains" then another kernel is scheduled by the work distributor. The blocks of the first kernel "drain" sufficiently in about half the total time, so that the next kernel gets scheduled about halfway through the completion of the first 2 kernels, so at any given point (draw a vertical line on the timeline) you have either 2 or 3 kernels executing simultaneously. (The 3rd kernel launched overlaps the first 2 by about 50% according to the graph, I don't agree with your statement that there is a 3ms distance between each successive kernel launch). If we say that at peak we have 3 kernels scheduled (there are plenty of vertical lines that will intersect 3 kernel timelines) and each kernel has ~60 blocks, then that is about 180 blocks. Your K20 has 13 SMs and each SM can have at most 16 blocks scheduled on it. This means at peak you have about 180 blocks scheduled (perhaps) vs. a theoretical peak of 16*13 = 208. So you're pretty close to max here, and there's not much more that you could possibly get. But maybe you think you're only getting 120/208, I don't know.
Now let's take a look from a shared memory perspective. A key question is: what is the setting of your L1/shared split? I believe it defaults to 48 KB of shared memory per SM, but if you've changed this setting, that will be pretty important. Regardless, according to your statement each scheduled block will use 10 KB of shared memory. This means we would max out at around 4 blocks scheduled per SM, or 4*13 = 52 blocks max that can be scheduled at any given time. You're clearly exceeding this number, so probably I don't have enough information about the shared memory usage of your kernels. If you're really using 10 KB/block, this would more or less preclude you from having more than one kernel's worth of threadblocks executing at a time. There could still be some overlap, and I believe this is likely to be the actual limiting factor in your application. The first kernel of 60 blocks gets scheduled. After a few blocks drain (or perhaps because the 2 kernels were launched close enough together) the second kernel begins to get scheduled, nearly simultaneously. Then we have to wait a while for about a kernel's worth of blocks to drain before the 3rd kernel can get scheduled; this may well be at the 50% point indicated in the timeline.
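The back-of-the-envelope arithmetic above can be written out as follows (all figures are the ones assumed in this answer, not measured values):

```
// Occupancy arithmetic for the K20 case discussed above.
constexpr int smCount          = 13;   // K20 (GK110) SM count
constexpr int maxBlocksPerSM   = 16;   // CC 3.5 limit on scheduled blocks per SM
constexpr int sharedPerSM_KB   = 48;   // default L1/shared split
constexpr int sharedPerBlockKB = 10;   // stated usage per block

constexpr int blockLimit = smCount * maxBlocksPerSM;                       // 208
constexpr int smemLimit  = (sharedPerSM_KB / sharedPerBlockKB) * smCount;  // 4 * 13 = 52
static_assert(smemLimit < blockLimit, "shared memory is the binding limit here");
```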
Anyway I think the analyses 1 and 2 above clearly suggest you're getting most of the capability out of the device, based on the limitations inherent in your kernel structure. (We could do a similar analysis based on registers to discover if that is a significant limiting factor.) Regarding this statement: "I really cannot understand why i do not achieve concurrency equivalent to what is specified for gk110s.." I hope you see that the concurrency spec (e.g. 32 kernels) is a maximum spec, and in most cases you are going to run into some other kind of machine limit before you hit the limit on the maximum number of kernels that can execute simultaneously.
EDIT: regarding documentation and resources, the answer I linked to above from Greg Smith provides some resource links. Here are a few more:
The CUDA C Programming Guide has a section on Asynchronous Concurrent Execution.
The GPU Concurrency and Streams presentation by Dr. Steve Rennich at NVIDIA is on the NVIDIA webinar page.
My experience with HyperQ so far is 2-3 (3.5) times parallelization of my kernels, as the kernels are usually larger for somewhat more complex calculations. With small kernels it's a different story, but usually the kernels are more complicated.
This is also addressed by NVIDIA in their CUDA 5.0 documentation: more complex kernels will reduce the amount of parallelization.
But still, the GK110 has a great advantage just in allowing this.