CUDA concurrent kernel execution with multiple kernels per stream - concurrency

Using different streams for CUDA kernels makes concurrent kernel execution possible. Therefore n kernels on n streams could theoretically run concurrently if they fit into the hardware, right?
Now I'm facing the following problem: there are not n distinct kernels but n*m kernels, where the m kernels of each stream need to be executed in order. For instance n=2 and m=3 would lead to the following execution scheme with streams:
Stream 1: <<<Kernel 0.1>>> <<<Kernel 1.1>>> <<<Kernel 2.1>>>
Stream 2: <<<Kernel 0.2>>> <<<Kernel 1.2>>> <<<Kernel 2.2>>>
My naive assumption is that the kernels x.1 and y.2 should execute concurrently (from a theoretical point of view) or at least not consecutively (from a practical point of view). But my measurements show that this is not the case and that execution is in fact consecutive (i.e. K0.1, K1.1, K2.1, K0.2, K1.2, K2.2). The kernels themselves are very small, so concurrent execution should not be a problem.
Now my approach would be to implement a kind of dispatching to make sure the kernels are enqueued into the GPU's scheduler in an interleaved fashion. But when dealing with a large number of streams / kernels this could do more harm than good.
Alright, coming straight to the point: What would be an appropriate (or at least different) approach to solve this situation?
Edit: The measurements are done using CUDA events. I measure the time needed to complete the whole computation, i.e. the GPU has to execute all n * m kernels. The assumption is: with fully concurrent kernel execution, the total time is ideally about 1/n of the time needed to execute all kernels in order, provided that two or more kernels can actually be executed concurrently. I'm ensuring this by using only two distinct streams right now.
I can measure a clear difference in execution time between dispatching the kernels interleaved across the streams and enqueueing them stream by stream, i.e.:
Loop: i = 0 to m
    EnqueueKernel(Kernel i.1, Stream 1)
    EnqueueKernel(Kernel i.2, Stream 2)
versus
Loop: i = 1 to n
    Loop: j = 0 to m
        EnqueueKernel(Kernel j.i, Stream i)
The latter leads to a longer runtime.
Edit #2: Changed the stream numbers to begin at 1 (instead of 0, see comments below).
Edit #3: The hardware is an NVIDIA Tesla M2090 (i.e. Fermi, compute capability 2.0).

On Fermi (aka Compute Capability 2.0) hardware it is best to interleave kernel launches to multiple streams rather than to launch all kernels to one stream, then the next stream, etc. This is because the hardware can immediately launch kernels to different streams if there are sufficient resources, whereas if subsequent launches are to the same stream there is often delay introduced, reducing concurrency. This is the reason that your first approach performs better, and this approach is the one you should choose.
Enabling profiling can also disable concurrency on Fermi, so be careful with that. Also, be careful about using CUDA events during your launch loop, as these can interfere -- best to time the whole loop using events as you are doing, for example.
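For illustration, here is a minimal, self-contained sketch of the interleaved (breadth-first) launch order described above, with the whole launch loop timed by CUDA events as suggested; the kernel body, sizes and launch configuration are placeholders rather than the asker's actual code:

// Hedged sketch: breadth-first kernel launches across streams, timed with events.
#include <cstdio>

__global__ void smallKernel(float *data, int n)   // placeholder kernel
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

int main()
{
    const int nStreams = 2;      // n in the question
    const int mKernels = 3;      // m in the question
    const int N = 1 << 10;       // placeholder problem size per stream

    float *d_data;
    cudaMalloc(&d_data, nStreams * N * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < mKernels; ++i)             // position in each stream's sequence
        for (int s = 0; s < nStreams; ++s)         // interleave across streams (breadth-first)
            smallKernel<<<4, 256, 0, streams[s]>>>(d_data + s * N, N);
    cudaEventRecord(stop, 0);                      // legacy default stream: recorded after all streams drain
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("total time for all n*m kernels: %f ms\n", ms);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Swapping the two launch loops gives the depth-first (stream-by-stream) order for comparison, which is the slower variant on Fermi.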

Related

TBB Parallel Pipeline: Filter Timing Inconsistent

I'm programming an application that processes a video stream using a `tbb::parallel_pipeline`. My first filter contains two important operations, one of which must occur immediately after the other.
My tests show that the delay between two operations is anywhere from 3 to 20 milliseconds when I set max_number_of_live_tokens to 6 (# of filters I have) but is consistently 3 to 4 milliseconds when max_number_of_live_tokens is 1. The jitter in the first case is unacceptable for my application, but I need to allow multiple tokens to be in flight simultaneously to exploit parallelism.
Here is my pipeline setup:
tbb::parallel_pipeline(6, //max_number_of_live_tokens
    // 1st Filter
    tbb::make_filter< void, shared_ptr<PipelinePacket_t> >(tbb::filter::serial_in_order,
        [&](tbb::flow_control& fc) -> shared_ptr<PipelinePacket_t>
        {
            shared_ptr<PipelinePacket_t> pPacket = grabFrame();
            return pPacket;
        }
    )
    &
    ... // 5 other filters that process the image - all 'serial_in_order'
);
And here is my grabFrame() function:
shared_ptr<VisionPipeline::PipelinePacket_t> VisionPipeline::grabFrame() {
    shared_ptr<PipelinePacket_t> pPacket(new PipelinePacket_t);
    m_cap >> pPacket->frame;                       // Operation A (use opencv api to capture frame)
    pPacket->motion.gyroDeg = m_imu.getGyroZDeg(); // Operation B (read a gyro value)
    return pPacket;
}
I need operations A and B to happen as close as possible to each other (so that the gyro value reflects its value at the time the frame was captured).
My guess is that the jitter that occurs when multiple tokens are in flight simultaneously is caused by tasks from other filters running on the same thread as the first filter and interrupting it while grabFrame() is executing. I've dug through some TBB documentation but can't find anything on how parallel_pipeline breaks filters up into tasks, so it is unclear to me whether TBB is somehow breaking grabFrame() up into multiple TBB tasks or not.
Is this assumption correct? If so, how can I tell tbb not to interrupt the first filter between operations A and B with other tasks?
OpenCV is using TBB internally itself, for various operations. So if this is actually related to TBB, it's not that you were interrupted between A and B, but rather that OpenCV itself is fighting for priority with the remainder of the filter chain. Unlikely though.
so it is unclear to me if TBB is somehow breaking up grabFrame() into multiple TBB tasks or not.
That is never happening. Unless there are parts in there explicitly dispatching via TBB, it has no effect whatsoever. TBB is not magically splitting your functions into tasks.
But that may not even be your only issue. If your filters happen to be heavy on memory bandwidth, it's likely the case that you are slowing down the actual capture process significantly just by concurrent execution of the image processing.
Looks like you are running the full image through 5 filters in a row, is that correct? Full resolution, not tiled? If so, most of these filters are likely not ALU constrained but rather memory-bandwidth constrained, as you are not staying within CPU cache bounds either.
If you wish to go parallel, you must get rid of the write-backs to main memory between the filter stages. The only way to do that is to either start tiling the images in the input filter of the filter chain, or to write a custom all-in-one filter kernel. If you have filters in that chain with spatial dependencies, that's obviously not as easy as I make it sound; then you have to include some overlap in the upper stages.
max_number_of_live_tokens then actually has a real meaning: it's the number of "tiles" in flight. It is not primarily intended to limit each filter stage to only one concurrent execution (that's not happening anyway), but rather to keep the maximum working-set size under control.
E.g. if you know that each of your tiles is now 128kB in size, you know that there are 2 copies involved in each filter (source and destination), and you know you have a 2MB L3 cache, then you know that you can afford to have 8 tokens in flight without spilling to main memory. If you also happen to have (at least) 8 CPU cores, that yields ideal throughput, but even if you don't, at least you are not risking becoming bottlenecked by exceeding the cache size. Of course you can afford some spilling to main memory (past what you calculated to be safe), but then you have to perform in-depth profiling of your system to see whether you are getting constrained.
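A small sketch of that token-budget arithmetic, using the example numbers above (illustrative values, not measurements):

// Token budget: how many tiles can be in flight before spilling out of L3.
#include <cstddef>
constexpr std::size_t tileBytes       = 128 * 1024;        // one tile: 128 kB
constexpr std::size_t copiesPerFilter = 2;                 // source + destination buffer
constexpr std::size_t l3Bytes         = 2 * 1024 * 1024;   // 2 MB L3 cache
constexpr std::size_t maxTokens       = l3Bytes / (tileBytes * copiesPerFilter);  // = 8
// maxTokens is the value you would then pass as max_number_of_live_tokens.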

Concurrency of cuFFT streams

So I am using cuFFT combined with the CUDA stream feature. The problem I have is that I can't seem to make the cuFFT kernels run in full concurrency. The following are the results I have from nvvp. Each of the streams is running a kernel of 2D batch FFT on 128 images of size 128x128. I set up 3 streams to run 3 independent FFT batch plans.
As can be seen from the figure, some memory copies (yellow bars) were concurrent with some kernel computations (purple, brown and pink bars). But the kernel runs were not concurrent at all. As you can see, each kernel strictly followed the previous one. The following is the code I used for the memory copies to the device and the kernel launches.
for (unsigned int j = 0; j < NUM_IMAGES; j++ ) {
    gpuErrchk( cudaMemcpyAsync( dev_pointers_in[j],
                                image_vector[j],
                                NX*NY*NZ*sizeof(SimPixelType),
                                cudaMemcpyHostToDevice,
                                streams_fft[j]) );
    gpuErrchk( cudaMemcpyAsync( dev_pointers_out[j],
                                out,
                                NX*NY*NZ*sizeof(cufftDoubleComplex),
                                cudaMemcpyHostToDevice,
                                streams_fft[j] ) );
    cufftExecD2Z( planr2c[j],
                  (SimPixelType*)dev_pointers_in[j],
                  (cufftDoubleComplex*)dev_pointers_out[j]);
}
Then I changed my code so that I finished all the memory copies (synchronized) and sent all kernels to the streams at once, and I got the following profiling result:
Then it was confirmed that the kernels were not running concurrently.
I looked at one link which explains in detail how to set up full concurrency by either passing the "--default-stream per-thread" command line argument or defining CUDA_API_PER_THREAD_DEFAULT_STREAM before including the CUDA headers in your code. It is a feature introduced in CUDA 7. I ran the sample code from the above link on my 15-inch MacBook Pro Retina with a GeForce GT 750M (the same machine used as in the above link), and I was able to get concurrent kernel runs. But I was not able to get my cuFFT kernels running in parallel.
Then I found this link, where someone says that a cuFFT kernel will occupy the whole GPU, so no two cuFFT kernels can run in parallel. Then I was stuck, since I couldn't find any formal documentation addressing whether cuFFT enables concurrent kernels. Is this true? Is there a way to get around this?
I assume you called cufftSetStream() prior to the code you have shown, appropriate for each planr2c[j], so that each plan is associated with a separate stream. I don't see it in the code you posted. If you actually want cufft kernels to overlap with other cufft kernels, it's necessary for those kernels to be launched to separate streams. So the cufft exec call for image 0 would have to be launched into a different stream than the cufft exec call for image 1, for example.
In order for any two CUDA operations to have the possibility to overlap, they must be launched into different streams.
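A minimal sketch of that association, assuming the planr2c[] and streams_fft[] arrays from the question's code have already been created:

// Associate each cuFFT plan with its own stream before the exec calls,
// so that cufftExecD2Z(planr2c[j], ...) is issued into streams_fft[j].
for (unsigned int j = 0; j < NUM_IMAGES; j++) {
    cufftSetStream(planr2c[j], streams_fft[j]);
}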
Having said that, concurrent memory copies with kernel execution, but not concurrent kernels, is about what I would expect for reasonably sized FFTs.
A 128x128 FFT to a first order approximation will spin up ~15,000 threads, so if my thread blocks are ~500 threads each, that would be 30 threadblocks, which will keep a GPU fairly occupied, leaving not much "room" for additional kernels. (You can actually discover the total blocks and threads for a kernel in the profiler itself.) Your GT750m probably has 2 Kepler SMs with a maximum of 16 blocks per SM so a max instantaneous capacity of 32 blocks. And this capacity number could be reduced for a specific kernel due to shared memory usage, register usage, or other factors.
The instantaneous capacity of whatever GPU you are running on (max blocks per SM * number of SMs) will determine the potential for overlap (concurrency) of kernels. If you exceed that capacity with a single kernel launch, then that will "fill" the GPU, preventing kernel concurrency for some time period.
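As a rough sketch of that capacity estimate (cudaDeviceProp::maxBlocksPerMultiProcessor is only reported by newer CUDA toolkits; for older ones, substitute the documented limit for your compute capability, e.g. 16 for Kepler):

// Instantaneous block capacity = number of SMs * max resident blocks per SM.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int capacity = prop.multiProcessorCount * prop.maxBlocksPerMultiProcessor;
// A single kernel launch with more than 'capacity' blocks "fills" the GPU and
// prevents kernel concurrency until enough of its blocks have drained.
// Per-kernel limits (registers, shared memory) can reduce the effective number further.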
It should be theoretically possible for CUFFT kernels to run concurrently. But just like any kernel concurrency scenario, CUFFT or otherwise, the resource usage of those kernels would have to be pretty low to actually witness concurrency. Typically when you have low resource usage, it implies kernels with a relatively small number of threads/threadblocks. These kernels don't usually take long to execute, making it even more difficult to actually witness concurrency (because launch latency and other latency factors may get in the way). The easiest way to witness concurrent kernels is to have kernels with unusually low resource requirements combined with unusually long run times. This is generally not the typical scenario, for CUFFT kernels or any other kernels.
Overlap of copy and compute is still a useful feature of streams with CUFFT. And the concurrency idea, without an underlying understanding of machine capacity and resource constraints, is somewhat unreasonable in itself. For example, if kernel concurrency were arbitrarily achievable ("I should be able to make any 2 kernels run concurrently"), without consideration of capacity or resource specifics, then after you get two kernels running concurrently, the next logical step would be to go to 4, 8, 16 kernels concurrently. But the reality is that the machine can't handle that much work simultaneously. Once you've exposed enough parallelism (loosely translated as "enough threads") in a single kernel launch, exposing additional work parallelism via additional kernel launches normally cannot make the machine run any faster or process the work more quickly.

OpenCL: Running parallel tasks on a data parallel kernel

I'm currently reading up on the OpenCL framework for my thesis work. What I've come across so far is that you can run kernels either data parallel or task parallel. Now I've got a question and I can't manage to find the answer.
Q: Say that you have a vector that you want to sum up. You can do that in OpenCL by writing a kernel for a data parallel process and just running it. Fairly simple.
However, now say that you have 10+ different vectors that need to be summed up also. Is it possible to run these 10+ different vectors in task parallel, while still using a kernel that processes them as "data parallel"?
So you would basically parallelize tasks, each of which is itself run in parallel? Because what I've come to understand is that you can EITHER run the tasks in parallel, or run one task itself in parallel.
The whole task-parallel/data-parallel distinction in OpenCL was a mistake. We deprecated clEnqueueTask in OpenCL 2.0 because it had no meaning.
All enqueued entities in OpenCL can be viewed as tasks. Those tasks may be run concurrently, they may be run in parallel, they may be serialized. You may need multiple queues to run them concurrently, or a single out-of-order queue, this is all implementation-defined to be fully flexible.
Those tasks may be data-parallel, if they are made of multiple work-items working on different data elements within the same task. They may not be, consisting of only one work-item. This last definition is what clEnqueueTask used to provide - however, because it had no meaning whatsoever compared with clEnqueueNDRangeKernel with a global size of (1,1,1), and it was not checked against anything in the kernel code, deprecating it was the safer option.
So yes, if you enqueue multiple NDRanges, you can have multiple tasks in parallel, each one of which is data-parallel.
You can also copy all of those vectors at once inside one data-parallel kernel, if you are careful with the way you pass them in. One option would be to launch a range of work-groups, where each one iterates through a single vector, copying it (that might well be the fastest way on a CPU, for cache prefetching reasons). You could have each work-item copy one element using some complex lookup to see which vector to copy from, but that would likely have high overhead. Or you can just launch multiple parallel kernels, one for each vector, and have the runtime decide whether it can run them together.
If your 10+ different vectors are close to the same size, it becomes a data parallel problem.
The task parallel nature of OpenCL is more suited for CPU implementations. GPUs are more suited for data parallel work. Some high-end GPUs can have a handful of kernels in-flight at once, but their real efficiency is in large data parallel jobs.

Intel TBB overhead issue

I'm using Intel TBB to parallelize some parts of an algorithm that processes images. Although the processing for each pixel is data dependent, there are some cases in which 2 consecutive pixels can be processed in parallel, as below.
ProcessImage(image)
    for each row in image                  // Create and wait root task for each line here using allocate_root()
        ProcessRow(row)
        for each 2 pixels
            if (parallel())
                ProcessPixel(A) and ProcessPixel(B) in parallel   // For testing, create and process 2 tbb::empty_task() here as child tasks
            else
                ProcessPixel(A)
                ProcessPixel(B)
However, overhead occurs because this processing is very fast. For each input image (size 512x512), the processing costs about 5-6 ms.
When I experimentally used Intel TBB as in the comments above, the processing costs more than 25 ms.
So is there a better way to use Intel TBB without this overhead, or another, more efficient way to improve the performance of a simple and fast processing program like this?
TBB does not add such big (~20 ms) overheads for the invocation of a parallel algorithm. My guess (since no specifics are provided) is that it is related to one of the following:
If you measure only the first invocation, it includes the overhead of worker thread creation. (And note that TBB does not have barriers like OpenMP, so one call to parallel_for might not be enough to create all the threads.)
The same situation happens after worker threads go to sleep because there is no parallel work for them. The overhead for the wakeup is orders of magnitude lower than for thread creation, but it can still affect measurements and lead to wrong conclusions.
The TBB scheduler can steal a task from the outer level into the nested level (a blocking call), so the measurements will make it look like processing the nested part alone takes too long, while the thread is actually busy with extra work there.
There is contention when processing (A) and (B) in parallel, caused by either explicit (e.g. a mutex) or implicit (e.g. false sharing) reasons. In any case, this is not TBB-specific.
Thus, the advice for performance measurements with TBB is to consider only the total time for a long enough sequence of computations that hides the initialization overheads.
And of course, as was advised, parallelize first on the outer level. TBB provides enough different patterns for that, including tbb::parallel_pipeline and tbb::flow::graph.

CUDA Stream Processing for multiple kernels - Disambiguation

Hi, a few questions regarding CUDA stream processing for multiple kernels.
Assume s streams and a kernel on a compute capability 3.5 Kepler device, where s <= 32.
The kernel uses a dev_input array of size n and a dev_output array of size s*n.
The kernel reads data from the input array, stores the value in a register, manipulates it and writes its result back to dev_output at position s*n + tid.
We aim to run the same kernel s times, using one of the s streams each time, similar to the simpleHyperQ example. Can you comment on whether and how any of the following affect concurrency, please?
1. dev_input and dev_output are not pinned;
2. dev_input as it is vs. dev_input of size s*n, where each kernel reads unique data (no read conflicts);
3. kernels read data from constant memory;
4. 10 kB of shared memory are allocated per block;
5. the kernel uses 60 registers.
Any good comments will be appreciated...!!!
cheers,
Thanasio
Robert,
thanks a lot for your detailed answer. It has been very helpful. I edited 4; it is 10 kB per block. So in my situation, I launch grids of 61 blocks and 256 threads. The kernels are rather computationally bound. I launch 8 streams of the same kernel. I profile them and then I see a very good overlap between the first two, and then it gets worse and worse. The kernel execution time is around 6 ms. After the first two streams execute almost perfectly concurrently, the rest have a 3 ms distance between them. Regarding 5, I use a K20, which has a 255-registers-per-thread limit, so I would not expect drawbacks from there. I really cannot understand why I do not achieve concurrency equivalent to what is specified for GK110s.
Please take a look at the following link. There is an image called kF.png. It shows the profiler output for the streams..!!!
https://devtalk.nvidia.com/default/topic/531740/cuda-programming-and-performance/concurrent-streams-and-hyperq-for-k20/
Concurrency amongst kernels depends upon a number of factors, but one that many people overlook is simply the size of the kernel (i.e. number of blocks in the grid.) Kernels that are of a size that can effectively utilize the GPU by themselves will not generally run concurrently to a large degree, and there would be little throughput advantage even if they did. The work distributor inside the GPU will generally begin distributing blocks as soon as a kernel is launched, so if one kernel is launched before another, and both have a large number of blocks, then the first kernel will generally occupy the GPU until it is nearly complete, at which point blocks of the second kernel will then get scheduled and executed, perhaps with a small amount of "concurrent overlap".
The main point is that kernels that have enough blocks to "fill up the GPU" will prevent other kernels from actually executing, and apart from scheduling, this isn't any different on a compute 3.5 device. In addition, rather than just specifying a few parameters for the kernel as a whole, specifying launch parameters and statistics (such as register usage, shared mem usage, etc.) at the block level is helpful for providing crisp answers. The benefits of the compute 3.5 architecture in this area will still mainly come from "small" kernels of "few" blocks attempting to execute together. Compute 3.5 has some advantages there.
You should also review the answer to this question.
When the memory used by the kernel is not pinned, it affects the speed of data transfer and also the ability to overlap copy and compute, but it does not affect the ability of two kernels to execute concurrently. Nevertheless, the limitation on copy and compute overlap may skew the behavior of your application.
There shouldn't be "read conflicts", I'm not sure what you mean by that. Two independent threads/blocks/grids are allowed to read the same location in global memory. Generally this will get sorted out at the L2 cache level. As long as we are talking about just reads there should be no conflict, and no particular effect on concurrency.
Constant memory is a limited resource, shared amongst all kernels executing on the device (try running deviceQuery). If you have not exceeded the total device limit, then the only issue will be one of utilization of the constant cache, and things like cache thrashing. Apart from this secondary relationship, there is no direct effect on concurrency.
It would be more instructive to identify the amount of shared memory per block rather than per kernel. This will directly affect how many blocks can be scheduled on a SM. But answering this question would be much crisper also if you specified the launch configuration of each kernel, as well as the relative timing of the launch invocations. If shared memory happened to be the limiting factor in scheduling, then you can divide the total available shared memory per SM by the amount used by each kernel, to get an idea of the possible concurrency based on this. My own opinion is that number of blocks in each grid is likely to be a bigger issue, unless you have kernels that use 10k per grid but only have a few blocks in the whole grid.
My comments here would be nearly the same as my response to 4. Take a look at deviceQuery for your device, and if registers became a limiting factor in scheduling blocks on each SM, then you could divide available registers per SM by the register usage per kernel (again, it makes a lot more sense to talk about register usage per block and the number of blocks in the kernel) to discover what the limit might be.
Again, if you have reasonable sized kernels (hundreds or thousands of blocks, or more) then the scheduling of blocks by the work distributor is most likely going to be the dominant factor in the amount of concurrency between kernels.
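If you would rather not do those shared-memory and register divisions by hand, here is a hedged sketch using the occupancy API (available in CUDA 6.5 and later; myKernel, threadsPerBlock and dynamicSmemBytes are placeholders for your own kernel and launch configuration):

// Ask the runtime how many blocks of this kernel fit on one SM, accounting for
// registers, shared memory and block size, then scale by the number of SMs.
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                              threadsPerBlock, dynamicSmemBytes);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int maxResidentBlocks = blocksPerSM * prop.multiProcessorCount;
// Compare maxResidentBlocks with the total block count of your concurrent kernel
// launches to estimate how much overlap is even possible.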
EDIT: In response to new information posted in the question, I've looked at kF.png.
First let's analyze from a blocks per SM perspective. CC 3.5 allows 16 "open" or currently scheduled blocks per SM. If you are launching 2 kernels of 61 blocks each, that may well be enough to fill the "ready-to-go" queue on the CC 3.5 device. Stated another way, the GPU can handle 2 of these kernels at a time. As the blocks of one of those kernels "drains" then another kernel is scheduled by the work distributor. The blocks of the first kernel "drain" sufficiently in about half the total time, so that the next kernel gets scheduled about halfway through the completion of the first 2 kernels, so at any given point (draw a vertical line on the timeline) you have either 2 or 3 kernels executing simultaneously. (The 3rd kernel launched overlaps the first 2 by about 50% according to the graph, I don't agree with your statement that there is a 3ms distance between each successive kernel launch). If we say that at peak we have 3 kernels scheduled (there are plenty of vertical lines that will intersect 3 kernel timelines) and each kernel has ~60 blocks, then that is about 180 blocks. Your K20 has 13 SMs and each SM can have at most 16 blocks scheduled on it. This means at peak you have about 180 blocks scheduled (perhaps) vs. a theoretical peak of 16*13 = 208. So you're pretty close to max here, and there's not much more that you could possibly get. But maybe you think you're only getting 120/208, I don't know.
Now let's take a look from a shared memory perspective. A key question is what is the setting of your L1/shared split? I believe it defaults to 48KB of shared memory per SM, but if you've changed this setting that will be pretty important. Regardless, according to your statement each block scheduled will use 10KB of shared memory. This means we would max out around 4 blocks scheduled per SM, or 4*13 total blocks = 52 blocks max that can be scheduled at any given time. You're clearly exceeding this number, so probably I don't have enough information about the shared memory usage by your kernels. If you're really using 10kb/block, this would more or less preclude you from having more than one kernel's worth of threadblocks executing at a time. There could still be some overlap, and I believe this is likely to be the actual limiting factor in your application. The first kernel of 60 blocks gets scheduled. After a few blocks drain (or perhaps because the 2 kernels were launched close enough together) the second kernel begins to get scheduled, so nearly simultaneously. Then we have to wait a while for about a kernel's worth of blocks to drain before the 3rd kernel can get scheduled, this may well be at the 50% point as indicated in the timeline.
Anyway I think the analyses 1 and 2 above clearly suggest you're getting most of the capability out of the device, based on the limitations inherent in your kernel structure. (We could do a similar analysis based on registers to discover if that is a significant limiting factor.) Regarding this statement: "I really cannot understand why i do not achieve concurrency equivalent to what is specified for gk110s.." I hope you see that the concurrency spec (e.g. 32 kernels) is a maximum spec, and in most cases you are going to run into some other kind of machine limit before you hit the limit on the maximum number of kernels that can execute simultaneously.
EDIT: regarding documentation and resources, the answer I linked to above from Greg Smith provides some resource links. Here are a few more:
The CUDA C Programming Guide has a section on Asynchronous Concurrent Execution.
GPU Concurrency and Streams presentation by Dr. Steve Rennich at NVIDIA is on the NVIDIA webinar page
My experience with HyperQ so far is 2-3x parallelization of my kernels (on compute 3.5), as the kernels usually are larger for somewhat more complex calculations. With small kernels it's a different story, but usually the kernels are more complicated.
This is also stated by NVIDIA in their CUDA 5.0 documentation: more complex kernels will reduce the amount of parallelization.
But still, GK110 has a great advantage just in allowing this.