Simultaneous launch of Multiple Kernels using CUDA for a GPU - concurrency

Is it possible to launch two kernels that do independent tasks simultaneously? For example, if I have this CUDA code:
// host and device initialization
.......
.......
// launch kernel1
myMethod1 <<<.... >>> (params);
// launch kernel2
myMethod2 <<<.....>>> (params);
Assuming that these kernels are independent, is there a facility to launch them at the same time, allocating a few grids/blocks to each? Do CUDA or OpenCL have this provision?

Only devices with CUDA compute capability 2.0 and better (i.e. Fermi) can support multiple simultaneous kernel executions. See section 3.2.6.3 of the CUDA 3.0 programming guide, which states:
Some devices of compute capability 2.0 can execute multiple kernels concurrently. Applications may query this capability by calling cudaGetDeviceProperties() and checking the concurrentKernels property.
The maximum number of kernel launches that a device can execute concurrently is four.
A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.
Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.
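A minimal way to query that property at runtime (a sketch using the standard runtime API; device 0 is just an example):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("%s: concurrentKernels = %d\n", prop.name, prop.concurrentKernels);
    return 0;
}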

You will need SM 2.0 or above for concurrent kernels.
To get concurrent execution you need to manually indicate that there is no dependence between the two kernels. This is because the compiler cannot determine that one kernel will not modify data being used by the other; that could be as simple as both reading from and writing to the same buffer, but it is actually much harder to detect in general, since there can be pointers inside data structures and so on.
To express the independence, launch the kernels in different streams; the fourth parameter in the triple-chevron syntax specifies the stream. Check out the Programming Guide or the SDK concurrentKernels sample.
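As a sketch (the kernel names and arguments come from the question; the launch configurations are placeholders, not a tested setup):
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// The fourth launch parameter selects the stream (the third is dynamic shared memory, 0 here).
myMethod1<<<grid1, block1, 0, s1>>>(params);
myMethod2<<<grid2, block2, 0, s2>>>(params);

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);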

CUDA compute capability 2.1 = up to 16 concurrent kernels

Related

Concurrently run two different algorithms with OpenCL

I want to run two different algorithms on the same device at the same time assuming that my device has 2 compute units. Is that possible by just creating 2 different kernels, 2 programs and 2 command queues?
I've tried to test this, but it seems like the second kernel doesn't execute, so I'm wondering if this is even possible?
In the Nvidia OpenCL Programming Guide, I've read that:
"For devices of compute capability 2.x and higher, multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using streams to enable enough kernels to execute concurrently."
Now I'm not sure if this really means I can run multiple DIFFERENT kernels or just multiple instances of the same kernel (and those would be just simple old work items).
I've also checked that my Nvidia GeForce GT 525M has a compute capability of 2.1.
Yes, it's perfectly legal to have multiple command queues on the same device, running different kernels.
Both kernels should definitely execute, even if the device doesn't support concurrent kernel execution (then they will just run sequentially). You may have a bug in your code. Remember that you will have to clFinish both command queues independently (or have an event for each kernel invocation).
What may be hard is getting the two kernels to actually run concurrently. Even if the device supports that behavior in hardware, the kernels could still run sequentially, if any of the resources are not sufficient to run them simultaneously.
If each of your two algorithms is by itself parallel enough to use most of the device's resources, it's easier to just run them one after another in the same command queue than to try to parallelize their execution.
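For reference, a minimal host-side sketch of the two-queue setup (assuming kernelA and kernelB have already been built from your two programs, and globalA/globalB are your global work sizes):
// Two in-order queues on the same device, each running a different kernel.
cl_int err;
cl_command_queue qA = clCreateCommandQueue(context, device, 0, &err);
cl_command_queue qB = clCreateCommandQueue(context, device, 0, &err);

err = clEnqueueNDRangeKernel(qA, kernelA, 1, NULL, &globalA, NULL, 0, NULL, NULL);
err = clEnqueueNDRangeKernel(qB, kernelB, 1, NULL, &globalB, NULL, 0, NULL, NULL);

// Each queue must be finished (or waited on via events) independently.
clFinish(qA);
clFinish(qB);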

Concurrency of cuFFT streams

So I am using cuFFT combined with the CUDA stream feature. The problem I have is that I can't seem to make the cuFFT kernels run in full concurrency. The following are the results I have from nvvp. Each stream runs a batched 2D FFT kernel on 128 images of size 128x128. I set up 3 streams to run 3 independent batched FFT plans.
As can be seen from the figure, some memory copies (yellow bars) were concurrent with some kernel computations (purple, brown and pink bars), but the kernel runs were not concurrent at all; each kernel strictly followed the previous one. The following is the code I used for the memory copies to the device and the kernel launches.
for (unsigned int j = 0; j < NUM_IMAGES; j++) {
    gpuErrchk( cudaMemcpyAsync( dev_pointers_in[j],
                                image_vector[j],
                                NX*NY*NZ*sizeof(SimPixelType),
                                cudaMemcpyHostToDevice,
                                streams_fft[j] ) );
    gpuErrchk( cudaMemcpyAsync( dev_pointers_out[j],
                                out,
                                NX*NY*NZ*sizeof(cufftDoubleComplex),
                                cudaMemcpyHostToDevice,
                                streams_fft[j] ) );
    cufftExecD2Z( planr2c[j],
                  (SimPixelType*)dev_pointers_in[j],
                  (cufftDoubleComplex*)dev_pointers_out[j] );
}
Then I changed my code so that all memory copies finish (synchronize) before all the kernels are sent to their streams at once, and I got the following profiling result:
That confirmed that the kernels were still not running concurrently.
I looked at one link which explains in detail how to set up full concurrency by either passing the "--default-stream per-thread" command-line argument or adding #define CUDA_API_PER_THREAD_DEFAULT_STREAM before the CUDA #include in your code. It is a feature introduced in CUDA 7. I ran the sample code from that link on my MacBook Pro Retina 15" with a GeForce GT 750M (the same machine used in that link), and I was able to get concurrent kernel runs, but I was not able to get my cuFFT kernels running in parallel.
Then I found this link where someone says that a cuFFT kernel will occupy the whole GPU, so no two cuFFT kernels can run in parallel. Then I was stuck, since I didn't find any formal documentation addressing whether CUFFT enables concurrent kernels. Is this true? Is there a way to get around it?
I assume you called cufftSetStream() prior to the code you have shown, appropriately for each planr2c[j], so that each plan is associated with a separate stream; I don't see it in the code you posted. If you actually want cufft kernels to overlap with other cufft kernels, those kernels must be launched into separate streams. So the cufft exec call for image 0 would have to be launched into a different stream than the cufft exec call for image 1, for example.
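For example, a sketch of associating each plan with its stream before the loop shown in the question (plan and stream names taken from that code):
// Associate each batched plan with its own stream so that the kernels
// launched by cufftExecD2Z() for that plan go into that stream.
for (unsigned int j = 0; j < NUM_IMAGES; j++) {
    cufftSetStream(planr2c[j], streams_fft[j]);
}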
In order for any two CUDA operations to have the possibility to overlap, they must be launched into different streams.
Having said that, concurrent memory copies with kernel execution, but not concurrent kernels, is about what I would expect for reasonably sized FFTs.
A 128x128 FFT to a first order approximation will spin up ~15,000 threads, so if my thread blocks are ~500 threads each, that would be 30 threadblocks, which will keep a GPU fairly occupied, leaving not much "room" for additional kernels. (You can actually discover the total blocks and threads for a kernel in the profiler itself.) Your GT750m probably has 2 Kepler SMs with a maximum of 16 blocks per SM so a max instantaneous capacity of 32 blocks. And this capacity number could be reduced for a specific kernel due to shared memory usage, register usage, or other factors.
The instantaneous capacity of whatever GPU you are running on (max blocks per SM * number of SMs) will determine the potential for overlap (concurrency) of kernels. If you exceed that capacity with a single kernel launch, then that will "fill" the GPU, preventing kernel concurrency for some time period.
It should be theoretically possible for CUFFT kernels to run concurrently. But just like any kernel concurrency scenario, CUFFT or otherwise, the resource usage of those kernels would have to be pretty low to actually witness concurrency. Typically when you have low resource usage, it implies kernels with a relatively small number of threads/threadblocks. These kernels don't usually take long to execute, making it even more difficult to actually witness concurrency (because launch latency and other latency factors may get in the way). The easiest way to witness concurrent kernels is to have kernels with unusually low resource requirements combined with unusually long run times. This is generally not the typical scenario, for CUFFT kernels or any other kernels.
Overlap of copy and compute is still a useful feature of streams with CUFFT. And the concurrency idea, without a basis of understanding of the machine's capacity and resource constraints, is somewhat unreasonable in itself. For example, if kernel concurrency were arbitrarily achievable ("I should be able to make any 2 kernels run concurrently"), without consideration of capacity or resource specifics, then after you get two kernels running concurrently, the next logical step would be to go to 4, 8, or 16 concurrent kernels. But the reality is that the machine can't handle that much work simultaneously. Once you've exposed enough parallelism (loosely translated as "enough threads") in a single kernel launch, exposing additional work parallelism via additional kernel launches normally cannot make the machine run any faster or process the work more quickly.

Multiple CUDA contexts for one device - any sense?

I thought I had a grasp of this but apparently I do not. :) I need to perform parallel H.264 stream encoding with NVENC from frames that are not in any of the formats accepted by the encoder, so I have the following code pipeline:
A callback informing that a new frame has arrived is called
I copy the frame to CUDA memory and perform the needed color space conversions (only the first cuMemcpy is synchronous, so I can return from the callback; all pending operations are pushed into a dedicated stream)
I push an event onto the stream and have another thread waiting for it; as soon as it is set, I take the CUDA memory pointer with the frame in the correct color space and feed it to the encoder
For some reason I had assumed that I need a dedicated context for each thread if I perform this pipeline in parallel threads. The code was slow, and after some reading I understood that context switching is actually expensive, and I then came to the conclusion that it makes no sense, since a context owns the whole GPU and I would lock out any parallel processing from the other transcoder threads.
Question 1: In this scenario am I good with using a single context and an explicit stream created on this context for each thread that performs the mentioned pipeline?
Question 2: Can someone enlighten me on what is the sole purpose of the CUDA device context? I assume it makes sense in a multiple GPU scenario, but are there any cases where I would want to create multiple contexts for one GPU?
Question 1: In this scenario am I good with using a single context and an explicit stream created on this context for each thread that performs the mentioned pipeline?
You should be fine with a single context.
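As a sketch of that arrangement with the runtime API (the worker function and its contents are hypothetical placeholders for your pipeline):
#include <cuda_runtime.h>

// Each transcoder thread creates its own stream; all streams share the
// single process-wide context managed implicitly by the runtime API.
void transcodeWorker() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // ... cudaMemcpyAsync(..., stream), color-conversion kernels launched into
    // this stream, cudaEventRecord(event, stream) for the waiting thread ...
    cudaStreamDestroy(stream);
}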
Question 2: Can someone enlighten me on what is the sole purpose of the CUDA device context? I assume it makes sense in a multiple GPU scenario, but are there any cases where I would want to create multiple contexts for one GPU?
The CUDA device context is discussed in the programming guide. It represents all of the state (memory map, allocations, kernel definitions, and other state-related information) associated with a particular process (i.e. associated with that particular process' use of a GPU). Separate processes will normally have separate contexts (as will separate devices), as these processes have independent GPU usage and independent memory maps.
If you have multi-process usage of a GPU, you will normally create multiple contexts on that GPU. As you've discovered, it's possible to create multiple contexts from a single process, but not usually necessary.
And yes, when you have multiple contexts, kernels launched in those contexts will require context switching to go from one kernel in one context to another kernel in another context. Those kernels cannot run concurrently.
CUDA runtime API usage manages contexts for you. You normally don't explicitly interact with a CUDA context when using the runtime API. However, in driver API usage, the context is explicitly created and managed.
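For completeness, a minimal sketch of explicit context handling with the driver API (error checking omitted; device 0 is just an example):
#include <cuda.h>

int main() {
    CUdevice dev;
    CUcontext ctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);        // pick device 0
    cuCtxCreate(&ctx, 0, dev);   // explicit context creation (driver API only)
    // ... cuMemAlloc, cuLaunchKernel, etc. all run against this context ...
    cuCtxDestroy(ctx);
    return 0;
}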
Obviously a few years have passed, but NVENC/NVDEC now appear to have CUstream support as of version 9.1 (circa September 2019) of the video codec SDK: https://developer.nvidia.com/nvidia-video-codec-sdk/download
NEW to 9.1- Encode: CUStream support in NVENC for enhanced parallelism between CUDA pre-processing and NVENC encoding
I'm super new to CUDA, but my basic understanding is that CUcontexts allow multiple processes to use the GPU (by doing context swaps that interrupt each other's work), while CUstreams allow for a coordinated sharing of the GPU's resources from within a single process.

CUDA atomic operations and concurrent kernel launch

Currently I am developing a GPU-based program that uses multiple kernels launched concurrently via multiple streams.
In my application, multiple kernels need to access a queue/stack, and I plan to use atomic operations.
But I do not know whether atomic operations work between multiple concurrently launched kernels.
I would appreciate help from anyone who knows the exact mechanism of atomic operations on the GPU or who has experience with this issue.
Atomics are implemented in the L2 cache hardware of the GPU, through which all memory operations must pass. There is no hardware to ensure coherency between host and device memory, or between different GPUs; but as long as the kernels are running on the same GPU and using device memory on that GPU to synchronize, atomics will work as expected.
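A minimal sketch of that situation (two trivial kernels in separate streams atomically updating the same counter in device memory; the kernel bodies are placeholders for real queue/stack operations):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void producerA(int *count) { atomicAdd(count, 1); }   // e.g. push one item
__global__ void producerB(int *count) { atomicAdd(count, 1); }

int main() {
    int *d_count, h_count = 0;
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Even if these launches overlap, the atomics on d_count are serialized correctly.
    producerA<<<32, 128, 0, s1>>>(d_count);
    producerB<<<32, 128, 0, s2>>>(d_count);

    cudaDeviceSynchronize();
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("count = %d\n", h_count);   // expect 2 * 32 * 128
    return 0;
}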

Windows multitasking breaks OpenCL performance

I'm writing a Qt application with a simple idea: there are several OpenCL-capable devices, and each of them gets its own control thread which prepares data, executes the OpenCL kernel and processes the results. The OpenCL code is actually a bitcoin mining kernel (for now it's this one, but it doesn't matter).
When working with 2 GPUs everything is OK.
When I use a GPU and the CPU there is a problem: the CPU works at a reasonable speed, but the GPU slows down to zero performance.
There is no such problem under Linux. Under Windows, poclbm behaves in the same way: when starting multiple instances (1 for the GPU, 1 for the CPU), GPU performance is 0.
I'm not sure which part of the code would be helpful to post. I can only mention that the thread is a QThread subclass with run() reimplemented as a busy loop, while( !_stop ) { mineBitcoins(); }. The logic of that loop is pretty much copied from poclbm's BitcoinMiner::mining_thread (here).
In which direction should I dig? Thanks.
upd:
I'm using QtOpenCL with AMD APP SDK.
If you run the kernel on the CPU with full utilization of all cores, the threads that handle the other devices might not be able to keep up with the GPU, effectively limiting performance.
Try decreasing the number of threads running the kernel on the CPU, e.g. if your program runs on a quad-core with hyper threading, limit the threads to 7.
Don't use the host device as an OpenCL device. If you really have to, restrict the number of compute units (of the CPU used as host) allocated to CL by creating a sub-device.
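A sketch of such a sub-device, assuming an OpenCL 1.2 CPU device and the quad-core-with-hyper-threading example above (cpu_device stands for whatever CPU device ID you obtained from clGetDeviceIDs):
// Partition the CPU device so that only 7 of its compute units are used for CL.
cl_device_partition_property props[] = {
    CL_DEVICE_PARTITION_BY_COUNTS, 7,
    CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0
};
cl_device_id cpu_subdevice;
cl_uint num_created;
clCreateSubDevices(cpu_device, props, 1, &cpu_subdevice, &num_created);
// Create the CPU context and command queue on cpu_subdevice instead of cpu_device.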
I don't know if you are using both devices in the same context. But if that is the case, the problem can be memory consistency inside a context and how the different OpenCL implementations handle it.
OpenCL tries to keep the memory inside a context up to date (at least on Windows), and this can cause the GPU to continuously copy the memory it uses back to the "CPU side".
I tried that long ago, and it resulted, as in your case, in roughly zero performance on the GPU.