Concurrently run two different algorithms with OpenCL - concurrency

I want to run two different algorithms on the same device at the same time assuming that my device has 2 compute units. Is that possible by just creating 2 different kernels, 2 programs and 2 command queues?
I've tried to test this, but it seems like the second kernel doesn't execute, so I'm wondering if this is even possible?
In the Nvidia OpenCL Programming Guide, I've read that:
"For devices of compute capability 2.x and higher, multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using streams to enable enough kernels to execute concurrently."
Now I'm not sure if this really means I can run multiple DIFFERENT kernels or just multiple instances of the same kernel (and those would be just simple old work items).
I've also checked that my Nvidia GeForce GT 525M has a compute capability of 2.1.

Yes, it's perfectly legal to have multiple command queues on the same device, running different kernels.
Both kernels should definitely execute, even if the device doesn't support concurrent kernel execution (then they will just run sequentially). You may have a bug in your code. Remember that you will have to clFinish both command queues independently (or have an event for each kernel invocation).
What may be hard is getting the two kernels to actually run concurrently. Even if the device supports that behavior in hardware, the kernels could still run sequentially, if any of the resources are not sufficient to run them simultaneously.
If your two algorithms can each by themselves run sufficiently parallel to use most of the device resources, it's easier to just run them in the same command queue after one another than trying to parallelize the execution.

Related

Concurrency of cuFFT streams

I am using cuFFT together with CUDA streams. The problem is that I can't get the cuFFT kernels to run fully concurrently. The following are the results I got from nvvp. Each stream runs a batched 2D FFT kernel on 128 images of size 128x128. I set up 3 streams to run 3 independent batched FFT plans.
As can be seen from the figure, some memory copies (yellow bars) ran concurrently with some kernel computations (purple, brown and pink bars), but the kernels did not run concurrently with each other at all. As you can see, each kernel strictly followed the previous one. The following is the code I used for the memory copies to the device and the kernel launches.
for (unsigned int j = 0; j < NUM_IMAGES; j++ ) {
    gpuErrchk( cudaMemcpyAsync( dev_pointers_in[j],
                                image_vector[j],
                                NX*NY*NZ*sizeof(SimPixelType),
                                cudaMemcpyHostToDevice,
                                streams_fft[j]) );
    gpuErrchk( cudaMemcpyAsync( dev_pointers_out[j],
                                out,
                                NX*NY*NZ*sizeof(cufftDoubleComplex),
                                cudaMemcpyHostToDevice,
                                streams_fft[j] ) );
    cufftExecD2Z( planr2c[j],
                  (SimPixelType*)dev_pointers_in[j],
                  (cufftDoubleComplex*)dev_pointers_out[j]);
}
Then I changed my code so that all memory copies finish first (with a synchronize) and all kernels are then sent to their streams at once, and I got the following profiling result:
This confirmed that the kernels were not running concurrently.
I looked at one link which explains in detail how to set things up for full concurrency, either by passing the "--default-stream per-thread" command line argument to nvcc or by adding #define CUDA_API_PER_THREAD_DEFAULT_STREAM before you #include the CUDA headers in your code. It is a feature introduced in CUDA 7. I ran the sample code from that link on my MacBook Pro Retina 15" with a GeForce GT750M (the same machine as used in the link), and I was able to get concurrent kernel runs. But I was not able to get my cuFFT kernels running in parallel.
Then I found this link, where someone says that a cuFFT kernel will occupy the whole GPU, so no two cuFFT kernels can run in parallel. Now I am stuck, since I couldn't find any formal documentation addressing whether cuFFT allows concurrent kernels. Is this true? Is there a way to get around this?
I assume you called cufftSetStream() prior to the code you have shown, appropriate for each planr2c[j], so that each plan is associated with a separate stream. I don't see it in the code you posted. If you actually want cufft kernels to overlap with other cufft kernels, it's necessary for those kernels to be launched to separate streams. So the cufft exec call for image 0 would have to be launched into a different stream than the cufft exec call for image 1, for example.
In order for any two CUDA operations to have the possibility to overlap, they must be launched into different streams.
Having said that, concurrent memory copies with kernel execution, but not concurrent kernels, is about what I would expect for reasonable sized FFTs.
A 128x128 FFT to a first order approximation will spin up ~15,000 threads, so if my thread blocks are ~500 threads each, that would be 30 threadblocks, which will keep a GPU fairly occupied, leaving not much "room" for additional kernels. (You can actually discover the total blocks and threads for a kernel in the profiler itself.) Your GT750m probably has 2 Kepler SMs with a maximum of 16 blocks per SM so a max instantaneous capacity of 32 blocks. And this capacity number could be reduced for a specific kernel due to shared memory usage, register usage, or other factors.
The instantaneous capacity of whatever GPU you are running on (max blocks per SM * number of SMs) will determine the potential for overlap (concurrency) of kernels. If you exceed that capacity with a single kernel launch, then that will "fill" the GPU, preventing kernel concurrency for some time period.
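The capacity estimate above can be worked through in a few lines of host-side C++. This is only a back-of-the-envelope sketch: the figures (2 SMs, 16 resident blocks per SM, 512 threads per block, one thread per FFT element) are assumptions based on the rough numbers in this answer, and the function names are made up for illustration. Real limits also depend on per-kernel register and shared memory usage, which can only lower the capacity.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: upper bound on thread blocks resident on the GPU at once.
// Ignores register and shared-memory limits.
std::uint64_t instantaneous_capacity(std::uint64_t num_sms,
                                     std::uint64_t max_blocks_per_sm) {
    return num_sms * max_blocks_per_sm;
}

// Hypothetical helper: blocks needed if one thread handles one element (assumption).
std::uint64_t blocks_needed(std::uint64_t elements,
                            std::uint64_t threads_per_block) {
    assert(threads_per_block > 0);
    return (elements + threads_per_block - 1) / threads_per_block;  // ceiling division
}

// True if there are still free block slots while `blocks_in_flight` blocks are resident.
bool leaves_room_for_another_kernel(std::uint64_t blocks_in_flight,
                                    std::uint64_t capacity) {
    return blocks_in_flight < capacity;
}
```

With 128x128 = 16384 elements and 512 threads per block, blocks_needed gives 32, which already equals instantaneous_capacity(2, 16), so under these assumptions a single such kernel leaves no free block slots for a second one.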
It should be theoretically possible for CUFFT kernels to run concurrently. But just like any kernel concurrency scenario, CUFFT or otherwise, the resource usage of those kernels would have to be pretty low to actually witness concurrency. Typically when you have low resource usage, it implies kernels with a relatively small number of threads/threadblocks. These kernels don't usually take long to execute, making it even more difficult to actually witness concurrency (because launch latency and other latency factors may get in the way). The easiest way to witness concurrent kernels is to have kernels with unusually low resource requirements combined with unusually long run times. This is generally not the typical scenario, for CUFFT kernels or any other kernels.
Overlap of copy and compute is still a useful feature of streams with CUFFT. And the concurrency idea, without a basis of understanding of the machine capacity and resource constraints, is somewhat unreasonable in itself. For example, if kernel concurrency were arbitrarily achievable ("I should be able to make any 2 kernels run concurrently"), without consideration of capacity or resource specifics, then after you got two kernels running concurrently, the next logical step would be to go to 4, 8, 16 kernels concurrently. But the reality is that the machine can't handle that much work simultaneously. Once you've exposed enough parallelism (loosely translated as "enough threads") in a single kernel launch, exposing additional work parallelism via additional kernel launches normally cannot make the machine run any faster, or process the work quicker.

OpenCL: Running parallel tasks on a data parallel kernel

I'm currently reading up on the OpenCL framework because of reasons regarding my thesis work. And what I've come across so far is that you can either run kernels in data parallel or in task parallel. Now I've got a question and I can't manage to find the answer.
Q: Say that you have a vector that you want to sum up. You can do that in OpenCL by writing a kernel for a data parallel process and just run it. Fairly simple.
However, now say that you have 10+ different vectors that need to be summed up also. Is it possible to run these 10+ different vectors in task parallel, while still using a kernel that processes them as "data parallel"?
So you basically parallelize tasks, which in a sense are run in parallel? Because what I've come to understand is that you can EITHER run the tasks parallel, or just run one task itself in parallel.
The whole task-parallel/data-parallel distinction in OpenCL was a mistake. We deprecated clEnqueueTask in OpenCL 2.0 because it had no meaning.
All enqueued entities in OpenCL can be viewed as tasks. Those tasks may be run concurrently, they may be run in parallel, they may be serialized. You may need multiple queues to run them concurrently, or a single out-of-order queue, this is all implementation-defined to be fully flexible.
Those tasks may be data-parallel, if they are made of multiple work-items working on different data elements within the same task. They may not be, consisting of only one work-item. This last definition is what clEnqueueTask used to provide - however, because it had no meaning whatsoever compared with clEnqueueNDRangeKernel with a global size of (1,1,1), and it was not checked against anything in the kernel code, deprecating it was the safer option.
So yes, if you enqueue multiple NDRanges, you can have multiple tasks in parallel, each one of which is data-parallel.
You can also copy all of those vectors at once inside one data-parallel kernel, if you are careful with the way you pass them in. One option would be to launch a range of work-groups, each of which iterates through a single vector copying it (that might well be the fastest way on a CPU for cache prefetching reasons). You could have each work-item copy one element using some complex lookup to see which vector to copy from, but that would likely have high overhead. Or you can just launch multiple parallel kernels, one for each vector, and have the runtime decide if it can run them together.
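The per-work-item lookup mentioned above could look roughly like this. It is a host-side C++ model rather than kernel code: it assumes the vectors are packed back to back into one buffer with a prefix-sum offsets array (the layout and all names are my assumptions), and shows how a global work-item index maps back to its vector. The second function models the "one work-group per vector" option, using the summation from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: all vectors stored back to back; offsets[i] is where vector i
// starts and offsets.back() is the total element count.
struct PackedVectors {
    std::vector<float> data;
    std::vector<std::size_t> offsets;
};

PackedVectors pack(const std::vector<std::vector<float>>& vectors) {
    PackedVectors p;
    p.offsets.push_back(0);
    for (const auto& v : vectors) {
        p.data.insert(p.data.end(), v.begin(), v.end());
        p.offsets.push_back(p.data.size());
    }
    return p;
}

// The lookup: which vector does global work-item id `gid` belong to?
// Binary search over the offsets, so O(log n) work per work-item.
std::size_t vector_index_for(const PackedVectors& p, std::size_t gid) {
    assert(gid < p.data.size());
    auto it = std::upper_bound(p.offsets.begin(), p.offsets.end(), gid);
    return static_cast<std::size_t>(it - p.offsets.begin()) - 1;
}

// Model of the other option: one "work-group" per vector, each walking only its
// own slice. On the host the groups simply run one after another.
std::vector<float> sum_each(const PackedVectors& p) {
    std::vector<float> sums;
    for (std::size_t g = 0; g + 1 < p.offsets.size(); ++g) {
        float s = 0.0f;
        for (std::size_t i = p.offsets[g]; i < p.offsets[g + 1]; ++i) s += p.data[i];
        sums.push_back(s);
    }
    return sums;
}
```

The binary search is the extra per-element cost the answer warns about; the per-group version needs only two offset reads per vector.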
If your 10+ different vectors are close to the same size, it becomes a data parallel problem.
The task parallel nature of OpenCL is more suited for CPU implementations. GPUs are more suited for data parallel work. Some high-end GPUs can have a handful of kernels in-flight at once, but their real efficiency is in large data parallel jobs.

Is it possible to execute multiple instances of a CUDA program on a multi-GPU machine?

Background:
I have written a CUDA program that performs processing on a sequence of symbols. The program processes all sequences of symbols in parallel with the stipulation that all sequences are of the same length. I'm sorting my data into groups with each group consisting entirely of sequences of the same length. The program processes 1 group at a time.
Question:
I am running my code on a Linux machine with 4 GPUs and would like to utilize all 4 GPUs by running 4 instances of my program (1 per GPU). Is it possible to have the program select a GPU that isn't in use by another CUDA application to run on? I don't want to hardcode anything that would cause problems down the road when the program is run on different hardware with a greater or fewer number of GPUs.
The environment variable CUDA_VISIBLE_DEVICES is your friend.
I assume you have as many terminals open as you have GPUs. Let's say your application is called myexe
Then in one terminal, you could do:
CUDA_VISIBLE_DEVICES="0" ./myexe
In the next terminal:
CUDA_VISIBLE_DEVICES="1" ./myexe
and so on.
Then the first instance will run on the first GPU enumerated by CUDA. The second instance will run on the second GPU (only), and so on.
Assuming bash, and for a given terminal session, you can make this "permanent" by exporting the variable:
export CUDA_VISIBLE_DEVICES="2"
thereafter, all CUDA applications run in that session will observe only the third enumerated GPU (enumeration starts at 0), and they will observe that GPU as if it were device 0 in their session.
This means you don't have to make any changes to your application for this method, assuming your app uses the default GPU or GPU 0.
You can also extend this to make multiple GPUs available, for example:
export CUDA_VISIBLE_DEVICES="2,4"
means the GPUs that would ordinarily enumerate as 2 and 4 would now be the only GPUs "visible" in that session and they would enumerate as 0 and 1.
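The renumbering described above can be modelled in plain C++. This sketch only mimics the mapping for simple comma-separated integer lists; it is not how the CUDA runtime is implemented, the function names are hypothetical, and it does not handle other forms a real CUDA_VISIBLE_DEVICES value may take (std::stoi will throw on non-numeric entries).

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical model: element i of the result is the physical GPU index that a
// process would see as device i, given a value such as "2,4".
std::vector<int> visible_to_physical(const std::string& cuda_visible_devices) {
    std::vector<int> physical;
    std::stringstream ss(cuda_visible_devices);
    std::string token;
    while (std::getline(ss, token, ',')) {
        if (!token.empty()) physical.push_back(std::stoi(token));
    }
    return physical;
}

// Physical index behind the device number the application asks for.
int physical_index(const std::vector<int>& mapping, std::size_t logical_device) {
    assert(logical_device < mapping.size());
    return mapping[logical_device];
}
```

For the value "2,4", logical device 0 maps to physical GPU 2 and logical device 1 to physical GPU 4, which is why an application hardcoded to device 0 needs no changes.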
In my opinion the above approach is the easiest. Selecting a GPU that "isn't in use" is problematic because:
we need a definition of "in use"
A GPU that was in use at a particular instant may not be in use immediately after that
Most important, a GPU that is not "in use" could become "in use" asynchronously, meaning you are exposed to race conditions.
So the best advice (IMO) is to manage the GPUs explicitly. Otherwise you need some form of job scheduler (outside the scope of this question, IMO) to be able to query unused GPUs and "reserve" one before another app tries to do so, in an orderly fashion.
There is a better (more automatic) way, which we use in PIConGPU that is run on huge (and different) clusters.
See the implementation here: https://github.com/ComputationalRadiationPhysics/picongpu/blob/909b55ee24a7dcfae8824a22b25c5aef6bd098de/src/libPMacc/include/Environment.hpp#L169
Basically: call cudaGetDeviceCount to get the number of GPUs, iterate over them, call cudaSetDevice to set each one as the current device, and check whether that worked. This check could involve test-creating a stream, because of a bug in CUDA that made cudaSetDevice succeed while all later calls failed because the device was actually in use.
Note: You may need to set the GPUs to exclusive mode so that a GPU can only be used by one process. If one "batch" doesn't have enough data, you may want the opposite: multiple processes submitting work to one GPU. So tune according to your needs.
Other ideas: start an MPI application with the same number of processes per node as there are GPUs, and use the local rank number as the device number. This would also help in applications like yours that have different datasets to distribute. So you can e.g. have MPI rank 0 process length1-data and MPI rank 1 process length2-data etc.
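The probing loop described above can be sketched as follows. To keep it compilable without the CUDA toolkit, the device check is passed in as a callback: `try_claim` is a hypothetical stand-in for "cudaSetDevice(i) followed by a test operation such as creating a stream", and `device_count` for the value reported by cudaGetDeviceCount. This is not the PIConGPU code, just a model of the idea, and it only makes sense if the GPUs are in exclusive mode.

```cpp
#include <cassert>
#include <functional>

// Returns the index of the first device that try_claim accepts, or -1 if none.
// try_claim (hypothetical) must return true only if device i is really usable
// by this process.
int pick_first_free_device(int device_count,
                           const std::function<bool(int)>& try_claim) {
    assert(device_count >= 0);
    for (int i = 0; i < device_count; ++i) {
        if (try_claim(i)) return i;
    }
    return -1;  // every device was busy or unusable
}
```

Because the claim and the check happen in one step per device, a process either ends up holding a GPU or moves on to the next one, which sidesteps the race described in the previous answer.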

parallel processing library

I would like to know which parallel processing library is best suited to these configurations:
A single quad core machine. I would like to execute four functions of the same type, one on each core. The same function takes different arguments.
A cluster of 4 machines, each of them multi-core. I would like to execute the same functions n-way parallel (4 machines * number of cores in each machine), so I want it to scale.
Program details:
C++ program. There is no dependency between functions. The same function gets executed with different sets of inputs and runs to completion more than 100 times.
There is no shared memory, as each function takes its own data and its own inputs.
No function needs to wait for the others to complete. There is no need for join or fork.
For the above scenarios, which parallel libraries would be best? MPI, BOOST::MPI, OpenMP or other libs?
My preference would be BOOST::MPI but I would like some recommendations. I am not sure whether MPI can be used on multi-core machines.
Thanks.
What you have here is an embarrassingly parallel problem (http://en.wikipedia.org/wiki/Embarrassingly_parallel). While MPI can definitely be used on a multi-core machine, it could be overkill for the problem at hand. If your tasks are completely separate, you could just compile them into separate executables, or a single executable with different inputs, and use "make -j [n]" (see http://www.gnu.org/software/make/manual/html_node/Parallel.html) to execute them in parallel.
If MPI comes naturally to you, by all means, use it. OpenMP probably won't cut it if you want to control computing on separate computers within a cluster.
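Neither answer spells this out, but for the single quad-core machine the C++ standard library alone may be enough. The following is a minimal sketch, assuming C++11; `process` is a placeholder name and body standing in for the real function, and std::async is a swapped-in technique here, not something the answers above recommend. On some toolchains it needs -pthread at link time.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Stand-in for the real, independent function (placeholder body).
long process(long input) {
    assert(input >= 0);
    long acc = 0;
    for (long i = 1; i <= input; ++i) acc += i;
    return acc;
}

// Launch one independent task per input and collect the results in input order.
std::vector<long> run_all(const std::vector<long>& inputs) {
    std::vector<std::future<long>> pending;
    pending.reserve(inputs.size());
    for (long in : inputs) {
        // std::launch::async requests a separate thread for each call
        pending.push_back(std::async(std::launch::async, process, in));
    }
    std::vector<long> results;
    results.reserve(pending.size());
    for (auto& f : pending) results.push_back(f.get());  // waits for that task
    return results;
}
```

This starts one thread per input, which is fine for a handful of calls; for the more than 100 runs mentioned in the question, a fixed pool of four workers would likely fit better. It does not cover the cluster case, where MPI or a job launcher is still needed.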

Simultaneous launch of Multiple Kernels using CUDA for a GPU

Is it possible to launch two kernels that do independent tasks simultaneously? For example, if I have this CUDA code:
// host and device initialization
.......
.......
// launch kernel1
myMethod1 <<<.... >>> (params);
// launch kernel2
myMethod2 <<<.....>>> (params);
Assuming that these kernels are independent, is there a facility to launch them at the same time, allocating a few grids/blocks to each? Does CUDA/OpenCL have this provision?
Only devices with CUDA compute capability 2.0 and better (i.e. Fermi) can support multiple simultaneous kernel executions. See section 3.2.6.3 of the CUDA 3.0 programming guide, which states:
Some devices of compute capability 2.0 can execute multiple kernels concurrently. Applications may query this capability by calling cudaGetDeviceProperties() and checking the concurrentKernels property.
The maximum number of kernel launches that a device can execute concurrently is four.
A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.
Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.
You will need SM 2.0 or above for concurrent kernels.
To get concurrent execution you need to manually indicate that there is no dependence between the two kernels. This is because the compiler cannot determine that one kernel will not modify data being used by the other. Such a dependence could come from reading from and writing to the same buffer, which sounds simple enough to detect but is actually much harder, since there can be pointers inside data structures and so on.
To express the independence you must launch the kernels in different streams. The fourth parameter in the triple-chevron syntax specifies the stream; check out the Programming Guide or the SDK concurrentKernels sample.
CUDA compute capability 2.1 = up to 16 concurrent kernels