concurrent kernel execution - concurrency

Is it possible to launch kernels from different threads of a (host) application and have them run concurrently on the same GPGPU device? If not, do you know of any plans (of Nvidia) to provide this capability in the future?

The programming guide http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf says:
3.2.7.3 Concurrent Kernel Execution
Some devices of compute capability 2.0 can execute multiple kernels concurrently. Applications may query this capability by calling cudaGetDeviceProperties() and checking the concurrentKernels property.
The maximum number of kernel launches that a device can execute concurrently is sixteen.
So the answer is: it depends, and it depends only on the device; host threads won't make a difference either way. Kernel launches are serialized if the device doesn't support concurrent kernel execution; if it does, kernel launches issued to different streams can execute concurrently.
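For reference, a minimal sketch of the capability query the guide describes (device 0 assumed, error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Device: %s\n", prop.name);
    printf("Concurrent kernels supported: %s\n",
           prop.concurrentKernels ? "yes" : "no");
    return 0;
}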

Related

Does CU_CTX_SCHED_BLOCKING_SYNC make kernels synchronous?

Does creating a CUDA context with CU_CTX_SCHED_BLOCKING_SYNC make CUDA kernel launches actually synchronous (i.e. stalling the CPU thread as a normal CPU same-thread function would)?
Documentation only states
CU_CTX_SCHED_BLOCKING_SYNC: Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the GPU to finish work.
but I'm not sure I understood it right.
No.
These flags control how the host thread behaves when a host<->device synchronization API such as cuCtxSynchronize, cuEventSynchronize, or cuStreamSynchronize is called from the host. Other, non-blocking API calls are asynchronous in both cases.
There are two models of host behaviour, blocking or yielding. Blocking means the calling host thread will spin while waiting for the call to return, blocking access to the driver by other threads; yielding means it can yield to other host threads trying to interact with the GPU driver.
If you want to enforce blocking behaviour on kernel launch, use the CUDA_LAUNCH_BLOCKING environment variable.
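For illustration, here is a minimal driver-API sketch of creating a context with this flag (device 0 assumed, error checking omitted); the flag only changes how the host thread waits inside explicit synchronization calls:

#include <cuda.h>   // CUDA driver API

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx;
    // Kernel launches remain asynchronous; only explicit waits are affected.
    cuCtxCreate(&ctx, CU_CTX_SCHED_BLOCKING_SYNC, dev);

    // ... launch work here ...

    cuCtxSynchronize();   // host thread blocks on a primitive instead of spinning
    cuCtxDestroy(ctx);
    return 0;
}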

Creating a cuda stream on each host thread (multi-threaded CPU)

I have a multi-threaded CPU application and I would like each CPU thread to be able to launch a separate CUDA stream. The separate CPU threads will be doing different things at different times, so there is a chance that they won't overlap, but if they do launch a CUDA kernel at the same time I would like it to continue to run concurrently.
I'm pretty sure this is possible because in the CUDA Toolkit documentation section 3.2.5.5. It says "A stream is a sequence of commands (possibly issued by different host threads)..."
So if I want to implement this I would do something like
void main(int CPU_ThreadID) {
    cudaStream_t stream;                              // one stream per CPU thread
    cudaStreamCreate(&stream);
    int *d_a;
    int *a;
    cudaMalloc((void**)&d_a, 100*sizeof(int));
    cudaMallocHost((void**)&a, 100*8*sizeof(int));    // pinned host memory for async copies
    cudaMemcpyAsync(d_a, a + 100*CPU_ThreadID, 100*sizeof(int),
                    cudaMemcpyHostToDevice, stream);
    sum<<<100,32,0,stream>>>(d_a);
    cudaStreamSynchronize(stream);                    // wait for this thread's work before cleanup
    cudaStreamDestroy(stream);
}
That is just a simple example. If I know there are only 8 CPU Threads then I know at most 8 streams will be created. Is this the proper way to do this? Will this run concurrently if two or more different host threads reach this code around the same time? Thanks for any help!
Edit:
I corrected some of the syntax issues in the code block and put in the cudaMemcpyAsync as sgar91 suggested.
It really looks to me like you are proposing a multi-process application, not multithreaded. You don't mention which threading architecture you have in mind, nor even an OS, but the threading architectures I know of don't posit a thread routine called "main", and you haven't shown any preamble to the thread code.
A multi-process environment will generally create one device context per process, which will inhibit fine-grained concurrency.
Even if that's just an oversight, I would point out that a multi-threaded application should establish a GPU context on the desired device before threads are spawned.
Each thread can then issue a cudaSetDevice(0); or similar call, which should cause each thread to pick up the established context on the indicated device.
Once that is in place, you should be able to issue commands to the desired streams from whichever threads you like.
You may wish to refer to the cudaOpenMP sample code. Although it omits the streams concept, it demonstrates a multi-threaded app with the potential for multiple threads to issue commands to the same device (and it could be extended to use the same stream).
Whether kernels actually run concurrently once the above issues have been addressed is a separate question. Concurrent kernel execution has a number of requirements, and the kernels themselves must have compatible resource requirements (blocks, shared memory, registers, etc.), which generally implies "small" kernels.
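As an illustration of the approach described above, here is a minimal sketch using C++11 std::thread. The sum kernel, the buffer layout, and the 8-thread count are placeholders borrowed from the question, not a definitive implementation:

#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void sum(int *d_a)          // placeholder kernel
{
    if (threadIdx.x == 0) d_a[blockIdx.x] += 1;
}

void worker(int tid, int *h_a)
{
    cudaSetDevice(0);                  // pick up the context created by the main thread

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *d_a;
    cudaMalloc((void**)&d_a, 100*sizeof(int));
    cudaMemcpyAsync(d_a, h_a + 100*tid, 100*sizeof(int),
                    cudaMemcpyHostToDevice, stream);
    sum<<<100,32,0,stream>>>(d_a);

    cudaStreamSynchronize(stream);     // wait only for this thread's work
    cudaFree(d_a);
    cudaStreamDestroy(stream);
}

int main()
{
    cudaSetDevice(0);
    cudaFree(0);                       // force context creation before spawning threads

    int *h_a;
    cudaMallocHost((void**)&h_a, 8*100*sizeof(int));   // pinned host buffer shared by all threads

    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)
        threads.emplace_back(worker, t, h_a);
    for (auto &th : threads)
        th.join();

    cudaFreeHost(h_a);
    return 0;
}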

Concurrent kernels execution

I want to know whether using multiple CUDA streams provides concurrent execution of kernels, or whether it only provides concurrency of copy and kernel execution.
Actually, I'm looking for a way to execute multiple kernels concurrently.
Can anyone help me?
CUDA streams are required for most types of asynchronous concurrent execution, except host/device concurrency. Memcpy/compute overlap and concurrent kernels require streams.
Many folks have the mistaken idea that they can use concurrent kernel execution to run arbitrary kernels in parallel. But concurrent kernel execution generally is only visible when the kernels to be executed are small in terms of their resource usage (blocks, registers, shared memory). A kernel that uses a lot of threadblocks, a lot of registers, or a lot of shared memory may not run concurrently with other kernels -- because it is utilizing the entire machine by itself.
You can get started with concurrent kernel execution by studying and running the concurrent kernels sample in the CUDA sample codes.
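In the spirit of that sample, here is a minimal sketch: a deliberately tiny, made-up spin kernel launched into several streams. Whether the launches actually overlap depends on the device (check the concurrentKernels property) and on resource usage; a profiler can confirm the overlap:

#include <cuda_runtime.h>

// Deliberately tiny kernel (one block, few threads, no shared memory) so
// several instances can be resident on the device at the same time.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    const int nStreams = 4;
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // One launch per stream; on a device that supports concurrent kernel
    // execution these may overlap in time.
    for (int i = 0; i < nStreams; ++i)
        spin<<<1, 32, 0, streams[i]>>>(100000000LL);

    cudaDeviceSynchronize();
    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}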

CUDA transfer memory during kernel execution

I know that CUDA kernels can be "overlapped" by putting them into separate streams, but I'm wondering whether it would be possible to transfer memory during kernel execution. CUDA kernels are asynchronous, after all.
You can run kernels, transfers from host to device and transfers from device to host concurrently.
http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
Just for clarification, the above is valid only if your device supports it. You can check by running the deviceQuery sample and looking at the "concurrent copy and execution" attribute.
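A minimal sketch of that kind of overlap, assuming a device that supports it and using pinned host memory (the busy kernel is a placeholder):

#include <cuda_runtime.h>

__global__ void busy(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    const int N = 1 << 20;
    float *h_in, *d_in, *d_work;
    cudaMallocHost((void**)&h_in, N*sizeof(float));   // pinned memory, needed for true async copies
    cudaMalloc((void**)&d_in,   N*sizeof(float));
    cudaMalloc((void**)&d_work, N*sizeof(float));

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    // The kernel in computeStream and the host-to-device transfer in
    // copyStream can proceed at the same time on devices that support
    // concurrent copy and kernel execution.
    busy<<<(N + 255)/256, 256, 0, computeStream>>>(d_work, N);
    cudaMemcpyAsync(d_in, h_in, N*sizeof(float),
                    cudaMemcpyHostToDevice, copyStream);

    cudaDeviceSynchronize();
    // cleanup omitted for brevity
    return 0;
}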

Kernel threads in POSIX

As far as I understand, the kernel has kernel threads for each core in a computer, and threads from userspace are scheduled onto these kernel threads (the OS decides which thread from an application gets connected to which kernel thread). Let's say I want to create an application that uses X cores on a computer with X cores. If I use regular pthreads, I think it would be possible for the OS to decide to schedule all the threads I created onto a single core. How can I ensure that each thread is one-on-one with a kernel thread?
You should basically trust the kernel you are using (in particular because there could be another heavy process running; the kernel scheduler chooses which tasks run during each quantum of time).
Perhaps you are interested in CPU affinity, with non-portable functions like pthread_attr_setaffinity_np.
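For illustration, a minimal Linux/glibc sketch that pins each thread to its own core at creation time using this non-portable GNU extension (a 4-core machine is assumed, error checking omitted):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    printf("running on a thread pinned to core %ld\n", (long)arg);
    return NULL;
}

int main(void)
{
    enum { NCORES = 4 };                 /* assume 4 cores for the example */
    pthread_t tid[NCORES];

    for (long core = 0; core < NCORES; ++core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);             /* allow this thread on one core only */

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

        pthread_create(&tid[core], &attr, worker, (void *)core);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < NCORES; ++i)
        pthread_join(tid[i], NULL);
    return 0;
}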
Your understanding is a bit off. 'Kernel threads' on Linux are basically kernel tasks that are scheduled alongside other processes and threads. When the kernel's scheduler runs, the scheduling algorithm decides which process/thread, out of the pool of runnable threads, will be scheduled to run next on a given CPU core. As @Basile Starynkevitch mentioned, you can tell the kernel to pin individual threads of your application to a particular core, which means the operating system's scheduler will only consider running them on that core, along with other threads that are not pinned to a particular core.
In general with multithreading, you don't want the number of threads to equal the number of cores unless you're doing exclusively CPU-bound processing; otherwise you want more threads than cores. When waiting for network or disk IO (i.e. when you're blocked in an accept(2), recv(2), or read(2)), your thread is not considered runnable. If the number of threads is greater than the number of cores, the operating system may be able to schedule a different thread of yours to do work while you wait for that IO.
What you mention is one possible model for implementing threading. But such a hierarchical model may not be followed at all by a given POSIX thread implementation. Since somebody already mentioned Linux: it doesn't have such a model; all threads are equal from the point of view of the scheduler there. They compete for the same resources unless you specify something extra.
The last time I saw such a hierarchical model was on a machine running IRIX, a long time ago.
So in summary, there is no general rule under POSIX for this; you'd have to look up the documentation of your particular OS or ask a more specific question about it.