CUDA transfer memory during kernel execution - c++

I know that CUDA kernels can be "overlapped" by putting them into separate streams, but I'm wondering whether it is possible to transfer memory while a kernel is executing. CUDA kernels are asynchronous, after all.

You can run kernels, host-to-device transfers, and device-to-host transfers concurrently.
http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

Just for clarification: the above is valid only if your device supports it. You can verify this by running deviceQuery and checking the "concurrent copy and execution" attribute.
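As a minimal sketch of what this overlap looks like in practice: each stream below runs a copy-kernel-copy chain, and chains in different streams can overlap on a device with copy/compute overlap. The `scale` kernel and buffer sizes are made up for illustration; pinned host memory (cudaMallocHost) is required for the async copies to actually be asynchronous.

```cpp
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int N = 1 << 20;
    float *h[2], *d[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaMallocHost(&h[i], N * sizeof(float));  // pinned memory, needed for true async copies
        cudaMalloc(&d[i], N * sizeof(float));
        cudaStreamCreate(&s[i]);
    }

    // Each stream's H2D copy, kernel, and D2H copy are ordered within the stream,
    // but the two streams' work can overlap with each other.
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(d[i], h[i], N * sizeof(float), cudaMemcpyHostToDevice, s[i]);
        scale<<<(N + 255) / 256, 256, 0, s[i]>>>(d[i], N, 2.0f);
        cudaMemcpyAsync(h[i], d[i], N * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(s[i]);
        cudaFreeHost(h[i]);
        cudaFree(d[i]);
    }
    return 0;
}
```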

Related

Does CU_CTX_SCHED_BLOCKING_SYNC make kernels synchronous?

Does creating a CUDA context with CU_CTX_SCHED_BLOCKING_SYNC make CUDA kernel launches actually synchronous (i.e. stalling the CPU thread as a normal CPU same-thread function would)?
Documentation only states
CU_CTX_SCHED_BLOCKING_SYNC: Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the GPU to finish work.
but I'm not sure I understood it right.
No.
These flags control how the host thread behaves when a host<->device synchronization API such as cuCtxSynchronize, cuEventSynchronize, or cuStreamSynchronize is called. Other non-blocking API calls remain asynchronous in both cases.
There are two models of host behaviour: blocking and yielding. Blocking means the calling host thread spins while waiting for the call to return, blocking other threads' access to the driver; yielding means it can yield to other host threads trying to interact with the GPU driver.
If you want to enforce blocking behaviour on kernel launch, use the CUDA_LAUNCH_BLOCKING environment variable.
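For reference, here is a sketch of the runtime-API equivalent of creating a context with CU_CTX_SCHED_BLOCKING_SYNC. It only changes how the host thread waits inside synchronization calls; kernel launches themselves stay asynchronous.

```cpp
#include <cuda_runtime.h>

int main() {
    // Must be called before the context is created, i.e. before any other CUDA call
    // that touches the device.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaSetDevice(0);

    // ... launch kernels asynchronously as usual ...

    // The host thread now sleeps on an OS primitive here instead of spin-waiting.
    cudaDeviceSynchronize();
    return 0;
}
```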

Concurrent kernels execution

I want to know whether using multiple CUDA streams provides concurrent execution of the kernels themselves, or only concurrency between copies and kernel execution. Actually, I'm looking for a way to execute multiple kernels concurrently.
Can anyone help me?
CUDA streams are required for most types of asynchronous concurrent execution, except host/device concurrency. Memcpy/compute overlap and concurrent kernels require streams.
Many folks have the mistaken idea that they can use concurrent kernel execution to run arbitrary kernels in parallel. But concurrent kernel execution generally is only visible when the kernels to be executed are small in terms of their resource usage (blocks, registers, shared memory). A kernel that uses a lot of threadblocks, a lot of registers, or a lot of shared memory may not run concurrently with other kernels -- because it is utilizing the entire machine by itself.
You can get started with concurrent kernel execution by studying and running the concurrent kernels sample in the CUDA sample codes.
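A minimal sketch in the spirit of that sample: several small kernels launched into separate streams from one host thread. The `spin` kernel is made up for illustration; concurrency only shows up if each kernel leaves resources (blocks, registers, shared memory) free for the others, as noted above.

```cpp
#include <cuda_runtime.h>

__global__ void spin(clock_t cycles) {
    clock_t start = clock();
    while (clock() - start < cycles) { /* busy-wait on the GPU */ }
}

int main() {
    const int nStreams = 4;
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

    // One block of one thread per kernel: small enough that all four
    // can be resident on the device at the same time.
    for (int i = 0; i < nStreams; ++i)
        spin<<<1, 1, 0, streams[i]>>>(100000000);

    cudaDeviceSynchronize();
    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(streams[i]);
    return 0;
}
```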

How to reduce CUDA synchronize latency / delay

This question is related to using CUDA streams to run many kernels.
In CUDA there are many synchronization commands:
cudaStreamSynchronize,
cudaDeviceSynchronize,
cudaThreadSynchronize,
and also cudaStreamQuery to check whether streams are empty.
I noticed when using the profiler that these synchronize commands introduce a large delay to the program. I was wondering if anyone knows a way to reduce this latency, apart from, of course, using as few synchronization commands as possible.
Also, are there any figures to judge the most efficient synchronization method? That is, consider three streams used in an application where two of them need to complete before I launch work on a fourth stream: should I use two cudaStreamSynchronize calls or just one cudaDeviceSynchronize? Which will incur less loss?
The main difference between synchronize methods is "polling" and "blocking."
"Polling" is the default mechanism for the driver to wait for the GPU - it waits for a 32-bit memory location to attain a certain value written by the GPU. It may return the wait more quickly after the wait is resolved, but while waiting, it burns a CPU core looking at that memory location.
"Blocking" can be requested by calling cudaSetDeviceFlags() with cudaDeviceScheduleBlockingSync, or calling cudaEventCreate() with cudaEventBlockingSync. Blocking waits cause the driver to insert a command into the DMA command buffer that signals an interrupt when all preceding commands in the buffer have been executed. The driver can then map the interrupt to a Windows event or a Linux file handle, enabling the synchronization commands to wait without constantly burning CPU, as do the default polling methods.
The queries are basically a manual check of that 32-bit memory location used for polling waits; so in most situations, they are very cheap. But if ECC is enabled, the query will dive into kernel mode to check if there are any ECC errors; and on Windows, any pending commands will be flushed to the driver (which requires a kernel thunk).
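For the specific "wait on two of three streams" case in the question, a host-side synchronization can often be avoided entirely by recording events in the streams that must finish and making the fourth stream wait on them on the device. This is a sketch only; the surrounding kernel launches are placeholders.

```cpp
#include <cuda_runtime.h>

void launch_after_two_streams(cudaStream_t s1, cudaStream_t s2, cudaStream_t s4) {
    cudaEvent_t e1, e2;
    // cudaEventDisableTiming makes events cheaper when they are only used for ordering.
    cudaEventCreateWithFlags(&e1, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&e2, cudaEventDisableTiming);

    // ... kernels already queued in s1 and s2 ...
    cudaEventRecord(e1, s1);
    cudaEventRecord(e2, s2);

    // Work queued into s4 after these calls starts only once e1 and e2 have fired;
    // the host thread never blocks.
    cudaStreamWaitEvent(s4, e1, 0);
    cudaStreamWaitEvent(s4, e2, 0);
    // ... launch the dependent kernel into s4 here ...

    cudaEventDestroy(e1);
    cudaEventDestroy(e2);
}
```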

Calls to GPU kernel from a multithreaded C++ application?

I'm re-implementing some sections of an image processing library that's multithreaded C++ using pthreads. I'd like to be able to invoke a CUDA kernel in every thread and trust the device itself to handle kernel scheduling, but I know better than to count on that behavior. Does anyone have any experience with this type of issue?
CUDA 4.0 made it much simpler to drive a single CUDA context from multiple threads - just call cudaSetDevice() to specify which CUDA device the thread should submit commands to.
Note that this is likely to be less efficient than driving the CUDA context from a single thread - unless the CPU threads have other work to keep them occupied between kernel launches, they are likely to get serialized by the mutexes that CUDA uses internally to keep its data structures consistent.
Perhaps CUDA streams are the solution to your problem. Try invoking kernels from a different stream in each thread. However, I don't see how this will help, as I suspect your kernel executions will be serialized even though they are invoked in parallel. In fact, CUDA kernel invocations, even on the same stream, are asynchronous by nature, so you can make any number of invocations from the same thread. I don't really understand what you are trying to achieve.
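A sketch of the per-thread-stream pattern described above, using std::thread in place of pthreads. The `processTile` kernel is a made-up stand-in for the image-processing work; whether the kernels actually overlap still depends on their resource usage and on the device.

```cpp
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void processTile(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void worker(int device, float *d_data, int n) {
    cudaSetDevice(device);          // bind this host thread to the device's context
    cudaStream_t stream;
    cudaStreamCreate(&stream);      // private stream per host thread
    processTile<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

int main() {
    const int n = 1 << 20, nThreads = 4;
    float *d_data[nThreads];
    for (int i = 0; i < nThreads; ++i) cudaMalloc(&d_data[i], n * sizeof(float));

    std::vector<std::thread> threads;
    for (int i = 0; i < nThreads; ++i)
        threads.emplace_back(worker, 0, d_data[i], n);
    for (auto &t : threads) t.join();

    for (int i = 0; i < nThreads; ++i) cudaFree(d_data[i]);
    return 0;
}
```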

concurrent kernel execution

Is it possible to launch kernels from different threads of a (host) application and have them run concurrently on the same GPGPU device? If not, do you know of any plans (of Nvidia) to provide this capability in the future?
The programming guide http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf says:
3.2.7.3 Concurrent Kernel Execution
Some devices of compute capability 2.0 can execute multiple kernels concurrently. Applications may query this capability by calling cudaGetDeviceProperties() and checking the concurrentKernels property.
The maximum number of kernel launches that a device can execute concurrently is sixteen.
So the answer is: it depends, and it depends only on the device; host threads won't make a difference either way. Kernel launches are serialized if the device doesn't support concurrent kernel execution; if it does, kernel launches issued on different streams may execute concurrently.
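A short sketch of the capability check the programming guide describes, querying the device properties at runtime:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("concurrentKernels: %s\n", prop.concurrentKernels ? "yes" : "no");
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);  // copy/compute overlap engines
    return 0;
}
```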