I want to know whether using multiple CUDA streams provides concurrent execution of kernels, or whether it only provides concurrency between copies and kernel execution.
Actually, I'm looking for a way to execute multiple kernels concurrently.
Can anyone help me?
CUDA streams are required for most types of asynchronous concurrent execution, except host/device concurrency. Memcpy/compute overlap and concurrent kernels require streams.
Many folks have the mistaken idea that they can use concurrent kernel execution to run arbitrary kernels in parallel. But concurrent kernel execution generally is only visible when the kernels to be executed are small in terms of their resource usage (blocks, registers, shared memory). A kernel that uses a lot of threadblocks, a lot of registers, or a lot of shared memory may not run concurrently with other kernels -- because it is utilizing the entire machine by itself.
You can get started with concurrent kernel execution by studying and running the concurrent kernels sample in the CUDA sample codes.
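If it helps to get started, here is a rough sketch of the pattern (not the actual CUDA sample code; the kernel, sizes, and stream count are made up for illustration): several small kernels are launched into separate streams from one host thread, and whether they actually overlap depends on the device and on each kernel's resource footprint.

#include <cuda_runtime.h>

// A deliberately small kernel so several copies can fit on the device at once.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k)    // artificial work
            v = v * 0.999f + 0.001f;
        data[i] = v;
    }
}

int main()
{
    const int nStreams = 4;
    const int n = 1024;
    cudaStream_t streams[nStreams];
    float *d_buf[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void**)&d_buf[s], n * sizeof(float));
        // Few blocks, modest registers/shared memory: this is what leaves
        // room for kernels in different streams to execute concurrently.
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], n);
    }

    cudaDeviceSynchronize();              // wait for all streams to finish

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}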
Currently I am building a TensorFlow graph and then calling graph->getSession()->Run(input, model, output) in C++ on the CPU.
I want to achieve concurrency. What are my options for executing in parallel so that I can serve multiple requests concurrently?
Can I run sessions in a multi-threaded way?
If I execute multiple sessions in parallel, will the processing time stay roughly constant? For example, if one session takes 100 ms, will running 2 sessions concurrently also take approximately 100 ms?
Note: I want to run this on CPU
The first thing to note is that TensorFlow will use all cores for processing by default. You have some limited control over this via inter- and intra-op parallelism, discussed in this authoritative answer:
Tensorflow: executing an ops with a specific core of a CPU
The second point to note is that a session is thread-safe: you can call it from multiple threads. Each call will see a consistent point-in-time snapshot of the variables as they were when the call began. This is a question I asked once upon a time:
How are variables shared between concurrent `session.run(...)` calls in tensorflow?
The moral:
If you are running lots of small, sequential operations, you can run them concurrently against one session and may be able to squeeze out some improved performance by limiting TensorFlow's use of parallelism. If you are running large operations (such as large matrix multiplications) that benefit more from multi-core processing, you don't need to manage parallelism yourself; TensorFlow already distributes work across all CPU cores by default.
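As a rough sketch of what that can look like in C++ (assuming the standard tensorflow::Session API rather than your graph->getSession() wrapper; the node names "input"/"output", thread counts, and tensor shape are placeholders, not anything from your model):

#include <memory>
#include <string>
#include <thread>
#include <utility>
#include <vector>
#include "tensorflow/core/public/session.h"

// One shared session serving several requests; Session::Run is safe to call
// from multiple threads concurrently.
void ServeRequest(tensorflow::Session* session, const tensorflow::Tensor& input) {
    std::vector<std::pair<std::string, tensorflow::Tensor>> feeds = {{"input", input}};
    std::vector<tensorflow::Tensor> outputs;
    TF_CHECK_OK(session->Run(feeds, {"output"}, {}, &outputs));
}

int main() {
    tensorflow::SessionOptions options;
    // Optionally cap TensorFlow's own parallelism so concurrent requests
    // don't oversubscribe the CPU (the values here are illustrative).
    options.config.set_intra_op_parallelism_threads(2);
    options.config.set_inter_op_parallelism_threads(2);

    std::unique_ptr<tensorflow::Session> session(tensorflow::NewSession(options));
    // ... load your GraphDef and call session->Create(graph_def) here ...

    tensorflow::Tensor input(tensorflow::DT_FLOAT, tensorflow::TensorShape({1, 10}));
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back(ServeRequest, session.get(), input);
    for (auto& t : workers) t.join();
    return 0;
}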
Also, if your graph's dependencies lend themselves to any amount of parallelization, TensorFlow handles this as well. You can set up profiling to see this in action:
https://towardsdatascience.com/howto-profile-tensorflow-1a49fb18073d
In TBB, the task_scheduler_init() method is often (and, it seems, is meant to be) invoked implicitly rather than by the user, which is a deliberate design decision.
However, if we mix TBB and MPI, is it guaranteed to be thread-safe even if we do not control the number of threads in each MPI process?
For example, say we have 7 cores (with no hyper-threading) and 2 MPI processes. If each process spawns its own TBB tasks using 4 threads simultaneously, there is a conflict that might cause the program to crash at runtime.
I'm a newbie to TBB.
Looking forward to your comments and suggestions!
From the Intel TBB runtime's perspective it does not matter whether the process is an MPI process or not, so if you have two processes you will have two independent instances of the Intel TBB runtime. I am not sure I understand the question about thread safety, but it should work correctly. However, you may get oversubscription, which can lead to performance issues.
In addition, if you use MPI routines concurrently (from multiple threads), you should check the MPI implementation's documentation, because that can cause issues.
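A minimal sketch of one common mitigation, assuming the classic task_scheduler_init API you mention (newer TBB releases would use tbb::global_control instead); the core count and the split between ranks are purely illustrative:

#include <mpi.h>
#include <tbb/task_scheduler_init.h>
#include <tbb/parallel_for.h>

int main(int argc, char** argv)
{
    // Request a threading level up front in case TBB worker threads ever
    // touch MPI; check the MPI docs for what your implementation supports.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Avoid oversubscription: give each rank roughly an equal share of the
    // cores instead of letting every rank grab all of them.
    const int total_cores = 7;                        // illustrative
    const int threads_per_rank = total_cores / size;  // e.g. 3 per rank for 2 ranks
    tbb::task_scheduler_init init(threads_per_rank);

    tbb::parallel_for(0, 1000, [](int /*i*/) {
        // ... per-element work owned by this rank ...
    });

    MPI_Finalize();
    return 0;
}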
Generally speaking, this is a two-step tango:
MPI binds each task to a set of resources;
the thread runtime (TBB; the same would apply to OpenMP) is generally smart enough to bind its threads within the resources provided in the previous step.
Bottom line: if MPI binds its tasks to non-overlapping resources, there should be no conflict caused by the TBB runtime.
A typical scenario is to run 2 MPI tasks with 8 OpenMP threads per task on a dual-socket, octo-core box. As long as MPI binds each task to a socket and the OpenMP runtime is told to bind its threads to cores, performance will be optimal.
I have a multi-threaded CPU application and I would like each CPU thread to be able to launch work into a separate CUDA stream. The separate CPU threads will be doing different things at different times, so there is a chance their work won't overlap, but if they do launch CUDA kernels at the same time, I would like those kernels to run concurrently.
I'm pretty sure this is possible, because CUDA Toolkit documentation section 3.2.5.5 says: "A stream is a sequence of commands (possibly issued by different host threads)..."
So if I wanted to implement this, I would do something like:
void main(int CPU_ThreadID) {
    cudaStream_t stream;                   // one stream per CPU thread
    cudaStreamCreate(&stream);
    int *d_a;
    int *a;
    cudaMalloc((void**)&d_a, 100*sizeof(int));
    cudaMallocHost((void**)&a, 100*8*sizeof(int));
    cudaMemcpyAsync(d_a, &a[100*CPU_ThreadID], 100*sizeof(int), cudaMemcpyHostToDevice, stream);
    sum<<<100,32,0,stream>>>(d_a);
    cudaStreamDestroy(stream);
}
That is just a simple example. If I know there are only 8 CPU Threads then I know at most 8 streams will be created. Is this the proper way to do this? Will this run concurrently if two or more different host threads reach this code around the same time? Thanks for any help!
Edit:
I corrected some of the syntax issues in the code block and put in the cudaMemcpyAsync as sgar91 suggested.
It really looks to me like you are proposing a multi-process application, not multithreaded. You don't mention which threading architecture you have in mind, nor even an OS, but the threading architectures I know of don't posit a thread routine called "main", and you haven't shown any preamble to the thread code.
A multi-process environment will generally create one device context per process, which will inhibit fine-grained concurrency.
Even if that's just an oversight, I would point out that a multi-threaded application should establish a GPU context on the desired device before threads are spawned.
Each thread can then issue a cudaSetDevice(0); or similar call, which should cause each thread to pick up the established context on the indicated device.
Once that is in place, you should be able to issue commands to the desired streams from whichever threads you like.
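A minimal sketch of that pattern, using OpenMP threads in the spirit of that sample (the kernel, sizes, and thread count are placeholders):

#include <omp.h>
#include <cuda_runtime.h>

__global__ void myKernel(int *d_data) { /* placeholder work */ }

int main()
{
    cudaSetDevice(0);
    cudaFree(0);            // touch the device so the context exists before threads spawn

    #pragma omp parallel num_threads(8)
    {
        cudaSetDevice(0);   // each thread picks up the already-established context
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        int *d_data;
        cudaMalloc((void**)&d_data, 100 * sizeof(int));
        myKernel<<<1, 32, 0, stream>>>(d_data);   // issued into this thread's own stream
        cudaStreamSynchronize(stream);

        cudaFree(d_data);
        cudaStreamDestroy(stream);
    }
    return 0;
}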
You may wish to refer to the cudaOpenMP sample code. Although it omits the streams concept, it demonstrates a multi-threaded app with the potential for multiple threads to issue commands to the same device (and it could be extended to the same stream).
Whether kernels actually run concurrently once the above issues have been addressed is a separate question. Concurrent kernel execution has a number of requirements, and the kernels themselves must have compatible resource requirements (blocks, shared memory, registers, etc.), which generally implies "small" kernels.
I'm re-implementing some sections of an image-processing library as multithreaded C++ using pthreads. I'd like to be able to invoke a CUDA kernel from every thread and trust the device itself to handle kernel scheduling, but I know better than to count on that behavior. Does anyone have any experience with this type of issue?
CUDA 4.0 made it much simpler to drive a single CUDA context from multiple threads - just call cudaSetDevice() to specify the CUDA device to which you want the thread to submit commands.
Note that this is likely to be less efficient than driving the CUDA context from a single thread - unless the CPU threads have other work to keep them occupied between kernel launches, they are likely to get serialized by the mutexes that CUDA uses internally to keep its data structures consistent.
Perhaps CUDA streams are the solution to your problem: try invoking kernels from a different stream in each thread. However, I don't see how this will help, as I think your kernel executions will be serialized even though they are invoked in parallel. In fact, CUDA kernel invocations, even on the same stream, are asynchronous by nature, so you can make any number of invocations from the same thread. I really don't understand what you are trying to achieve.
Is it possible to launch kernels from different threads of a (host) application and have them run concurrently on the same GPGPU device? If not, do you know of any plans by Nvidia to provide this capability in the future?
The programming guide http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf says:
3.2.7.3 Concurrent Kernel Execution
Some devices of compute capability 2.0 can execute multiple kernels concurrently. Applications may query this capability by calling cudaGetDeviceProperties() and checking the concurrentKernels property.
The maximum number of kernel launches that a device can execute concurrently is sixteen.
So the answer is: it depends. It actually depends only on the device; host threads won't make any difference. Kernel launches are serialized if the device doesn't support concurrent kernel execution; if it does, kernel launches on different streams can execute concurrently.
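For what it's worth, checking the property the guide refers to might look like this (a small sketch using the runtime API on device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    // 1 means kernels launched into different streams *may* run concurrently;
    // 0 means the device will serialize kernel launches.
    printf("%s: concurrentKernels = %d\n", prop.name, prop.concurrentKernels);
    return 0;
}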