CUDA concurrent kernel execution behavior and efficiency - concurrency

I don't have a CUDA card yet, and I have to focus on OpenCL for now. So... I think I'd better just ask:
1. Are kernels executed in the order I invoke them?
If I invoke A through stream 0, B through stream 1, C through stream 0, D through stream 1, and E through stream 0, is it ensured that the device sees the kernels in the order A, B, C, D, E?
If I invoke kernels A and B through stream 0, and then invoke C through stream 1, will B block C? Do I have to invoke them in the order A, C, B to allow C to run concurrently with A and B?
2. Are there any stalls or penalties if I want kernels to run concurrently?
On AMD cards, inter-queue dependencies seem to be very expensive (I may be wrong; actually I hope I'm wrong, but so far no one has been able to tell me whether I am).
Say I have kernels A, B, and C, where A and B are independent and C depends on both. On AMD cards there is a huge delay if I make C wait on A or B, which makes synchronized execution faster in almost all situations.
What I understand now is that a CUDA card has only one queue for computation. That is, I can express dependencies through the order in which I invoke kernels instead of through events as on AMD cards. Will that be more efficient, or even penalty-free?

It depends on the command queue you created. If it is an in-order queue, the kernels are executed in the order you submitted them. If it is an out-of-order queue, the runtime may execute them out of order and perhaps even concurrently, but it does not have to. Some devices or drivers don't support out-of-order queues and simply treat them as in-order.
Managing an out-of-order command queue moves the dependency burden onto the host application; you need to use event objects to build a dependency graph.
Another (I think easier) way to get concurrent execution is to use multiple (likely in-order) command queues. Put independent work in each, and the runtime is allowed to run kernels (one from each) concurrently. It doesn't have to, but if it can, it should.

On newer devices, kernels from different streams can be executed out of order. The behavior I described in the question would only happen on very old architectures.
A kernel will start executing as soon as possible. Invoking A and B in different streams with B waiting on A makes no obvious difference from invoking A and B in order in the same stream.

Related

MPI Send and Receive modes for a large number of processors

I know there are a ton of questions & answers about the different modes of MPI send and receive out there, but I believe mine is different or I am simply not able to apply these answers to my problem.
Anyway, my scenario is as follows. The code is intended for high performance clusters with potentially thousands of cores, organized into a multi-dimensional grid. In my algorithm, there are two successive operations that need to be carried out; let's call them A and B, where A precedes B. A brief description is as follows:
A: Each processor has multiple buffers. It has to send each of these buffers to a certain set of processors. For each buffer to be sent, the set of receiving processors might differ. Sending is the last step of operation A.
B: Each processor receives a set of buffers from a set of processors. Operation B then works on these buffers once it has received all of them. The result of that operation is stored in a fixed location (in neither the send nor the receive buffers).
The following properties are also given:
in A, every processor can compute which processors to send to, and it can also compute a corresponding tag in case a processor receives multiple buffers from the same sending processor (which is very likely).
in B, every processor can also compute which processors it will receive from, and the corresponding tags that the messages were sent with.
Each processor has its own send buffers and receive buffers, and these are disjoint (i.e. there is no processor that uses its send buffer as a receive buffer as well, and vice versa).
A and B are executed in a loop among other operations before A and after B. We can ensure that the send buffer will not be used again until the next loop iteration, where it is filled with new data in A, and the receive buffers will also not be used again until the next iteration where they are used to receive new data in operation B.
The transition between A and B should, if possible, be a synchronization point, i.e. we want to ensure that all processors enter B at the same time
To send and receive, both A and B have to use (nested) loops of their own to send and receive the different buffers. However, we cannot make any assumption about the order of these send and receive statements, i.e. for any two buffers buf0 and buf1 we cannot guarantee that if buf0 is received by some processor before buf1, buf0 was also sent before buf1. Note that at this point, using collective operations like MPI_Bcast etc. is not an option yet due to the complexity of determining the sets of receiving/sending processors.
Question: Which send and receive modes should I use? I have read a lot about these different modes, but I cannot really wrap my head around them. The most important property is deadlock-freedom, and the next most important thing is performance. I am tending towards using MPI_Isend() in A without checking the request status, using the non-blocking MPI_Irecv() in B's loops, and then using MPI_Waitall() to ensure that all buffers were received (and, as a consequence, that all buffers have been sent and the processors are synchronized).
Is this the correct approach, or do I have to use buffered sends or something entirely different? I don't have a ton of experience in MPI and the documentation does not really help me much either.
From how you describe your problem, I think MPI_Isend is likely to be the best (only?) option for A, because it's guaranteed non-blocking, whereas MPI_Send may be non-blocking only if the implementation is able to buffer your send buffer internally.
You should then be able to use an MPI_Barrier so that all processes enter B at the same time. But this might be a performance hit. If you don't insist that all processes enter B at the same time, some can begin receiving messages sooner. Furthermore, given your disjoint send & receive buffers, this should be safe.
For B, you can use MPI_Irecv or MPI_Recv. MPI_Irecv is likely to be faster, because a standard MPI_Recv might be waiting around for a slower send from another process.
Whether or not you block on the receiving end, you ought to call MPI_Waitall before finishing the loop to ensure all send/recv operations have completed successfully.
An additional point: you could leverage MPI_ANY_SOURCE with MPI_Recv to receive messages in a blocking manner and immediately operate on them, no matter what order they arrive. However, given that you specified that no operations on the data happen until all data are received, this might not be that useful.
Finally: as mentioned in these recommendations, you will get the best performance if you can restructure your code so that you can just use MPI_Ssend. In this case you avoid any buffering at all. To achieve this, you'd have to have all processes first call MPI_Irecv, then begin sending via MPI_Ssend. It might not be as hard as you think to refactor in this way, particularly if, as you say, each process can work out independently which messages it will receive from whom.

How to ensure MPI asynchronous calls are executed in parallel?

I have a program that I am trying to accelerate, but I am not sure that the MPI asynchronous communication is behaving in the way I expect. I expect that a thread is created which performs the communication while the original thread can continue computation in parallel. How can I ensure the program executes this way?
The base program issues an Allgather every x iterations containing x iterations worth of future data. Any iteration only uses data produced x iterations ago. My idea was that instead of batching iterations of data together into a blocking Allgather (what the base program already does), I could issue an asynchronous Iallgather after every iteration and just check that the data has arrived x iterations from now. The program is ~90% computation so I thought this presented ample opportunity to hide the latencies of the communication. Unfortunately, the program is much slower now than before I started on it. The system has a message passing latency which is much smaller than the time it takes to perform x iterations of code for messages of this size.
I modified my code to try and debug a bit. I had 2 ideas about what could be causing the problem:
The asynchronous communication is just slower for some reason
The communication and computation are not being performed in parallel
To test (1), I turned all of the Iallgathers into Allgathers - the results were identical. This led me to believe that my Iallgathers were not being performed in parallel.
To test (2), I am not at all sure I did this right. I thought that calling MPI_Wait(&my_handle, ...) might force the calling thread to progress the transmission, so I did something like this:
MPI_Iallgather(send_data, send_data_size, ..., &handle);
#pragma omp task
{
MPI_Wait(&handle, MPI_STATUS_IGNORE);
}
This may be the wrong approach, maybe the thread that issues the Iallgather has to perform it.
In summary: I want computation to continue while an Iallgather or any asynchronous communication call is being performed in parallel. How do I ensure that these are being performed as I expect? Is there a chance that it is being performed in parallel and the performance is garbage? I would expect that I would at least see a difference in execution time when switching Iallgather to Allgather in my code.
I realize that some answers to this question are likely to be implementation-dependent, so I'm using Open MPI 5.0.0a1.

In DX12 what Ordering Guarantees do multiple ExecuteCommandLists calls provide?

Assume a single-threaded application. If you call ExecuteCommandLists twice (A and B), is A guaranteed to execute all of its commands on the GPU before starting any of the commands from B? The closest thing I can find in the documentation is this, but it doesn't really seem to guarantee that A finishes before B starts:
Applications can submit command lists to any command queue from multiple threads. The runtime will perform the work of serializing these requests in the order of submission.
As a point of comparison, I know that this is explicitly not guaranteed in Vulkan:
vkQueueSubmit is a queue submission command, with each batch defined by an element of pSubmits as an instance of the VkSubmitInfo structure. Batches begin execution in the order they appear in pSubmits, but may complete out of order.
However, I'm not sure if DX12 works the same way.
Frank Luna's book says:
The command lists are executed in order starting with the first array element
However in that context he's talking about calling ExecuteCommandLists once with two command lists (C and D). Do these operate the same as two individual calls? My colleague argues that this still only guarantees that they are started in order, not that C finishes before D starts.
Is there more clear documentation somewhere I'm missing?
I asked the same question on the DirectX forums; here's an answer from Microsoft engineer Jesse Natalie:
Calling ExecuteCommandLists twice guarantees that the first workload (A) finishes before the second workload (B). Calling ExecuteCommandLists with two command lists allows the driver to merge the two command lists such that the second command list (D) may begin executing work before all work from the first (C) has finished.
Specifically, the application is allowed to insert a fence signal or wait between A and B, and the driver has no visibility into this, so the driver must ensure that everything in A is complete before the fence operation. There is no such opportunity in a single call to the API, so the driver can optimize that scenario.
Source:
http://forums.directxtech.com/index.php?topic=5975.0
Finally the ID3D12CommandQueue is a first-in first-out queue, that stores the correct order of the command lists for submission to the GPU. Only when one command list has completed execution on the GPU, will the next command list from the queue be submitted by the driver.
https://learn.microsoft.com/en-us/windows/win32/direct3d12/porting-from-direct3d-11-to-direct3d-12
This isn't correct. I believe DirectX 12 behaves the same as Vulkan here:
The specification states that commands start execution in-order, but complete out-of-order. Don't get confused by this. The fact that commands start in-order is simply convenient language to make the spec easier to write. Unless you add synchronization yourself, all commands in a queue execute out of order.
I've just run into this again. Command list A is not guaranteed to complete before command list B starts, and this creates race conditions:
A writes      A reads
   ↓             ↓
─────────────────────
   ↑             ↑
B writes      B reads
Edit: It turns out I was doing something stupid (calling CopyTextureRegion on two buffers), and this was causing a stall (which I could see in PIX), so the work for my next frame started during this stall, sometimes resulting in a race condition. Usually the commands for one frame complete before the next frame's start; if they don't, you will see a gap in PIX where no work is happening for the currently viewed frame's timings.

Creating a cuda stream on each host thread (multi-threaded CPU)

I have a multi-threaded CPU program and I would like each CPU thread to be able to launch a separate CUDA stream. The separate CPU threads will be doing different things at different times, so there is a chance they won't overlap, but if they do launch CUDA kernels at the same time I would like those kernels to run concurrently.
I'm pretty sure this is possible, because CUDA Toolkit documentation section 3.2.5.5 says: "A stream is a sequence of commands (possibly issued by different host threads)..."
So if I want to implement this, I would do something like:
void main(int CPU_ThreadID) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int *d_a;
    int *a;
    cudaMalloc((void**)&d_a, 100*sizeof(int));
    cudaMallocHost((void**)&a, 100*8*sizeof(int));
    cudaMemcpyAsync(d_a, &a[100*CPU_ThreadID], 100*sizeof(int), cudaMemcpyHostToDevice, stream);
    sum<<<100,32,0,stream>>>(d_a);
    cudaStreamDestroy(stream);
}
That is just a simple example. If I know there are only 8 CPU Threads then I know at most 8 streams will be created. Is this the proper way to do this? Will this run concurrently if two or more different host threads reach this code around the same time? Thanks for any help!
Edit:
I corrected some of the syntax issues in the code block and put in the cudaMemcpyAsync as sgar91 suggested.
It really looks to me like you are proposing a multi-process application, not multithreaded. You don't mention which threading architecture you have in mind, nor even an OS, but the threading architectures I know of don't posit a thread routine called "main", and you haven't shown any preamble to the thread code.
A multi-process environment will generally create one device context per process, which will inhibit fine-grained concurrency.
Even if that's just an oversight, I would point out that a multi-threaded application should establish a GPU context on the desired device before threads are spawned.
Each thread can then issue a cudaSetDevice(0); or similar call, which should cause each thread to pick up the established context on the indicated device.
Once that is in place, you should be able to issue commands to the desired streams from whichever threads you like.
You may wish to refer to the cudaOpenMP sample code. Although it omits the streams concepts, it demonstrates a multi-threaded app with the potential for multiple threads to issue commands to the same device (and it could be extended to the same stream).
Whether kernels happen to run concurrently after the above issues have been addressed is a separate question. Concurrent kernel execution has a number of requirements, and the kernels themselves must have compatible resource requirements (blocks, shared memory, registers, etc.), which generally implies "small" kernels.

Calls to GPU kernel from a multithreaded C++ application?

I'm re-implementing some sections of an image processing library that's multithreaded C++ using pthreads. I'd like to be able to invoke a CUDA kernel in every thread and trust the device itself to handle kernel scheduling, but I know better than to count on that behavior. Does anyone have any experience with this type of issue?
CUDA 4.0 made it much simpler to drive a single CUDA context from multiple threads - just call cudaSetDevice() to specify which CUDA device you want the thread to submit commands to.
Note that this is likely to be less efficient than driving the CUDA context from a single thread - unless the CPU threads have other work to keep them occupied between kernel launches, they are likely to get serialized by the mutexes that CUDA uses internally to keep its data structures consistent.
Perhaps CUDA streams are the solution to your problem. Try invoking kernels from a different stream in each thread. However, I don't see how this will help, as I think your kernel executions will be serialized even though they are invoked in parallel. In fact, CUDA kernel invocations, even on the same stream, are asynchronous by nature, so you can make any number of invocations from the same thread. I really don't understand what you are trying to achieve.