Vulkan samples: vkQueueSubmit always followed by vkWaitForFences?

In the API-Samples that ship with Vulkan, there always seems to be a call to vkWaitForFences after a call to vkQueueSubmit, either directly or via execute_queue_command_buffer (in util_init.hpp). The call to vkWaitForFences blocks CPU execution until the GPU has finished all the work in the previous vkQueueSubmit. In effect this prevents multiple frames from being constructed simultaneously, which (in theory) limits performance significantly.
Are these calls required, and if so, is there another way to not require the GPU to be idle before constructing a new frame?

The way we achieve multiple frames in flight is to have one fence per swapchain framebuffer. You still use vkWaitForFences, but you wait on the ((n + 1) % num_fences) fence.
There is example code here https://imgtec.com/tools/powervr-early-access-program/
uint32_t current_buffer = num_swaps_ % swapchain_fences.size();
vkQueueSubmit(graphics_queue, 1, &submit_info, swapchain_fences[current_buffer]);
// Wait for an earlier submit to finish so we never run more than
// swapchain_fences.size() - 1 frames ahead of the GPU.
if (num_swaps_ >= swapchain_fences.size() - 1)
{
    uint32_t fence_to_wait_for = (num_swaps_ + 1) % swapchain_fences.size();
    vkWaitForFences(device, 1, &swapchain_fences[fence_to_wait_for], VK_TRUE, UINT64_MAX);
    // Reset the fence we just waited on so the next submit that reuses it finds it unsignaled.
    vkResetFences(device, 1, &swapchain_fences[fence_to_wait_for]);
}
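For completeness, a minimal sketch of how the swapchain_fences used above could be created (a hypothetical helper; the device and the swapchain image count are assumed to come from your own setup). The fences are created unsignaled so the first submits can use them directly:
std::vector<VkFence> create_swapchain_fences(VkDevice device, uint32_t swapchain_image_count)
{
    // One fence per swapchain image, created unsignaled.
    std::vector<VkFence> fences(swapchain_image_count);
    VkFenceCreateInfo fence_info{};
    fence_info.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
    for (VkFence& fence : fences)
    {
        vkCreateFence(device, &fence_info, nullptr, &fence);
    }
    return fences;
}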

Which way to synchronize vkQueueSubmit() to use?

I have a function that copies data from one buffer to another, and I need to synchronize its execution.
Here is a first (admittedly bad) option:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
    // allocInfo, beginInfo, copyRegion and submitInfo are filled in elsewhere (omitted here)
    VkCommandBuffer commandBuffer;
    vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
    // Start recording
    vkBeginCommandBuffer(commandBuffer, &beginInfo);
    vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
    vkEndCommandBuffer(commandBuffer);
    // Run command buffer
    vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
    // Wait for completion
    vkQueueWaitIdle(graphicsQueue);
    vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
}
This option is bad because if I want to execute the copyBuffer() function several times, then all the buffers will be copied strictly one at a time.
I want to use a fence for each function call so that multiple calls can run in parallel.
So far, I have only come up with this solution:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
    VkCommandBuffer commandBuffer;
    vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
    // Create fence
    VkFenceCreateInfo fenceInfo{};
    fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
    fenceInfo.flags = VK_FENCE_CREATE_SIGNALED_BIT;
    VkFence executionCompleteFence = VK_NULL_HANDLE;
    if (vkCreateFence(logicalDevice, &fenceInfo, VK_NULL_HANDLE, &executionCompleteFence) != VK_SUCCESS) {
        throw MakeErrorInfo("Failed to create fence");
    }
    // Start recording
    vkBeginCommandBuffer(commandBuffer, &beginInfo);
    vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
    vkEndCommandBuffer(commandBuffer);
    // Run command buffer
    vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
    vkWaitForFences(logicalDevice, 1, &executionCompleteFence, VK_TRUE, UINT64_MAX);
    vkResetFences(logicalDevice, 1, &executionCompleteFence);
    vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
    vkDestroyFence(logicalDevice, executionCompleteFence, VK_NULL_HANDLE);
}
Which of these options is better?
Is the second option written correctly?
Both functions are bad in the same way. They both block the CPU from doing anything until the transfer is done. And they both could be used to potentially submit multiple CBs to the same queue in the same frame, but with different submit commands.
Neither is desirable if performance is something you care about.
Ultimately, what you need to do is have your copyBuffer function not actually perform the copy. You should have a function which builds a command buffer to do a copy. That CB is then stored in a place to be submitted later with other copying CBs. Or better yet, you can have just one copying CB that each command adds to (the first one called in a frame will create the CB).
At some point in the future, before you've submitted the work that will use this data, you need to submit the transfer operations. And the way this works depends on if you're submitting the transfer operations on the same queue as the work that will consume them or not.
If they're on the same queue, then all you need to do is have an event in a command buffer at the end of your batch that synchronizes the transfer operations with their receivers. If you want to be more clever, each transfer operation can have its own event, which the receiving operations will wait on.
And in same-queue transfers, you also want to make sure that you submit the transfers in the same vkQueueSubmit call as the rest of your work. Or to put it another way, you should never make more than one call to vkQueueSubmit for a particular queue in a particular frame.
If you're dealing with separate queues, then things change. A bit. If timeline semaphores aren't an option, you'll need to submit your transfer work before you submit the receiving operations. This is because the transfer batch will need to signal a semaphore that the receiving operation will wait on. And a binary semaphore cannot be waited on until the operation that signals it has been submitted to a queue.
But otherwise, everything else stays the same. Of course, you don't need events since you're synchronizing by semaphore.
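To make the deferred-recording idea concrete, here is a minimal sketch for the same-queue case. It is not the answer's exact method: it uses a single shared copy command buffer and a plain pipeline barrier instead of per-transfer events, and the names (pendingCopyCmd, transferCmdPool, logicalDevice, graphicsQueue) as well as the vertex-read destination stage are assumptions. Each copyBuffer-style call only records a copy; everything goes out in the frame's single vkQueueSubmit:
VkCommandBuffer pendingCopyCmd = VK_NULL_HANDLE; // shared per-frame copy command buffer

void recordCopy(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
    if (pendingCopyCmd == VK_NULL_HANDLE)
    {
        // First copy this frame: allocate and begin the shared command buffer.
        VkCommandBufferAllocateInfo allocInfo{};
        allocInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
        allocInfo.commandPool = transferCmdPool;
        allocInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
        allocInfo.commandBufferCount = 1;
        vkAllocateCommandBuffers(logicalDevice, &allocInfo, &pendingCopyCmd);

        VkCommandBufferBeginInfo beginInfo{};
        beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
        beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
        vkBeginCommandBuffer(pendingCopyCmd, &beginInfo);
    }
    VkBufferCopy copyRegion{ 0, 0, size };
    vkCmdCopyBuffer(pendingCopyCmd, srcBuffer, dstBuffer, 1, &copyRegion);
}

void submitFrame(VkCommandBuffer renderCmd, VkFence frameFence)
{
    // Make the transfer writes visible to later reads on the same queue, then close the CB.
    VkMemoryBarrier barrier{};
    barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT; // adjust to whatever consumes the data
    vkCmdPipelineBarrier(pendingCopyCmd, VK_PIPELINE_STAGE_TRANSFER_BIT,
                         VK_PIPELINE_STAGE_VERTEX_INPUT_BIT, 0,
                         1, &barrier, 0, nullptr, 0, nullptr);
    vkEndCommandBuffer(pendingCopyCmd);

    // One vkQueueSubmit for the whole frame: copies first, then the rendering work.
    VkCommandBuffer cmds[] = { pendingCopyCmd, renderCmd };
    VkSubmitInfo submitInfo{};
    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.commandBufferCount = 2;
    submitInfo.pCommandBuffers = cmds;
    vkQueueSubmit(graphicsQueue, 1, &submitInfo, frameFence);
    pendingCopyCmd = VK_NULL_HANDLE; // a real implementation would recycle it once frameFence signals
}
For the separate-queue case, the transfer batch would instead signal a (binary or timeline) semaphore that the consuming submit waits on, with queue family ownership transfers if the queue families differ.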
The two functions are semantically identical and block in exactly the same way.
The second is slightly better. vkQueueWaitIdle is more of a debugging and out-of-hotspot feature; it might incur a hidden second submit to signal the implicit fence.
You don't need to reset a fence that you subsequently destroy anyway. And you are creating it pre-signaled, which is a bug. Also, you forgot to pass it to vkQueueSubmit.
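For reference, a corrected sketch of option 2 along the lines of those points: the fence is created unsignaled, passed to vkQueueSubmit, and not reset before being destroyed (allocInfo, beginInfo, copyRegion and submitInfo are assumed to be set up as in the question). Note that it still blocks the CPU, so the batching approach from the first answer remains preferable:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
    VkCommandBuffer commandBuffer;
    vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
    // Create the fence unsignaled; the submit below will signal it.
    VkFenceCreateInfo fenceInfo{};
    fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
    VkFence executionCompleteFence = VK_NULL_HANDLE;
    if (vkCreateFence(logicalDevice, &fenceInfo, nullptr, &executionCompleteFence) != VK_SUCCESS) {
        throw MakeErrorInfo("Failed to create fence");
    }
    vkBeginCommandBuffer(commandBuffer, &beginInfo);
    vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
    vkEndCommandBuffer(commandBuffer);
    // Pass the fence to the submit so it actually gets signaled on completion.
    vkQueueSubmit(graphicsQueue, 1, &submitInfo, executionCompleteFence);
    vkWaitForFences(logicalDevice, 1, &executionCompleteFence, VK_TRUE, UINT64_MAX);
    // No vkResetFences needed: the fence is destroyed right away.
    vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
    vkDestroyFence(logicalDevice, executionCompleteFence, nullptr);
}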

Do I need dedicated fences/semaphores per swap chain image, per frame or per command pool in Vulkan?

I've read several articles on the CPU-GPU (using fences) and GPU-GPU (using semaphores) synchronization mechanisms, but I still have trouble understanding how I should implement a simple render loop.
Please take a look at the simple render() function below. If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
However, is this really safe? All operations are asynchronous. So, is it really safe to "reuse" the image_available semaphore in a subsequent call of render() even though the signal request from the previous call hasn't fired yet? I would think it's not, but, on the other hand, we're using the same queues (I don't know if it matters whether the graphics and presentation queues are actually the same) and operations inside a queue should be consumed sequentially ... But if I got it right, they might not be consumed "as a whole" and could be reordered ...
The second thing is that (again, unless I'm missing something) I clearly should use one fence per swap chain image to ensure that the operation on the image corresponding to the image_index of the call to render() has finished. But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR? And do I then need dedicated image_available and rendering_finished semaphores per swap chain image? Or maybe per frame? Or maybe per command buffer/pool? I'm really confused ...
void render()
{
    std::uint32_t image_index;
    switch (vkAcquireNextImageKHR(device(), swap_chain().handle(),
        std::numeric_limits<std::uint64_t>::max(), m_image_available, VK_NULL_HANDLE, &image_index))
    {
    case VK_SUBOPTIMAL_KHR:
    case VK_SUCCESS:
        break;
    case VK_ERROR_OUT_OF_DATE_KHR:
        on_resized();
        return;
    default:
        throw std::runtime_error("vkAcquireNextImageKHR");
    }

    static VkPipelineStageFlags constexpr wait_destination_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

    VkSubmitInfo submit_info{};
    submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit_info.waitSemaphoreCount = 1;
    submit_info.pWaitSemaphores = &m_image_available;
    submit_info.signalSemaphoreCount = 1;
    submit_info.pSignalSemaphores = &m_rendering_finished;
    submit_info.pWaitDstStageMask = &wait_destination_stage_mask;
    if (vkQueueSubmit(graphics_queue().handle, 1, &submit_info, VK_NULL_HANDLE) != VK_SUCCESS)
        throw std::runtime_error("vkQueueSubmit");

    VkPresentInfoKHR present_info{};
    present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    present_info.waitSemaphoreCount = 1;
    present_info.pWaitSemaphores = &m_rendering_finished;
    present_info.swapchainCount = 1;
    present_info.pSwapchains = &swap_chain().handle();
    present_info.pImageIndices = &image_index;
    switch (vkQueuePresentKHR(presentation_queue().handle, &present_info))
    {
    case VK_SUCCESS:
        break;
    case VK_ERROR_OUT_OF_DATE_KHR:
    case VK_SUBOPTIMAL_KHR:
        on_resized();
        return;
    default:
        throw std::runtime_error("vkQueuePresentKHR");
    }
}
EDIT: As suggested in the answers below, assume we have k "frames in flight" and hence k instances of the semaphores and the fence used in the code above, which I will denote by m_image_available[i], m_rendering_finished[i] and m_fence[i] for i = 0, ..., k - 1. Let i denote the current index of the frame in flight, which is increased by 1 after each invocation of render(), and j denote the number of invocations of render(), starting from j = 0.
Now, assume the swap chain contains three images.
- If j = 0, then i = 0 and the first frame in flight is using swap chain image 0.
- In the same way, if j = a, then i = a and the a-th frame in flight is using swap chain image a, for a = 1, 2.
- Now, if j = 3, then i = 3, but since the swap chain only has three images, the fourth frame in flight is using swap chain image 0 again. I wonder whether this is problematic or not. I guess it's not, since the wait/signal semaphores m_image_available[3]/m_rendering_finished[3], used in the calls of vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR in this invocation of render(), are dedicated to this particular frame in flight.
- If we reach j = k, then i = 0 again, since there are only k frames in flight. Now we potentially wait at the beginning of render(), if the call to vkQueuePresentKHR from the first invocation (i = 0) of render() hasn't signaled m_fence[0] yet.
So, besides my doubts described in the third bullet point above, the only question that remains is why I shouldn't take k as large as possible. What I could theoretically imagine is that if we submit work to the GPU faster than the GPU is able to consume it, the queue(s) might grow continually and eventually overflow (is there some kind of "max commands in queue" limit?).
If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
Yes, you got it right. You submit the desire to get a new image to render into via vkAcquireNextImageKHR. The presentation engine will signal the m_image_available semaphore as soon as an image to render into has become available. But you have already submitted the instruction.
Next, you submit some commands to the graphics queue via submit_info. I.e. they are also already submitted to the GPU and wait there until the m_image_available semaphore receives its signal.
Furthermore, a presentation instruction is submitted to the presentation engine that expresses the dependency that it needs to wait until the submit_info-commands have completed by waiting on the m_rendering_finished semaphore.
I.e. everything has been submitted. If nothing has been signalled yet, everything just sits there in some GPU buffers and waits for signals.
Now, if your code loops right back into the render() function and re-uses the same m_image_available and m_rendering_finished semaphores, it will only work if you are very lucky, namely if all the semaphores have already been signalled before you use them again.
The specifications says the following for vkAcquireNextImageKHR:
If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending
and furthermore, it says under 7.4.2. Semaphore Waiting
the act of waiting for a binary semaphore also unsignals that semaphore.
I.e. indeed, you need to wait on the CPU until you know for sure that the previous vkAcquireNextImageKHR that uses the same m_image_available semaphore has completed.
And yes, you already got it right: you need to use a fence for that, which you pass to vkQueueSubmit. If you do not synchronize on the CPU, you'll shovel ever more work onto the GPU (which is a problem) and the semaphores that you are re-using might not get properly unsignalled in time (which is a problem).
What is often done is that the semaphores and fences are multiplied, e.g. to 3 each, and these sets of synchronization objects are used in sequence, so that more work can be parallelized on the GPU. The Vulkan Tutorial describes this quite nicely in its Rendering and presentation chapter. It is also explained with animation in this lecture starting at 7:59.
So first of all, as you mentioned correctly, semaphores are strictly for GPU-GPU synchronization, e.g. to make sure that one batch of commands (one submit) has finished before another one starts. This is here used to synchronize the rendering commands with the present command such that the presenting engine knows when to present the rendered image.
Fences are the main utility for CPU-GPU synchronization. You place a fence in a queue submit and then on the CPU side wait for it before you want to proceed. This is usually done here such that we do not queue any new rendering/present commands while the previous frame hasn't finished.
But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR?
Yes, you definitely need this in your code, otherwise your semaphores would not be safe and you would probably get validation errors.
In general, if you want your CPU to wait until your GPU has finished rendering of the previous frame, you would have only a single fence and a single pair of semaphores. You could also replace the fence by a waitIdle command of the queue or device.
However, in practice you do not want to stall the CPU; instead, you record commands for the next frame in the meantime. This is done via frames in flight. This simply means that for every frame in flight (i.e. every frame that can be recorded in parallel with execution on the GPU), you have one fence and one pair of semaphores which synchronize that particular frame.
So, in essence, for your render loop to work properly you need a pair of semaphores plus a fence per frame in flight, independent of the number of swapchain images. However, do note that the current frame index (frame in flight) and the image index (swapchain) will generally not be the same, unless you use the same number of swapchain images as frames in flight. This is because the presentation engine might give you swapchain images out of order, depending on your present mode.
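As a minimal sketch of that per-frame-in-flight pattern (names such as MAX_FRAMES_IN_FLIGHT, in_flight_fence, device, swapchain and the command recording are placeholders; error and resize handling are omitted):
static constexpr uint32_t MAX_FRAMES_IN_FLIGHT = 2;
VkSemaphore image_available[MAX_FRAMES_IN_FLIGHT];     // one pair of semaphores ...
VkSemaphore rendering_finished[MAX_FRAMES_IN_FLIGHT];
VkFence     in_flight_fence[MAX_FRAMES_IN_FLIGHT];     // ... and one fence per frame in flight,
uint32_t    current_frame = 0;                         //     fences created with VK_FENCE_CREATE_SIGNALED_BIT

void render_frame()
{
    // Block until this frame slot's previous submission has finished,
    // so its semaphores (and command buffers) are safe to reuse.
    vkWaitForFences(device, 1, &in_flight_fence[current_frame], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &in_flight_fence[current_frame]);

    uint32_t image_index;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          image_available[current_frame], VK_NULL_HANDLE, &image_index);

    // Record and submit the frame's command buffer here, waiting on
    // image_available[current_frame], signaling rendering_finished[current_frame],
    // and passing in_flight_fence[current_frame] to vkQueueSubmit.
    // vkQueuePresentKHR then waits on rendering_finished[current_frame].

    current_frame = (current_frame + 1) % MAX_FRAMES_IN_FLIGHT;
}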

WSI synchronization subpass dependency and link to color attachment output

I think I understand how Vulkan synchronization works, but I have a problem understanding the synchronization with the WSI.
In the Synchronization Examples, we can find this code:
/* Only need a dependency coming in to ensure that the first
   layout transition happens at the right time.
   Second external dependency is implied by having a different
   finalLayout and subpass layout. */
VkSubpassDependency dependency = {
    .srcSubpass = VK_SUBPASS_EXTERNAL,
    .dstSubpass = 0,
    // .srcStageMask needs to be a part of pWaitDstStageMask in the WSI semaphore.
    .srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .srcAccessMask = 0,
    .dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dependencyFlags = 0};
According to me, it should be something like that :
VkSubpassDependency dependency = {
    .srcSubpass = VK_SUBPASS_EXTERNAL,
    .dstSubpass = 0,
    // .srcStageMask needs to be a part of pWaitDstStageMask in the WSI semaphore.
    .srcStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .srcAccessMask = 0,
    .dstAccessMask = 0,
    .dependencyFlags = 0};
Indeed, since we are going to WRITE into the attachment, there is no need to use a WRITE_BIT (meaning make writes available) in the dstAccessMask.
But the true issue is in the srcStageMask.
I understand why the dstStageMask is COLOR_ATTACHMENT_OUTPUT_BIT. It is because it is okay to have the prior stages working since we don't touch the attachment.
However, for the srcStageMask, I did not see anything about the link between WSI and the COLOR_ATTACHMENT_OUTPUT_BIT. To me, the layout transition must appear at the end of the presentation, and just before the beginning of the COLOR_ATTACHMENT_OUTPUT stage.
And the end of presentation, for me, should be represented by BOTTOM_OF_PIPE and not COLOR_ATTACHMENT_OUTPUT
Where am I mistaken?
Indeed, since we are going to WRITE into the attachment, there is no need to use a WRITE_BIT (meaning make writes available) in the dstAccessMask.
I am not exactly sure of your logic here. It should be WRITE exactly because we are going to write to the attachment.
It would perhaps help to describe what is happening here (from krOoze/Hello_Triangle/doc):
The VkSubpassDependency chains off the semaphore wait (via pWaitDstStageMask). Then it performs its layout transition. Then it syncs with the Load Op. (Then the Load Op happens in the subpass, and the subpass vkCmdDraws some stuff into the swapchain image.)
Now, presumably (as is typical for the first use of a swapchain image), our Load Op is VK_ATTACHMENT_LOAD_OP_CLEAR. That means VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT + VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT access.
So:
Presentation is gonna read the swapchain image, and submit a semaphore signal.
We chain our VkSubpassDependency to the signal (via pWaitDstStageMask). Semaphore signal already covers all memory accesses, therefore our srcAccessMask = 0.
The Dependency performs its implicit dependency to the Layout Transition (takes your src, and invents some internal dst), then Layout Transition happens, which reads the image and writes it back, then the dependency performs another implicit Dependency (invents some src that covers the layout transition, and uses your dst).
The Load Op happens in the subpass, and you have to make sure explicitly it happens-after everything above. So your dst in your Dependency must be VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT + VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT to cover the Load Op.
Now, there are three types of memory hazards: read→write, write→write, and write→read (read→read is a non-hazard). In a write→write hazard, something writes a resource while some other write also writes it. It is not clear which write happens last, so the memory contents become garbage. In read→write and write→read hazards, the read and the write may happen at the same time. It is not clear whether the read sees the unmodified memory or the written one (i.e. it may read garbage).
Without an explicit dependency, the Layout Transition and the subsequent Load Op form a write→write hazard. Therefore dstAccessMask must not be 0 to resolve the hazard and make sure one write happens-before the second one.
(It is perhaps worth noting we introduce the VkSubpassDependency solely for the sake of the layout transition. Otherwise the semaphore wait would already be all that is needed. The default is srcStageMask = TOP_OF_PIPE, which means without explicit Dependency the layout transition could happen before the semaphore wait, i.e. before presentation finishes reading it; that would be a read→write hazard).
To me, the layout transition must appear at the end of the presentation, and just before the beginning of the COLOR_ATTACHMENT_OUTPUT stage. And the end of presentation, for me, should be represented by BOTTOM_OF_PIPE and not COLOR_ATTACHMENT_OUTPUT
We have a little bit of choice here: pWaitDstStageMask = dependency.srcStageMask = ?
Now our situation is this:
vkBeginCommandBuffer();
[possibly vkCmdDispatch()?]
vkCmdBeginRenderPass();
vkCmdDraw(); // does vertex shading + fragment shading, then color writes
vkCmdEndRenderPass();
vkEndCommandBuffer();
vkQueueSubmit(.pWaitDstStageMask = ?);
If we use VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, then the semaphore wait does not block starting the hypothetical vkCmdDispatch() (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT). And the Subpass Dependency src does not force it to finish either. Great!
The stage used should not be any earlier than VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT. E.g. with VK_PIPELINE_STAGE_ALL_COMMANDS the semaphore wait would unnecessarily block the vkCmdDispatch() as well as the vertex and fragment shaders of the vkCmdDraw(). Meanwhile, we need the swapchain image only when we actually need to write it at the Load Op (which you can imagine happens when the vkCmdDraw() gets to the point where it needs to perform color writes).
Now, we could choose VK_PIPELINE_STAGE_BOTTOM_OF_PIPE. The semaphore then blocks nothing (dstStage/pWaitDstStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE means the same as "nothing"). Great! But the Subpass Dependency now blocks everything (srcStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE means the same as VK_PIPELINE_STAGE_ALL_COMMANDS). That means our vkCmdDispatch() has to finish before our subpass starts. Not so great...
So, the best practice is to use pWaitDstStageMask = dependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT.
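In code, that best practice would look roughly like this on the submit side (hypothetical names; the command buffer is assumed to contain the render pass with the VkSubpassDependency shown at the top of the question):
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT; // matches dependency.srcStageMask

VkSubmitInfo submitInfo{};
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.waitSemaphoreCount = 1;
submitInfo.pWaitSemaphores = &imageAvailableSemaphore;  // signaled by vkAcquireNextImageKHR
submitInfo.pWaitDstStageMask = &waitStage;              // earlier work (e.g. the vkCmdDispatch) is not blocked
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &commandBuffer;
submitInfo.signalSemaphoreCount = 1;
submitInfo.pSignalSemaphores = &renderFinishedSemaphore;
vkQueueSubmit(graphicsQueue, 1, &submitInfo, frameFence);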

How many fences are necessary to support multiple frames in flight each using multiple command queues in DirectX 12?

Let's say I have a frame that uses 2 copy queues, 1 graphics queue and 1 compute queue, in this order:
1) Upload data from CPU to GPU using 1st copy queue at the beginning of the frame (mesh vertices and such). That will be ExecuteCommandLists on 1st copy queue then SignalFence.
2) Build a ray tracing acceleration structure on async compute queue. WaitFence to wait for data we just uploaded, then ExecuteCommandLists to build accel. structure, then SignalFence.
3) WaitFence on graphics queue to wait for AS build then ExecuteCommandLists to render the frame. Then issue another SignalFence
4) WaitFence then ExecuteCommandLists on 2nd copy queue to perform data readback (GPU -> CPU), let's say to get terrain and physics back to the CPU. Then we call the final SignalFence for the frame.
Now, I want to have 3 frames buffered at all times to avoid CPU/GPU bubbles when no work is performed.
What would be the correct fence setup to achieve this?
So far I have implemented 2 variants, one of which should work (unless I'm completely wrong) but doesn't, and the second works, but I'm not sure why. Please help me figure it out.
1) Have 2 fences (A and B) for all of the frames and queues:
For 1st frame:
CopyQueue1.ExecuteCommands();
CopyQueue1.SignalFence(A, 1);
AsyncComputeQueue.Wait(A, 1);
AsyncComputeQueue.ExecuteCommands();
AsyncComputeQueue.Signal(A, 2);
GraphicsQueue.Wait(A, 2);
GraphicsQueue.ExecuteCommands();
GraphicsQueue.Signal(A, 3);
CopyQueue2.Wait(A, 3);
CopyQueue2.ExecuteCommands();
CopyQueue2.Signal(B, 1);
Same thing for the next frames, except that the values for A and B are incremented: 4, 5, 6 and 7, 8, 9 for A in frames 2 and 3, and values 2 and 3 for B in frames 2 and 3 respectively.
At the end of render loop I perform a check to keep maximum of 3 frames in flight:
if (CurrentFrameBValue - B.SignalledValue() >= 3)
{
    StallCurrentCPUThread();
}
ReleaseCommandListsForThisFrame();
// GoToNextRenderLoop
This code has an issue where B is signaled very quickly, so I do not stall the CPU, proceed to reset the command lists for the corresponding frame, and get a debug layer error that says I was resetting command lists while the GPU was still using them.
As I understand it, all work submitted to the GPU is guaranteed to be performed in submission order. So I expect the fences to advance as follows: A to 1, 2, 3, then B to 1, then A to 4, 5, 6, then B to 2, and so forth. Why is B signaled before all work for the frame is done?
2) The approach that doesn't emit errors: have 4 fences, one for each queue (A, B, C, D), and increment their values by one each frame, as we did for B in case 1.
One reason I can see for the 1st case failing is that the work on the GPU is not really done in the order I expect, and fence A can be signaled in an unpredictable order, messing up dependencies, while the 2nd case has separate fences for each queue.
I should also note that I don't have dependencies between frames: CopyQueue1 does not depend on CopyQueue2 via fences; I ensure correctness by keeping no more than 3 frames in flight with the CPU stalling shown above.
Any thoughts?
I believe the problem was in using one fence for three different queues. Let's look at case 1: Copy1(Frame 1) -> AsyncCompute(Frame 1) -> Graphics(Frame 1) -> Copy2(Frame 1), then Copy1(Frame 2) -> AsyncCompute(Frame 2) -> Graphics(Frame 2) -> Copy2(Frame 2), all with the same fence object but different values.
In my case, I believe, Copy1(Frame 2) was done before AsyncCompute(Frame 1) or even Graphics(Frame 1) (it doesn't matter which), and because the fence value it signals is higher than anything expected in frame 1, it messed up frame 1's dependencies and started Copy2(Frame 1) too early, which led to the frame finishing and the command lists being reset while the async compute and/or graphics work was actually still running.
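For reference, a sketch of the working variant 2 in code: one fence per queue, each incremented once per frame, plus the CPU-side throttle to at most 3 frames in flight. The names (fenceCopy1, fenceCompute, fenceGraphics, fenceCopy2, the queue and command-list variables, frameEvent) are placeholders and error handling is omitted:
void SubmitFrame(UINT64 frame) // frame starts at 1 and increases every frame
{
    copyQueue1->ExecuteCommandLists(1, uploadLists);     // 1) CPU -> GPU upload
    copyQueue1->Signal(fenceCopy1, frame);

    computeQueue->Wait(fenceCopy1, frame);               // 2) build AS after the upload
    computeQueue->ExecuteCommandLists(1, asBuildLists);
    computeQueue->Signal(fenceCompute, frame);

    graphicsQueue->Wait(fenceCompute, frame);            // 3) render after the AS build
    graphicsQueue->ExecuteCommandLists(1, renderLists);
    graphicsQueue->Signal(fenceGraphics, frame);

    copyQueue2->Wait(fenceGraphics, frame);              // 4) GPU -> CPU readback
    copyQueue2->ExecuteCommandLists(1, readbackLists);
    copyQueue2->Signal(fenceCopy2, frame);               // end-of-frame fence

    // Keep at most 3 frames in flight: block the CPU until frame N-3 has fully completed.
    if (frame > 3 && fenceCopy2->GetCompletedValue() < frame - 3)
    {
        fenceCopy2->SetEventOnCompletion(frame - 3, frameEvent);
        WaitForSingleObject(frameEvent, INFINITE);
    }
    // Now it is safe to reset the command allocators/lists used by frame N-3.
}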

CUDA streams not running in parallel

Given this code:
void foo(cv::gpu::GpuMat const &src, cv::gpu::GpuMat *dst[], cv::Size const dst_size[], size_t numImages)
{
    cudaStream_t streams[numImages];
    for (size_t image = 0; image < numImages; ++image)
    {
        cudaStreamCreateWithFlags(&streams[image], cudaStreamNonBlocking);
        dim3 Threads(32, 16);
        dim3 Blocks((dst_size[image].width + Threads.x - 1) / Threads.x,
                    (dst_size[image].height + Threads.y - 1) / Threads.y);
        myKernel<<<Blocks, Threads, 0, streams[image]>>>(src, dst[image], dst_size[image]);
    }
    for (size_t image = 0; image < numImages; ++image)
    {
        cudaStreamSynchronize(streams[image]);
        cudaStreamDestroy(streams[image]);
    }
}
Looking at the output of nvvp, I see almost perfectly serial execution, even though the first stream is a lengthy process that the others should be able to overlap with.
Note that my kernel uses 30 registers, and all report an "Achieved Occupancy" of around 0.87. For the smallest image, Grid Size is [10,15,1] and Block Size [32, 16,1].
The conditions describing the limits for concurrent kernel execution are given in the CUDA programming guide (link), but the gist of it is that your GPU can potentially run multiple kernels from different streams only if your GPU has sufficient resources to do so.
In your usage case, you have said that you are running multiple launches of a kernel with 150 blocks of 512 threads each. Your GPU has 12 SMMs (I think), and you could have up to 4 blocks per SMM running concurrently (4 * 512 = 2048 threads, which is the SMM limit). So your GPU can only run a maximum of 4 * 12 = 48 blocks concurrently. With multiple launches of 150 blocks sitting in the command pipeline, it would seem that there is little (perhaps even no) opportunity for concurrent kernel execution.
You might be able to encourage kernel execution overlap if you increase the scheduling granularity of your kernel by reducing the block size. Smaller blocks are more likely to find available resources and scheduling slots than larger blocks. Similarly, reducing the total block count per kernel launch (probably by increasing the parallel work per thread) might also help increase the potential for overlap or concurrent execution of multiple kernels.
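As an illustration of that last suggestion (fewer blocks, more work per thread), here is a grid-stride-loop sketch of a hypothetical element-wise kernel; myKernel's real signature is not shown in the question, so this is a pattern rather than a drop-in replacement:
// Hypothetical per-pixel kernel written as a grid-stride loop, so it can be
// launched with a small, fixed grid and still cover the whole image.
__global__ void myKernelStrided(const unsigned char *src, unsigned char *dst, int width, int height)
{
    int total = width * height;
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < total;
         idx += blockDim.x * gridDim.x)          // each thread strides over many pixels
    {
        dst[idx] = src[idx];                     // placeholder for the real per-pixel work
    }
}

// Launching with e.g. 12 blocks of 128 threads per stream leaves resources free
// for kernels from other streams to run concurrently:
// myKernelStrided<<<12, 128, 0, streams[image]>>>(srcPtr, dstPtr, w, h);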