How many fences are necessary to support multiple frames in flight each using multiple command queues in DirectX 12? - directx-12

Let's say I have a frame that uses 2 copy queues, 1 graphics queue and 1 compute queue, in this order:
1) Upload data from the CPU to the GPU using the 1st copy queue at the beginning of the frame (mesh vertices and such). That is an ExecuteCommandLists on the 1st copy queue, then a SignalFence.
2) Build a ray tracing acceleration structure on the async compute queue: WaitFence to wait for the data we just uploaded, then ExecuteCommandLists to build the acceleration structure, then SignalFence.
3) WaitFence on the graphics queue to wait for the AS build, then ExecuteCommandLists to render the frame, then issue another SignalFence.
4) WaitFence, then ExecuteCommandLists on the 2nd copy queue to perform data readback (GPU -> CPU), say to get terrain and physics data back to the CPU. Then we call the final SignalFence for the frame.
Now, I want to have 3 frames buffered at all times to avoid CPU/GPU bubbles when no work is performed.
What would be the correct fence setup to achieve this?
So far I have implemented 2 variants: the first should work (unless I'm completely wrong) but doesn't, and the second works, but I'm not sure why. Please help me figure it out.
1) Have 2 fences (A and B) for all of the frames and queues:
For 1st frame:
// 1) Upload CPU -> GPU on the 1st copy queue
CopyQueue1.ExecuteCommands();
CopyQueue1.SignalFence(A, 1);
// 2) Build the acceleration structure on the async compute queue
AsyncComputeQueue.Wait(A, 1);
AsyncComputeQueue.ExecuteCommands();
AsyncComputeQueue.Signal(A, 2);
// 3) Render on the graphics queue
GraphicsQueue.Wait(A, 2);
GraphicsQueue.ExecuteCommands();
GraphicsQueue.Signal(A, 3);
// 4) Readback GPU -> CPU on the 2nd copy queue
CopyQueue2.Wait(A, 3);
CopyQueue2.ExecuteCommands();
CopyQueue2.Signal(B, 1);
Same thing for the next frames, except that the values for A and B are incremented: A takes 4, 5, 6 in frame 2 and 7, 8, 9 in frame 3, and B takes 2 and 3 in frames 2 and 3 respectively.
At the end of render loop I perform a check to keep maximum of 3 frames in flight:
if (CurrentFrameBValue - B.SignalledValue() >= 3)
{
    StallCurrentCPUThread();
}
ReleaseCommandListsForThisFrame();
// GoToNextRenderLoop
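
For reference, here is a minimal sketch of how such a throttle typically maps onto the D3D12 fence API; the function and variable names (ThrottleCPU, frameEndFence, fenceEvent) are illustrative, not the question's actual code:

#include <d3d12.h>
#include <windows.h>

// Hypothetical throttle: keep at most kMaxFramesInFlight frames queued ahead of the GPU.
// frameEndFence plays the role of fence B above; currentFrameBValue is the value the
// current frame will signal on it; fenceEvent is a Win32 event created with CreateEvent.
constexpr UINT64 kMaxFramesInFlight = 3;

void ThrottleCPU(ID3D12Fence* frameEndFence, UINT64 currentFrameBValue, HANDLE fenceEvent)
{
    // GetCompletedValue() returns the last value the GPU has signalled on this fence.
    if (currentFrameBValue > kMaxFramesInFlight &&
        frameEndFence->GetCompletedValue() < currentFrameBValue - kMaxFramesInFlight)
    {
        // Block this thread until the oldest in-flight frame has retired.
        frameEndFence->SetEventOnCompletion(currentFrameBValue - kMaxFramesInFlight, fenceEvent);
        WaitForSingleObject(fenceEvent, INFINITE);
    }
}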
This code has an issue: B gets signaled very quickly, so I never stall the CPU, I proceed to reset the command lists for the corresponding frame, and I get a debug layer error saying I was resetting command lists while the GPU was still using them.
As I understand it, all work submitted to the GPU is guaranteed to be performed in submission order. So I expect the fences to advance as follows: A to 1, 2, 3, then B to 1, then A to 4, 5, 6, then B to 2, and so forth. Why is B signaled before all work for the frame is done?
2) The approach that does not emit errors: have 4 fences, one per queue (A, B, C, D), and increment each fence's value by one every frame, as we did for B in case 1.
One reason I can see for the 1st case failing is that the GPU work is not actually done in the order I expect, and fence A can be signaled in an unpredictable order, messing up the dependencies, whereas the 2nd case has a separate fence for each queue.
I should also note that I don't have dependencies between frames: CopyQueue1 does not depend on CopyQueue2 via fences; I ensure correctness by keeping no more than 3 frames in flight with the CPU stall shown above.
Any thoughts?

I believe the problem was in using 1 fence for 3 different queues. Let's look at case 1: Copy1(Frame 1) -> AsyncCompute(Frame 1) -> Graphics(Frame 1) -> Copy2(Frame 1), then Copy1(Frame 2) -> AsyncCompute(Frame 2) -> Graphics(Frame 2) -> Copy2(Frame 2), all with the same fence object but different values.
In my case, I believe, Copy1(Frame 2) completed before AsyncCompute(Frame 1) or even Graphics(Frame 1) (it doesn't matter which), and because the fence value it signals is higher than anything expected in frame 1, it satisfied frame 1's waits and started Copy2(Frame 1) too early, which led to the frame finishing and its command lists being reset while the async compute and/or graphics work was actually still running.
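
For what it's worth, here is a minimal sketch of the working case 2 layout using the real ID3D12CommandQueue::Signal/Wait API; the queue, fence and command-list variable names are illustrative, not the poster's code:

// One fence per queue (the A, B, C, D of case 2); every value advances once per frame,
// so each wait targets a value that only the matching stage of the same frame signals.
// copyQueue1..copyQueue2: ID3D12CommandQueue*; uploadFence..readbackFence: ID3D12Fence*;
// uploadLists etc.: arrays of ID3D12CommandList* recorded for this frame.
void SubmitFrame(UINT64 frame)
{
    copyQueue1->ExecuteCommandLists(1, uploadLists);
    copyQueue1->Signal(uploadFence, frame);            // fence A reaches 'frame'

    asyncComputeQueue->Wait(uploadFence, frame);       // wait for this frame's upload
    asyncComputeQueue->ExecuteCommandLists(1, buildLists);
    asyncComputeQueue->Signal(buildFence, frame);      // fence B reaches 'frame'

    graphicsQueue->Wait(buildFence, frame);            // wait for this frame's AS build
    graphicsQueue->ExecuteCommandLists(1, renderLists);
    graphicsQueue->Signal(renderFence, frame);         // fence C reaches 'frame'

    copyQueue2->Wait(renderFence, frame);              // wait for this frame's rendering
    copyQueue2->ExecuteCommandLists(1, readbackLists);
    copyQueue2->Signal(readbackFence, frame);          // fence D: the whole frame has retired
}

Because no fence is shared between stages of different frames, a later frame's signal can never satisfy an earlier frame's wait, which is exactly the failure mode described above.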

Related

Cuda Multi-GPU Latency

I'm new to CUDA and I'm trying to analyse the performance of two GPUs (RTX 3090; 48GB vRAM) in parallel. The issue I face is that, for the simple block of code shown below, I would expect the overall block to complete in the same time regardless of the presence of the Device 2 code, as the copies run asynchronously on different streams.
// aHost, bHost, cHost, dHost are pinned memory. All arrays are of the same length.
for (int i = 0; i < 2; i++) {
    // ---------- Device 1 code -----------
    cudaSetDevice(0);
    cudaMemcpyAsync(aDest, aHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(bDest, bHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);

    // ---------- Device 2 code -----------
    cudaSetDevice(1);
    cudaMemcpyAsync(cDest, cHost, N * sizeof(float), cudaMemcpyHostToDevice, stream2);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
}
But alas, when I run the block, the Device 1 code alone takes 80 ms, but adding the Device 2 code adds 20 ms, bringing the block's execution time to 100 ms. I tried profiling the above code and observed the following:
Device 1 + Device 2 Concurrently (image)
When I run Device 1 alone though, I get the below profile:
Device 1 alone (image)
I can see that the initial HtoD copy for Device 1 is extended in duration when I add Device 2, and I'm not sure why this is happening, because as far as I can tell these processes are running independently, on different GPUs.
I realise that I haven't created any separate CPU threads to handle the separate devices, but I'm not sure if that would help. Could someone please help me understand why this elongation of duration happens when I add the Device 2 code?
EDIT:
Tried profiling the code, and expected the execution durations to be independent of the GPU, although I realise cudaMemcpyAsync involves the host as well, and perhaps the addition of Device 2 puts more stress on the CPU as it now has to handle additional transfers...?
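
One way to test the hypothesis from the edit (the host being the shared bottleneck) is to drive each device from its own CPU thread; a minimal sketch, assuming the same streams and pinned buffers as in the question:

#include <cuda_runtime.h>
#include <thread>

// Assumed to be set up exactly as in the question's code.
extern float *aDest, *bDest, *cDest;
extern float *aHost, *bHost, *cHost;
extern size_t N;
extern cudaStream_t stream1, stream2;

static void copyDevice1()
{
    cudaSetDevice(0);   // the device selection is per host thread
    cudaMemcpyAsync(aDest, aHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(bDest, bHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaStreamSynchronize(stream1);
}

static void copyDevice2()
{
    cudaSetDevice(1);
    cudaMemcpyAsync(cDest, cHost, N * sizeof(float), cudaMemcpyHostToDevice, stream2);
    cudaStreamSynchronize(stream2);
}

void runIteration()
{
    // Each device gets its own host thread, so enqueueing and synchronizing the copies
    // for one device no longer shares a single CPU thread with the other device.
    std::thread t1(copyDevice1);
    std::thread t2(copyDevice2);
    t1.join();
    t2.join();
}

If the elongation persists even with separate threads, shared host memory/PCIe bandwidth is the more likely explanation than CPU-side serialization.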

Do I need dedicated fences/semaphores per swap chain image, per frame or per command pool in Vulkan?

I've read several articles on the CPU-GPU (using fences) and GPU-GPU (using semaphores) synchronization mechanisms, but still got trouble to understand how I should implement a simple render-loop.
Please take a look at the simple render() function below. If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
However, is this really safe? All operations are asynchronous, so is it really safe to "reuse" the image_available semaphore in a subsequent call of render() even though the signal request from the previous call hasn't fired yet? I would think it's not, but on the other hand we're using the same queues (I don't know whether it matters that the graphics and presentation queue are actually the same), and operations inside a queue should be consumed sequentially... But if I got it right, they might not be consumed "as a whole" and could be reordered...
The second thing is that (again, unless I'm missing something) I clearly should use one fence per swap chain image to ensure that the operation on the image corresponding to the image_index of the call to render() has finished. But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
    throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR? And do I then need dedicated image_available and rendering_finished semaphores per swap chain image? Or maybe per frame? Or maybe per command buffer/pool? I'm really confused ...
void render()
{
    std::uint32_t image_index;
    switch (vkAcquireNextImageKHR(device(), swap_chain().handle(),
            std::numeric_limits<std::uint64_t>::max(), m_image_available, VK_NULL_HANDLE, &image_index))
    {
    case VK_SUBOPTIMAL_KHR:
    case VK_SUCCESS:
        break;
    case VK_ERROR_OUT_OF_DATE_KHR:
        on_resized();
        return;
    default:
        throw std::runtime_error("vkAcquireNextImageKHR");
    }

    static VkPipelineStageFlags constexpr wait_destination_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    VkSubmitInfo submit_info{};
    submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit_info.waitSemaphoreCount = 1;
    submit_info.pWaitSemaphores = &m_image_available;
    submit_info.signalSemaphoreCount = 1;
    submit_info.pSignalSemaphores = &m_rendering_finished;
    submit_info.pWaitDstStageMask = &wait_destination_stage_mask;
    if (vkQueueSubmit(graphics_queue().handle, 1, &submit_info, VK_NULL_HANDLE) != VK_SUCCESS)
        throw std::runtime_error("vkQueueSubmit");

    VkPresentInfoKHR present_info{};
    present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    present_info.waitSemaphoreCount = 1;
    present_info.pWaitSemaphores = &m_rendering_finished;
    present_info.swapchainCount = 1;
    present_info.pSwapchains = &swap_chain().handle();
    present_info.pImageIndices = &image_index;
    switch (vkQueuePresentKHR(presentation_queue().handle, &present_info))
    {
    case VK_SUCCESS:
        break;
    case VK_ERROR_OUT_OF_DATE_KHR:
    case VK_SUBOPTIMAL_KHR:
        on_resized();
        return;
    default:
        throw std::runtime_error("vkQueuePresentKHR");
    }
}
EDIT: As suggested in the answers below, assume we have k "frames in flight" and hence k instances of the semaphores and the fence used in the code above, which I will denote by m_image_available[i], m_rendering_finished[i] and m_fence[i] for i = 0, ..., k - 1. Let i denote the current index of the frame in flight, which is increased by 1 after each invocation of render(), and j denote the number of invocations of render(), starting from j = 0.
Now, assume the swap chain contains three images.
If j = 0, then i = 0 and the first frame in flight is using swap chain image 0
In the same way, if j = a, then i = a and that frame in flight is using swap chain image a, for a = 1, 2.
Now, if j = 3, then i = 3, but since the swap chain only has three images, the fourth frame in flight is using swap chain image 0 again. I wonder whether this is problematic or not. I guess it's not, since the wait/signal semaphores m_image_available[3]/m_rendering_finished[3], used in the calls to vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR in this invocation of render(), are dedicated to this particular frame in flight.
If we reach j = k, then i = 0 again, since there are only k frames in flight. Now we potentially wait at the beginning of render(), if the call to vkQueuePresentKHR from the first invocation (i = 0) of render() hasn't signaled m_fence[0] yet.
So, besides my doubts described in the third bullet point above, the only question which remains is why I shouldn't take k as large as possible? What I theoretically could imagine is that if we are submitting work to the GPU in a quicker fashion than the GPU is able to consume, the used queue(s) might continually grow and eventually overflow (is there some kind of "max commands in queue" limit?).
If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
Yes, you got it right. You submit the desire to get a new image to render into via vkAcquireNextImageKHR. The presentation engine will signal the m_image_available semaphore as soon as an image to render into has become available. But you have already submitted the instruction.
Next, you submit some commands to the graphics queue via submit_info. I.e. they are also already submitted to the GPU and wait there until the m_image_available semaphore receives its signal.
Furthermore, a presentation instruction is submitted to the presentation engine that expresses the dependency that it needs to wait until the submit_info-commands have completed by waiting on the m_rendering_finished semaphore.
I.e. everything has been submitted. If nothing has been signalled yet, everything just sits there in some GPU buffers and waits for signals.
Now, if your code loops right back into the render() function and re-uses the same m_image_available and m_rendering_finished semaphores, it will only work if you are very lucky, namely if all the semaphores have already been signalled before you use them again.
The specification says the following for vkAcquireNextImageKHR:
If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending
and furthermore, it says under 7.4.2. Semaphore Waiting
the act of waiting for a binary semaphore also unsignals that semaphore.
I.e. indeed, you need to wait on the CPU until you know for sure that the previous vkAcquireNextImageKHR that uses the same m_image_available semaphore has completed.
And yes, you already got it right: You need to use a fence for that which you pass to vkQueueSubmit. If you do not synchronize on the CPU, you'll shovel ever more work to the GPU (which is a problem) and the semaphores that you are re-using might not get properly unsignalled in time (which is a problem).
What is often done is that the semaphores and fences are multiplied, e.g. to 3 each, and these sets of synchronization objects are used in sequence, so that more work can be parallelized on the GPU. The Vulkan Tutorial describes this quite nicely in its Rendering and presentation chapter. It is also explained with animation in this lecture starting at 7:59.
So first of all, as you mentioned correctly, semaphores are strictly for GPU-GPU synchronization, e.g. to make sure that one batch of commands (one submit) has finished before another one starts. This is here used to synchronize the rendering commands with the present command such that the presenting engine knows when to present the rendered image.
Fences are the main utility for CPU-GPU synchronization. You place a fence in a queue submit and then on the CPU side wait for it before you want to proceed. This is usually done here such that we do not queue any new rendering/present commands while the previous frame hasn't finished.
But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
    throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR?
Yes, you definitely need this in your code, otherwise your semaphores would not be safe and you would probably get validation errors.
In general, if you want your CPU to wait until your GPU has finished rendering of the previous frame, you would have only a single fence and a single pair of semaphores. You could also replace the fence by a waitIdle command of the queue or device.
However, in practice you do not want to stall the CPU; you want to record commands for the next frame in the meantime. This is done via frames in flight: for every frame in flight (i.e. each of the frames that can be recorded in parallel to execution on the GPU), you have one fence and one pair of semaphores which synchronize that particular frame.
So in essence, for your render loop to work properly you need a pair of semaphores plus a fence per frame in flight, independent of the number of swapchain images. However, do note that the current frame index (frame in flight) and the image index (swapchain) will generally not be the same, unless you use the same number of swapchain images as frames in flight. This is because the presentation engine might give you swapchain images out of order, depending on your present mode.
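
A minimal sketch of that per-frame-in-flight setup, roughly following the Vulkan Tutorial pattern referenced above; the value of k, the array names and the omitted command recording are illustrative, and the device()/swap_chain()/queue helpers are assumed to be the same as in the question's render():

#include <array>
#include <cstdint>
#include <vulkan/vulkan.h>

constexpr std::uint32_t k = 3;                      // frames in flight (illustrative)
std::array<VkFence, k>     in_flight_fence;         // created with VK_FENCE_CREATE_SIGNALED_BIT
std::array<VkSemaphore, k> image_available;         // signalled when the image can be rendered to
std::array<VkSemaphore, k> rendering_finished;      // waited on by vkQueuePresentKHR
std::uint32_t current_frame = 0;                    // the index i from the question's notation

void render_one_frame()
{
    // Wait until the submit that last used this frame's objects has retired, so its
    // semaphores are guaranteed to be unsignalled before we reuse them.
    vkWaitForFences(device(), 1, &in_flight_fence[current_frame], VK_TRUE, UINT64_MAX);
    vkResetFences(device(), 1, &in_flight_fence[current_frame]);

    std::uint32_t image_index;
    vkAcquireNextImageKHR(device(), swap_chain().handle(), UINT64_MAX,
                          image_available[current_frame], VK_NULL_HANDLE, &image_index);

    // Fill submit_info / present_info exactly as in the question's render(), but with
    // this frame's semaphores, and attach this frame's fence to the submit.
    VkSubmitInfo submit_info{ /* ... uses image_available[current_frame] and
                                 rendering_finished[current_frame] ... */ };
    vkQueueSubmit(graphics_queue().handle, 1, &submit_info, in_flight_fence[current_frame]);

    VkPresentInfoKHR present_info{ /* ... waits on rendering_finished[current_frame] ... */ };
    vkQueuePresentKHR(presentation_queue().handle, &present_info);

    current_frame = (current_frame + 1) % k;        // advance to the next frame in flight
}

Note that the sync object index follows current_frame, not image_index, which is why the answer stresses that the two will generally differ.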

Learning about multithreading. Tried to make a prime number finder

I'm studying for a uni project and one of the requirements is to include multithreading. I decided to make a prime number finder and, while it works, it's rather slow. My best guess is that this has to do with the number of threads I'm creating and destroying.
My approach was to take the range of candidate divisors below N and distribute it evenly across M threads (where M = number of cores, 8 in my case); however, these threads are created and destroyed every time N increases.
Pseudocode looks like this:
for each core
    # new thread
    for i in (range / numberOfCores) * currentCore
        if !possiblePrimeIsntActuallyPrime
            if possiblePrime % i == 0
                possiblePrimeIsntActuallyPrime = true
                return
        else
            return
This does work, but creating 8 threads for every candidate prime seems to slow the system down.
Any suggestions on how to optimise this further?
Use thread pooling.
Create 8 threads and store them in an array. Feed each one new data when it finishes and start it again. This prevents the threads from having to be created and destroyed each time.
Also, when calculating your range of numbers to check, only check up to ceil(sqrt(N)): any factor above that either doesn't divide N, or its corresponding cofactor is below sqrt(N) and has already been checked. For example, ceil(sqrt(24)) is 5.
Once you have checked up to 5 you don't need to check anything else, because 6 goes into 24 four times and 4 has already been checked, 8 goes into it 3 times and 3 has been checked, etc.
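
A minimal sketch of those suggestions in C++, with long-lived worker threads and divisor checks only up to sqrt(n); as a simplification it hands each worker a strided share of the candidates rather than a share of the divisors, and the thread count and names are illustrative:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// Trial division up to sqrt(n): any factor above sqrt(n) pairs with one below it.
static bool is_prime(std::uint64_t n)
{
    if (n < 2) return false;
    for (std::uint64_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

int main()
{
    const std::uint64_t limit = 1'000'000;                       // find primes below this
    const unsigned num_threads =
        std::max(1u, std::thread::hardware_concurrency());       // e.g. 8 on the poster's machine

    std::vector<std::vector<std::uint64_t>> found(num_threads);  // one result list per thread
    std::vector<std::thread> workers;

    // Threads are created once and each walks a strided share of the candidates,
    // instead of spawning 8 new threads per candidate.
    for (unsigned t = 0; t < num_threads; ++t)
    {
        workers.emplace_back([t, num_threads, limit, &found] {
            for (std::uint64_t n = 2 + t; n < limit; n += num_threads)
                if (is_prime(n))
                    found[t].push_back(n);
        });
    }
    for (auto& w : workers)
        w.join();

    std::size_t total = 0;
    for (const auto& v : found)
        total += v.size();
    std::cout << total << " primes below " << limit << "\n";
}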

SDL_Mixer is playing single chunk over itself possible?

I'm having trouble with SDL_Mixer (my lack of experience). Chunks and Music play just fine (using Mix_PlayChannel and Mix_PlayMusic), and playing two different chunks simultaneously isn't an issue.
My problem is that I would like to play some chunk1 and then play a second iteration of chunk1 overlapping the first. I am trying to play a single chunk in rapid succession, but instead it plays the sound repeatedly at a much longer interval (not as quickly as I want). I've tested console output and my method of playing/looping is not at fault, since I can see the console messages printing, looped at the right speed.
I have an array of Chunks that I periodically load during initialization, using Mix_LoadWAV();
Mix_Chunk *sounds[32];
I also have a function reserved for playing these chunks:
void PlaySound(int snd_id)
{
    if (snd_id >= 0 && snd_id < 32)
    {
        if (Mix_PlayChannel(-1, sounds[snd_id], 0) == -1)
        {
            printf("Mix_PlayChannel: %s\n", Mix_GetError());
        }
    }
}
Attempting to play a single sound several times in rapid succession (say, with a 100 ms delay, i.e. 10 plays per second), I get the sound playing at a fixed, slower interval (some 500 ms or so, about 2 plays per second) despite the function being called 10 times per second.
I already used "Mix_AllocateChannels(16);" to ensure I have allocated channels (let me know if I'm using that incorrectly) and still, a single chunk from the array refuses to play at a certain rate.
Any ideas/help is appreciated, as well as critique on how I posted this question.
As said in the documentation of SDL_Mixer (https://www.libsdl.org/projects/SDL_mixer/docs/SDL_mixer_28.html) :
"... -1 for the first free unreserved channel."
So if your chunk is longer than 1.6 seconds (16 channels * 100 ms), you'll run out of channels after 1.6 seconds, and you won't be able to play new chunks until one of the channels finishes playing.
So there are basically 2 solutions:
Allocate more channels (more than ChunkDuration (in sec) / Delay (in sec)).
Stop a channel so that you can reuse it. To do this properly, don't pass -1 as the channel; instead use a variable that you increment each time you play a chunk (and set back to 0 when it equals your number of channels).
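
A minimal sketch of that second option, using a round-robin channel index that evicts whatever is currently playing on that channel (the constant and variable names are illustrative, not the poster's code):

#include <SDL_mixer.h>
#include <stdio.h>

#define NUM_CHANNELS 16            /* matches the Mix_AllocateChannels(16) call above */

static Mix_Chunk *sounds[32];      /* loaded with Mix_LoadWAV() during init, as in the question */
static int next_channel = 0;       /* round-robin index */

void PlaySound(int snd_id)
{
    if (snd_id >= 0 && snd_id < 32)
    {
        /* Free this channel so the new instance starts immediately; older copies of the
           same chunk keep playing on the other channels, letting the sound overlap itself. */
        Mix_HaltChannel(next_channel);
        if (Mix_PlayChannel(next_channel, sounds[snd_id], 0) == -1)
        {
            printf("Mix_PlayChannel: %s\n", Mix_GetError());
        }
        next_channel = (next_channel + 1) % NUM_CHANNELS;
    }
}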

Vulkan samples: vkQueueSubmit always followed by vkWaitForFences?

In the API-Samples that come with Vulkan, it appears that there is always a call to vkWaitForFences after a call to vkQueueSubmit, either directly or via execute_queue_command_buffer (in util_init.hpp). The call to vkWaitForFences will block the CPU execution until the GPU has finished doing all the work in the previous vkQueueSubmit. This in effect doesn't allow multiple frames to be constructed simultaneously, which (in theory) is limiting performance significantly.
Are these calls required, and if so, is there another way to not require the GPU to be idle before constructing a new frame?
The way we achieved multiple frames in flight is to have a fence for each swapchain framebuffer. Then still use vkWaitForFences, but wait on the ((n + 1) % num_fences) fence.
There is example code here https://imgtec.com/tools/powervr-early-access-program/
uint32_t current_buffer = num_swaps_ % swapchain_fences.size();
vkQueueSubmit(graphics_queue, 1, &submit_info, swapchain_fences[current_buffer]);

// Wait for a queue submit to finish so we can continue rendering if we are n-2 frames behind
if (num_swaps_ > swapchain_fences.size() - 1)
{
    uint32_t fence_to_wait_for = (num_swaps_ + 1) % swapchain_fences.size();
    vkWaitForFences(device, 1, &swapchain_fences[fence_to_wait_for], true, UINT64_MAX);
    vkResetFences(device, 1, &swapchain_fences[current_buffer]);
}
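
For comparison, a minimal sketch of the ordering that is more commonly used today: wait on and reset the fence of the slot you are about to reuse before submitting with it (same variable names as the snippet above; the increment of num_swaps_ is assumed to happen here):

uint32_t current_buffer = num_swaps_ % swapchain_fences.size();

// If this fence slot has been used before, make sure that older submit has retired,
// then reset the fence so the new submit can signal it again.
if (num_swaps_ >= swapchain_fences.size())
{
    vkWaitForFences(device, 1, &swapchain_fences[current_buffer], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &swapchain_fences[current_buffer]);
}

vkQueueSubmit(graphics_queue, 1, &submit_info, swapchain_fences[current_buffer]);
++num_swaps_;

This avoids resetting a fence that is still attached to an in-flight submit.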