Vulkan: How to record command buffers in a separate thread? - c++

I don't properly understand how to parallelize work on separate threads in Vulkan.
In order to begin issuing vkCmd*s, you need to begin a render pass. The call to begin render pass needs a reference to a framebuffer. However, vkAcquireNextImageKHR() is not guaranteed to return image indexes in a round robin way. So, in a triple-buffering setup, if the current image index is 0, I can't just bind framebuffer 1 and start issuing draw calls for the next frame, because the next call to vkAcquireNextImageKHR() might return image index 2.
What is a proper way to record commands without having to specify the framebuffer to use ahead of time?

You have one or more render passes that you want to execute per-frame. And each one has one or more subpasses, into which you want to pour work. So your main rendering thread will generate one or more secondary command buffers for those subpasses, and it will pass that sequence of secondary CBs off to the submission thread.
The submission thread will create the primary CB that actually gets submitted. It begins/ends render passes, and into each subpass it executes the secondary CB(s) created on the rendering thread for that particular subpass.
So each thread is creating its own command buffers. The submission thread is the one that deals with the VkFramebuffer object, since it begins the render passes. It also is the one that acquires the swapchain images and so forth. The render thread is the one making the secondary CBs that do all of the real work.
Yes, you'll still be doing some CB building on the submission thread, but it ought to be pretty minimalistic overall. This also serves to abstract away the details of the render targets from your rendering thread, so that code dealing with the swapchain can be localized to the submission thread. This gives you more flexibility.
For example, if you want to triple buffer, and the swapchain doesn't actually allow that, then your submission thread can create its own extra images, then copy from its internal images into the real swapchain. The rendering thread's code does not have to be disturbed at all to allow this.
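A minimal sketch of the submission-thread side, assuming the render thread has already handed over a vector of finished secondary command buffers (primaryCB, secondaryCBs, framebuffers, clearValue and swapchainExtent are illustrative names):

// Submission thread: build the primary command buffer that owns the
// render pass instance and executes the render thread's secondary CBs.
VkCommandBufferBeginInfo beginInfo{};
beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
vkBeginCommandBuffer(primaryCB, &beginInfo);

VkRenderPassBeginInfo rpInfo{};
rpInfo.sType = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO;
rpInfo.renderPass = renderPass;
rpInfo.framebuffer = framebuffers[imageIndex]; // chosen after vkAcquireNextImageKHR
rpInfo.renderArea = {{0, 0}, swapchainExtent};
rpInfo.clearValueCount = 1;
rpInfo.pClearValues = &clearValue;

// The render pass instance consists only of secondary command buffers.
vkCmdBeginRenderPass(primaryCB, &rpInfo, VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);
vkCmdExecuteCommands(primaryCB, static_cast<uint32_t>(secondaryCBs.size()), secondaryCBs.data());
vkCmdEndRenderPass(primaryCB);

vkEndCommandBuffer(primaryCB);

Each secondary CB must have been recorded with VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT and a VkCommandBufferInheritanceInfo naming the render pass and subpass; the inheritance info's framebuffer member may be left as VK_NULL_HANDLE, which is exactly what lets the render thread record without knowing which swapchain image will be used.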

You can use multiple threads to generate draw commands for the same renderpass using secondary command buffers. And you can generate work for different renderpasses in the same frame in parallel -- only the very last pass (usually a postprocess pass) depends on the specific swapchain image; all your shadow passes, gbuffer/shading/lighting passes, and all but the last postprocess pass don't. It's not required, but it's often a good idea to not even call vkAcquireNextImageKHR until you're ready to start generating the final renderpass, after you've already generated many of the prior passes.

First, to be clear:
In order to begin issuing vkCmd*s, you need to begin a render pass.
That is not necessarily true. In command buffers You can record many different commands, all of which begin with vkCmd. Only some of these commands need to be recorded inside a render pass - the ones that are connected with drawing. There are also some commands which cannot be called inside a render pass (for example, dispatching compute shaders). But this is just a side note to sort things out.
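A short illustration of that distinction (cmdBuf, beginInfo and the pipeline handles are assumed to exist): compute dispatches must be recorded outside a render pass instance, while draws must be recorded inside one.

vkBeginCommandBuffer(cmdBuf, &beginInfo);

// Outside a render pass: transfers, barriers and compute dispatches are legal.
vkCmdBindPipeline(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE, computePipeline);
vkCmdDispatch(cmdBuf, groupCountX, 1, 1);

// Inside a render pass: drawing commands are legal, vkCmdDispatch is not.
vkCmdBeginRenderPass(cmdBuf, &renderPassInfo, VK_SUBPASS_CONTENTS_INLINE);
vkCmdBindPipeline(cmdBuf, VK_PIPELINE_BIND_POINT_GRAPHICS, graphicsPipeline);
vkCmdDraw(cmdBuf, vertexCount, 1, 0, 0);
vkCmdEndRenderPass(cmdBuf);

vkEndCommandBuffer(cmdBuf);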
Next thing - the triple buffering You mentioned. In Vulkan the way images are displayed depends on the supported present mode. Different hardware vendors, or even different driver versions, may offer different present modes, so on one platform You may get a present mode that is most similar to triple buffering (MAILBOX), but on another You may not get it. The present mode impacts the way the presentation engine allows You to acquire images from a swapchain and then displays them on screen. But as You noted, You cannot depend on the order of returned images, so You shouldn't design Your application as if You always get the same behavior on all platforms.
But to answer Your question - the easiest, naive way is to call vkAcquireNextImageKHR() at the beginning of a frame, record command buffers that use the image it returns, submit those command buffers and present the image. You can create framebuffers on demand, just before You need to use them inside a command buffer: You create a framebuffer that uses the appropriate image (the one associated with the index returned by the vkAcquireNextImageKHR() function) and, after the command buffers are submitted and stop using it, You destroy it. Such behavior is presented in the Vulkan Cookbook.
A more appropriate way would be to prepare framebuffers for all available swapchain images and take the appropriate framebuffer during a frame. But You need to remember to recreate them when You recreate the swapchain.
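A sketch of that approach, assuming swapchainImageViews already holds one VkImageView per swapchain image (the other handles are illustrative):

// One framebuffer per swapchain image, created once up front
// (and created again after every swapchain recreation).
std::vector<VkFramebuffer> framebuffers(swapchainImageViews.size());
for (size_t i = 0; i < swapchainImageViews.size(); ++i) {
    VkFramebufferCreateInfo fbInfo{};
    fbInfo.sType = VK_STRUCTURE_TYPE_FRAMEBUFFER_CREATE_INFO;
    fbInfo.renderPass = renderPass;
    fbInfo.attachmentCount = 1;
    fbInfo.pAttachments = &swapchainImageViews[i];
    fbInfo.width = swapchainExtent.width;
    fbInfo.height = swapchainExtent.height;
    fbInfo.layers = 1;
    vkCreateFramebuffer(device, &fbInfo, nullptr, &framebuffers[i]);
}

// During the frame: pick the framebuffer matching the acquired index.
uint32_t imageIndex;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, imageAvailableSemaphore, VK_NULL_HANDLE, &imageIndex);
VkFramebuffer current = framebuffers[imageIndex];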
More advanced scenarios postpone swapchain image acquisition until it is really needed. The vkAcquireNextImageKHR() function call may block Your application (waiting until an image is available), so it should be called as late as possible when You prepare a frame. That's why You should first record command buffers that don't need to reference swapchain images (for example those that render geometry into a G-buffer in deferred shading algorithms). After that, when You want to display the image on screen (for example with some postprocessing technique), You just take the approach described above: acquire an image, prepare the appropriate command buffer(s) and present the image.
You can also pre-record command buffers that reference particular swapchain images. If You know that the source of Your images will always be the same (like the mentioned G-buffer), You can have a set of command buffers that always perform some postprocess/copy-like operations from this data to all swapchain images - one command buffer per swapchain image. Then, during the frame, if all of Your data is ready, You acquire an image, check which pre-recorded command buffer is appropriate and submit the one associated with the acquired image.
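In code, that per-frame selection is just indexing with the acquired image (a sketch; prerecordedCBs is assumed to hold one fully recorded primary command buffer per swapchain image):

uint32_t imageIndex;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, imageAvailableSemaphore, VK_NULL_HANDLE, &imageIndex);

VkSubmitInfo submitInfo{};
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.waitSemaphoreCount = 1;
submitInfo.pWaitSemaphores = &imageAvailableSemaphore;
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
submitInfo.pWaitDstStageMask = &waitStage;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &prerecordedCBs[imageIndex]; // one CB per swapchain image
submitInfo.signalSemaphoreCount = 1;
submitInfo.pSignalSemaphores = &renderFinishedSemaphore;
vkQueueSubmit(graphicsQueue, 1, &submitInfo, inFlightFence);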
There are multiple ways to achieve what You want, and all of them depend on many factors - performance, platform, the specific goal You want to achieve, the type of operations You perform in Your application, the synchronization mechanisms You implemented and many other things. You need to figure out what suits You best. But in the end - You need to reference a swapchain image in command buffers if You want to display an image on screen. I'd suggest starting with the easiest option first and then, when You get used to it, improving Your implementation for higher performance, flexibility, easier code maintenance etc.

You can call vkAcquireNextImageKHR in any thread, as long as you make sure access to the swapchain, semaphore and fence you pass to it is synchronized.
There is nothing else restricting you from calling it in any thread, including the recording thread.
You are also allowed to have multiple images acquired at a time, assuming the swapchain was created with enough images. In other words, acquiring the next image before you present the current one is allowed.

Related

Is synchronization needed between multiple draw calls with transparency in Vulkan?

I'm in the process of learning Vulkan, and I have just integrated ImGui into my code using the Vulkan-GLFW example in the original ImGui repo, and it works fine.
Now I want to render both the GUI and my 3D model on the screen at the same time, and since the GUI and the model definitely need different shaders, I need to use multiple pipelines and submit multiple commands. The GUI is partly transparent, so I would like it to be rendered after the model. The Vulkan spec states that the execution order of commands is not necessarily the order in which I record them, so I need synchronization of some kind. In this Reddit post several methods of achieving exactly my goal were proposed, and I once believed that I must use multiple subpasses (together with subpass dependencies) or barriers or other synchronization methods like that to solve this problem.
Then I had a look at SaschaWillems' Vulkan examples. In the ImGui example, though, I see no synchronization between the two draw calls; it just records the command to draw the model first, and then the command to draw the GUI.
I am confused. Is synchronization really needed in this case, or did I misunderstand something about command re-ordering or blending? Thanks.
Think about what you're doing for a second. Why do you think there needs to be synchronization between the two sets of commands? Because the second set of commands needs to blend with the data in the first set, right? And therefore, it needs to do a read/modify/write (RMW), which must be able to read data written by the previous set of commands. The data being read has to have been written, and that typically requires synchronization.
But think a bit more about what that means. Blending has to read from the framebuffer to do its job. But... so does the depth test, right? It has to read the existing sample's depth value, compare it with the incoming fragment, and then discard the fragment or not based on the depth test. So basically every draw call that uses a depth test contains a framebuffer read/modify/write.
And yet... your depth tests work. Not only do they work between draw calls without explicit synchronization, they also work within a draw call. If two triangles in a draw call overlap, you don't have any problem with seeing the bottom one through the top one, right? You don't have to do inter-triangle synchronization to make sure that the previous triangles' writes are finished before the reads.
So somehow, the depth test's RMW works without any explicit synchronization. So... why do you think that this is untrue of the blend stage's RMW?
The Vulkan specification states that commands, and stages within commands, will execute in a largely unordered way, with several exceptions. The most obvious being the presence of explicit execution barriers/dependencies. But it also says that the fixed-function per-sample testing and blending stages will always execute (as if) in submission order (within a subpass). Not only that, it requires that the triangles generated within a command also execute these stages (as if) in a specific, well-defined order.
That's why your depth test doesn't need synchronization; Vulkan requires that this is handled. This is also why your blending will not need synchronization (within a subpass).
So you have plenty of options (in order from fastest to slowest):
Render your UI in the same subpass as the non-UI. Just change pipelines as appropriate.
Render your UI in a subpass with an explicit dependency on the framebuffer images of the non-UI subpass (a sketch of such a dependency follows this list). While this is technically slower, it probably won't be slower by much, if at all. Also, this is useful for deferred rendering, since your UI needs to happen after your lighting pass, which will undoubtedly be its own subpass.
Render your UI in a different render pass. This would only really be needed for cases where you need to do some full-screen work (SSAO) that would force your non-UI render pass to terminate anyway.
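For option 2, this is roughly what the dependency between the scene subpass and the UI subpass would look like (the subpass indices 0 and 1 are assumptions about how the render pass is laid out):

// Subpass 1 (UI) blends against what subpass 0 (scene) wrote to the
// color attachment, so chain the color-output stages of the two subpasses.
VkSubpassDependency dep{};
dep.srcSubpass = 0; // scene subpass
dep.dstSubpass = 1; // UI subpass
dep.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dep.dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dep.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
dep.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
dep.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;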

How to stop clearing between command buffers?

I'm trying to get ImGui working in my engine but having some trouble "overlaying" it over my cube mesh. I split the two into separate command buffers like this:
std::array<VkCommandBuffer, 2> cmdbuffers = { commandBuffers[imageIndex], imguicmdbuffers[imageIndex] };
And then in my queue submit info I put the command buffer count to 2 and pass it the data like so
submitInfo.commandBufferCount = 2;
submitInfo.pCommandBuffers = cmdbuffers.data();
But what happens now is that it only renders ImGui, or if I switch the order in the array it only renders the cube, never both. Is it because they share the same render pass? I changed the VkRenderPassBeginInfo clear color to double check, and indeed it either clears yellow and draws ImGui or clears red and draws the cube. I've tried setting the clear alpha to 0, but that doesn't work and seems like a hack anyway. I feel like I lack understanding of how it submits and executes the command buffers and how that's tied to render passes/framebuffers, so what's up?
Given the following statements (that is, assuming they are accurate):
they share the same render pass
in my queue submit info I put the command buffer count to 2
VkRenderPassBeginInfo clear color
Certain things about the nature of your rendering become apparent (things you didn't directly state or provide code for). First, you are submitting two separate command buffers directly to the queue. Only primary command buffers can be submitted to the queue.
Second, by the nature of render passes, a render pass instance cannot span primary command buffers. So you must have two render pass instances.
Third, you specify that you can change the clear color of the image when you begin the render pass instance. Ergo, the render pass must specify that the image gets cleared as its load-op.
From all of this, I conclude that you are beginning the same VkRenderPass twice. A render pass that, as previously deduced, is set to clear the image at the beginning of the render pass instance. Which will dutifully happen both times, the second of which will wipe out everything that was rendered to that image beforehand.
Basically, you have two rendering operations, using a render pass that's set to destroy the data created by any previous rendering operation to the images it uses. That's not going to work.
You have a few ways of resolving this.
My preferred way would be to start employing secondary command buffers. I don't know if ImGui can be given a CB to record its data into. But if it can, I would suggest making it record its data into a secondary CB. You can then execute that secondary CB into the appropriate subpass of your renderpass. And thus, you don't submit two primary CBs; you only submit one.
Alternatively, you can make a new VkRenderPass which doesn't clear the previous image; its load-op should be VK_ATTACHMENT_LOAD_OP_LOAD instead, so it loads the existing image data. Your second rendering operation would use that render pass, while your initial one would retain the clear load-op. A sketch of this follows below.
Worst-case scenario, you can have the second operation render to an entirely different image, and then merge it with the primary image in a third rendering operation.
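For the second option above, the key difference between the two render passes is just the color attachment's load operation. A sketch of the attachment description for the non-clearing pass (swapchainFormat is an assumed variable):

// Second render pass: preserve whatever the first pass rendered.
VkAttachmentDescription color{};
color.format = swapchainFormat;
color.samples = VK_SAMPLE_COUNT_1_BIT;
color.loadOp = VK_ATTACHMENT_LOAD_OP_LOAD; // not ..._CLEAR
color.storeOp = VK_ATTACHMENT_STORE_OP_STORE;
color.stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
color.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
// Because we load, the initial layout must match what the first pass
// left behind, rather than VK_IMAGE_LAYOUT_UNDEFINED.
color.initialLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
color.finalLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;

The first render pass's finalLayout then has to be adjusted to match (VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL rather than a present layout), since it is now the second pass that hands the image to the presentation engine.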

Beating the state machine

I'm working on a plugin for a scripting language that allows the user to access the OpenGL 1.1 command set. On top of that, all functions of the scripting language's own gfx command set are transparently redirected to appropriate OpenGL calls. Normally, the user should use either the OpenGL command set or the scripting language's inbuilt gfx command set which basically contains just your typical 2D drawing commands like DrawLine(), DrawRectangle(), DrawPolygon(), etc.
Under certain conditions, however, the user might want to mix calls to the OpenGL and the inbuilt gfx command sets. This leads to the problem that my OpenGL implementations of inbuilt commands like DrawLine(), DrawRectangle(), DrawPolygon(), etc. have to be able to deal with whatever state the OpenGL state machine might currently be in.
Therefore, my idea was to first save all state information on the stack, then prepare a clean OpenGL context needed for my implementations of commands like DrawLine(), etc. and then restore the original state. E.g. something like this:
glPushAttrib(GL_ALL_ATTRIB_BITS);
glPushClientAttrib(GL_CLIENT_ALL_ATTRIB_BITS);
glPushMatrix();
// ...prepare OpenGL context for my needs... --> problem: see #2 below
// ...do drawing...
glPopMatrix();
glPopClientAttrib();
glPopAttrib();
Doing it like this, however, leads to several problems:
glPushAttrib() doesn't push all attributes: e.g. the pixel pack and unpack state, render mode state, and select and feedback state are not pushed. Also, extension states are not saved. Extension states are not important, as my plugin is not designed to support extensions. Saving and restoring the other information (pixel pack and unpack) could probably be implemented manually using glGet().
Big problem: how should I prepare the OpenGL context after having saved all state information? I could save a copy of a "clean" state on the stack right after OpenGL's initialization and then try to pop it, but for this I'd need a function to move data inside the stack, i.e. I'd need a function to copy or move a saved state from the bottom to the top of the stack so that I can pop it. But I didn't see a function that can accomplish this...
It's probably going to be very slow, but this is something I could live with because the user should not mix OpenGL and inbuilt gfx calls. If he does nevertheless, he will have to live with very poor performance.
After these introductory considerations I'd finally like to present my question: Is it possible to "beat" the OpenGL state machine somehow? By "beating" I mean the following: Is it possible to completely save all current state information, then restore the default state and prepare it for my needs and do the drawing, and then finally restore the complete previous state again so that everything is exactly as it was before. For example, an OpenGL based version of the scripting language's DrawLine() command would do something like this then:
1. Save all current state information
2. Restore default state, set up a 2D projection matrix
3. Draw the line
4. Restore all saved state information so that the state is exactly the same as before
Is that possible somehow? It doesn't matter if it's very slow as long as it is 100% guaranteed to put the state into exactly the same state as it was before.
You can simply use different contexts, especially if you do not care about performance. Just keep one context for your internal gfx operations and another one the user might mess with, and bind the appropriate one to your window (and thread).
The way you describe it, it sounds like you never want to share objects with the user's GL stuff, so simple "unshared" contexts will do fine. All you seem to want to share is the framebuffer - and the GL framebuffer (including back and front color buffers, depth buffer, stencil, etc.) is part of the drawable/window, not the context - so you get access to it with any context when you make that context current. Changing contexts mid-frame is not a problem.
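A sketch of the idea on Windows/WGL (an assumption about the platform; on X11 you would use glXCreateContext / glXMakeCurrent instead). Both contexts target the same window DC, so they share its framebuffer without sharing any state machine (hwnd, width and height are illustrative):

// Two independent contexts on the same window: the user's state
// machine lives in userCtx, our internal drawing runs in internalCtx.
HDC hdc = GetDC(hwnd);
HGLRC userCtx = wglCreateContext(hdc);
HGLRC internalCtx = wglCreateContext(hdc);

void DrawLineInternal(float x0, float y0, float x1, float y1) {
    HGLRC previous = wglGetCurrentContext(); // whatever the user had bound
    wglMakeCurrent(hdc, internalCtx);        // switch to our clean context

    // Set up our own 2D projection and draw; none of this touches userCtx.
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0, width, height, 0, -1, 1);
    glBegin(GL_LINES);
    glVertex2f(x0, y0);
    glVertex2f(x1, y1);
    glEnd();

    wglMakeCurrent(hdc, previous); // hand the window back untouched
}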

How to perform asynchronous off-screen queries?

I have several (potentially thousands of) scenes I would like to render in order to perform queries on what is drawn. The problem I am running into is that calls to glGetQueryObjectiv() are expensive, so I'd like to figure out a way to render several scenes in advance while I wait for the results of the queries to become available.
I have read a bit about Framebuffer Objects and Pixel Buffer Objects, but mostly in the context of saving to a file using glReadPixels() and I haven't been able to track down an example of using either of these objects in asynchronous queries. Is there any reason why the setup for performing glGetQueryObjectiv() would be different from glReadPixels() as in this example (e.g., should I use an FBO or a PBO)?
Note:
This is NOT for a graphics application. All I'm interested in is the result of the GL_SAMPLES_PASSED query (i.e. how many pixels were drawn?).
The specific application is estimating how much sunlight is striking a surface when other surfaces are casting shadows. If you are interested, you can read about it here.
Neither Framebuffer Objects nor (Pixel) Buffer Objects are the proper class of objects to use in an (asynchronous) query; GL has an entirely separate class of objects called Query Objects. You can feed the results of certain queries into Buffer Objects, but that is about as far as their relationship goes.
Basically the idea with asynchronous queries is that you put the query into the graphics pipeline and let some time pass before you try to read it back. This gives the GPU time to actually finish all of the commands in between the beginning and end of your query first. The alternative is a synchronous query, which means that your calling thread stops doing anything useful and waits for the GPU to finish everything necessary to complete your query.
The simplest way of implementing a query that does not force synchronization is to begin and end your query as you normally would, but rather than reading the result immediately, periodically check the status of GL_QUERY_RESULT_AVAILABLE for your query object in your application's main loop. When this gives you GL_TRUE, it means you can read the results from the query object without disrupting the render pipeline. If you try to get the results from a query object before it is in this state, you are going to force a synchronous query.
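A sketch of that polling pattern with a GL_SAMPLES_PASSED query (drawScene() is a placeholder for whatever rendering you want to measure):

// Issue the query around the draw, then return to the main loop
// without reading the result back.
GLuint query;
glGenQueries(1, &query);
glBeginQuery(GL_SAMPLES_PASSED, query);
drawScene();
glEndQuery(GL_SAMPLES_PASSED);

// Later, once per main-loop iteration: only read once the GPU is done.
GLint available = GL_FALSE;
glGetQueryObjectiv(query, GL_QUERY_RESULT_AVAILABLE, &available);
if (available == GL_TRUE) {
    GLint samplesPassed = 0;
    glGetQueryObjectiv(query, GL_QUERY_RESULT, &samplesPassed);
    // ...use samplesPassed, then reuse or delete the query object...
}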

A way of generating chunks

I'm making a game and I'm currently working on the generation of the map.
The map is generated procedurally with some algorithms. There are no problems with this.
The problem is that my map can be huge. So I've thought about cutting the map into chunks.
My chunks are fine; they're 512*512 pixels each. The only problem is that I have to generate a texture for each one (actually a RenderTexture from SFML). It takes around 0.5ms to generate, so it makes the game freeze each time I generate a chunk.
I've thought of a way to fix this: I've made a kind of thread pool with a factory. I just have to send a task to it and it creates the chunk.
Now that it's all implemented, it raises OpenGL warnings like:
"An internal OpenGL call failed in RenderTarget.cpp (219) : GL_INVALID_OPERATION, the specified operation is not allowed in the current state".
I don't know if this is the right way of dealing with chunks. I've also thought about saving the chunks into images / files, but I fear that it would take too much time to save / load them.
Do you know a better way to deal with this kind of "infinite" map?
It is an invalid operation because you must have a context bound in each thread that makes GL calls. More importantly, all of the GL window system APIs enforce a strict 1:1 mapping between threads and contexts: no thread may have more than one context bound, and no context may be bound in more than one thread. What you would need to do is use shared contexts (one context for drawing and one for each worker thread); things like buffer objects and textures will be shared between all shared contexts, but the state machine and container objects like FBOs and VAOs will not.
Are you using tiled rendering for this map, or is this just one giant texture?
If you do not need to update individual sub-regions of your "chunk" images you can simply create new textures in your worker threads. The worker threads can create new textures and give them data while the drawing thread goes about its business. Only after a worker thread finishes would you actually try to draw using one of the chunks. This may increase the overall latency between the time a chunk starts loading and eventually appears in the finished scene but you should get a more consistent framerate.
If you need to use a single texture for this, I would suggest you double buffer your texture. Have one that you use in the drawing thread and another one that your worker threads issue glTexSubImage2D (...) on. When the worker thread(s) finish updating their regions of the texture you can swap the texture you use for drawing and updating. This will reduce the amount of synchronization required, but again increases the latency before an update eventually appears on screen.
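A sketch of that double-buffering hand-off (the worker thread is assumed to have its own shared context bound, as described above; the texture IDs and mutex are illustrative):

#include <mutex>

// Two texture objects in one role: 'front' is drawn from, the other
// is written by the worker thread through its shared context.
GLuint tex[2];
int front = 0; // index the draw thread samples from
std::mutex swapMutex;

// Worker thread: upload a freshly generated chunk into the back texture.
void uploadChunk(const unsigned char* pixels, int w, int h) {
    int back = 1 - front; // only the worker ever changes 'front'
    glBindTexture(GL_TEXTURE_2D, tex[back]);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    glFinish(); // make the upload visible to the other context before swapping

    std::lock_guard<std::mutex> lock(swapMutex);
    front = back; // publish the new texture to the draw thread
}

// Draw thread: always sample from the current front texture.
void drawChunk() {
    std::lock_guard<std::mutex> lock(swapMutex);
    glBindTexture(GL_TEXTURE_2D, tex[front]);
    // ...issue the draw using this texture...
}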
things to try:
make your chunks smaller
generate the chunks in a separate thread, but pass them to the GPU from the main thread
pass to the GPU a small piece at a time, taking a second or two