Is synchronization needed between multiple draw calls with transparency in Vulkan? - c++

I'm in the process of learning Vulkan, and I have just integrated ImGui into my code using the Vulkan-GLFW example in the original ImGui repo, and it works fine.
Now I want to render both the GUI and my 3D model on the screen at the same time, and since the GUI and the model definitely need different shaders, I need to use multiple pipelines and submit multiple commands. The GUI is partly transparent, so I would like it to be rendered after the model. The Vulkan spec states that commands are not necessarily executed in the order I record them, so I assumed I need synchronization of some kind. In this Reddit post several methods of achieving exactly my goal were proposed, and I once believed that I must use multiple subpasses (together with a subpass dependency), barriers, or some other synchronization mechanism to solve this problem.
Then I had a look at SaschaWillems' Vulkan examples. In the ImGui example, though, I see no synchronization between the two draw calls; it just records the command to draw the model first, and then the command to draw the GUI.
I am confused. Is synchronization really needed in this case, or did I misunderstand something about command re-ordering or blending? Thanks.

Think about what you're doing for a second. Why do you think there needs to be synchronization between the two sets of commands? Because the second set of commands needs to blend with the data in the first set, right? And therefore, it needs to do a read/modify/write (RMW), which must be able to read data written by the previous set of commands. The data being read has to have been written, and that typically requires synchronization.
But think a bit more about what that means. Blending has to read from the framebuffer to do its job. But... so does the depth test, right? It has to read the existing sample's depth value, compare it with the incoming fragment, and then discard the fragment or not based on the depth test. So basically every draw call that uses a depth test contains a framebuffer read/modify/write.
And yet... your depth tests work. Not only do they work between draw calls without explicit synchronization, they also work within a draw call. If two triangles in a draw call overlap, you don't have any problem with seeing the bottom one through the top one, right? You don't have to do inter-triangle synchronization to make sure that the previous triangles' writes are finished before the reads.
So somehow, the depth test's RMW works without any explicit synchronization. So... why do you think that this is untrue of the blend stage's RMW?
The Vulkan specification states that commands, and stages within commands, will execute in a largely unordered way, with several exceptions. The most obvious being the presence of explicit execution barriers/dependencies. But it also says that the fixed-function per-sample testing and blending stages will always execute (as if) in submission order (within a subpass). Not only that, it requires that the triangles generated within a command also execute these stages (as if) in a specific, well-defined order.
That's why your depth test doesn't need synchronization; Vulkan requires that this is handled. This is also why your blending will not need synchronization (within a subpass).
So you have plenty of options (in order from fastest to slowest):
Render your UI in the same subpass as the non-UI. Just change pipelines as appropriate (there's a rough sketch of this after the list).
Render your UI in a subpass with an explicit dependency on the framebuffer images of the non-UI subpass. While this is technically slower, it probably won't be slower by much if at all. Also, this is useful for deferred rendering, since your UI needs to happen after your lighting pass, which will undoubtedly be its own subpass.
Render your UI in a different render pass. This would only really be needed for cases where you need to do some full-screen work (SSAO) that would force your non-UI render pass to terminate anyway.
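For the first option, a minimal sketch of what the command buffer recording might look like. The handle names, the clear-value layout, and the use of the ImGui Vulkan backend are assumptions about your setup, not something the answer mandates; error checking is omitted:

```cpp
#include <vulkan/vulkan.h>
#include "imgui.h"
#include "imgui_impl_vulkan.h"

// Model and GUI recorded into the same subpass, no barrier between the two draws.
void recordFrame(VkCommandBuffer cmd, VkRenderPass renderPass, VkFramebuffer framebuffer,
                 VkExtent2D extent, VkPipeline modelPipeline, VkPipelineLayout modelLayout,
                 VkDescriptorSet modelSet, VkBuffer modelVertexBuffer, uint32_t modelVertexCount)
{
    VkCommandBufferBeginInfo beginInfo{VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO};
    vkBeginCommandBuffer(cmd, &beginInfo);

    VkClearValue clears[2] = {};                 // color + depth attachments
    clears[1].depthStencil = {1.0f, 0};

    VkRenderPassBeginInfo rpInfo{VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO};
    rpInfo.renderPass      = renderPass;
    rpInfo.framebuffer     = framebuffer;
    rpInfo.renderArea      = {{0, 0}, extent};
    rpInfo.clearValueCount = 2;
    rpInfo.pClearValues    = clears;
    vkCmdBeginRenderPass(cmd, &rpInfo, VK_SUBPASS_CONTENTS_INLINE);

    // 1) Opaque 3D model first.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, modelPipeline);
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, modelLayout,
                            0, 1, &modelSet, 0, nullptr);
    VkDeviceSize offset = 0;
    vkCmdBindVertexBuffers(cmd, 0, 1, &modelVertexBuffer, &offset);
    vkCmdDraw(cmd, modelVertexCount, 1, 0, 0);

    // 2) Transparent GUI second. The ImGui backend binds its own pipeline; its blending
    //    reads what the model wrote, ordered by submission order within the subpass,
    //    so no explicit barrier is recorded between the two draws.
    ImGui_ImplVulkan_RenderDrawData(ImGui::GetDrawData(), cmd);

    vkCmdEndRenderPass(cmd);
    vkEndCommandBuffer(cmd);
}
```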

Related

C++ OpenGL wrapper: interface similar to fixed pipeline, can export .collada

I have opengl code that uses the fixed pipeline.
Hitting two birds with one stone, I need a wrapper that can help me with the following tasks:
Convert the code to the new shader-based pipeline with minimal effort.
I have a class that calls opengl functions, such as: glBegin(triangles/lines), glVertex, glPushMatrix, glTranslate, glColor, gluSphere.
Ideally, I'd like it to derive from a class that supplies these functions in the base class. Behind the scenes, it would use the same high level logic as the fixed pipeline.
I'd like to export an opengl scene to .collada to load in an external renderer.
OpenGL is a low-level rendering API, and it doesn't have the concept of a scene. For example, this reddit post:
"You realize that you have to write a shim to capture all API calls
you are interested in to do that. Then, when finally, a draw call is
emitted you have to parse every single vertex and collect the data
from all over the memory from the buffers that you have recorded from
the APi calls that set up VAOs, VBOs and IBOs. Then you have to parse
the shader source code so that you can see which uniforms and vertex
attributes contribute to vertex clip coordinate generation. Then you
also have to synthesize/guess which outputs are normal, color, texture
coordinate and so on from the shader source if the resulting program
even have those in .obj file format-wise.
This gets even more complicated if Compute is used to generate data
inside the GPU for any of the buffers. If geometry or tessellator is
used then you also have to implement one of those so that you get
accurate outputs from the vertex processing. TL;DR - you have to write
your own OpenGL 4.5 driver that does exactly the same things a real
hardware driver would do. Good luck with that."
However, my scene is simple, using the fixed pipeline operations above.
I'd like the wrapper to keep track and construct a scene that can be exported.
--
EDIT: Since asking for recommendations is off-topic, I'll ask the following question instead.
What I need above seems like something obvious that many should have found useful. Since I can't find a library that accomplishes that, I'm wondering if my approach is unreasonable?
More specifically, how do people port their legacy opengl code; do they write the relevant part from scratch, or does everyone implement their own wrapper as I suggested?
What about constructing a scene to export to collada?
Posted also:
https://community.khronos.org/t/c-opengl-wrapper-interface-similar-to-fixed-pipeline-can-export-collada/105829
Although there are some parts in legacy OpenGL that are not optimized in current drivers (like glDrawPixels, the raster drawing operations and indexed color mode), between modern hardware and the modest requirements of legacy applications, legacy OpenGL stuff runs well enough on modern systems.
The main reason to "modernize" legacy OpenGL code is if one wants to make use of modern features. Any sort of "wrapper" will just run into the same kind of design problems that the OpenGL API ran into between OpenGL-1.5 and OpenGL-2.1: lots of built-in variables, default state, implicit actions, etc. This is difficult to document properly, and even more difficult to use reliably, which is the reason you usually don't find these kinds of wrappers.
If you find yourself in the situation that you absolutely must port your legacy code to modern OpenGL, e.g. to be interoperable with core contexts, then your best course of action will be to do a proper rewrite. Replace immediate mode calls with filling vertex buffers, and replace calls to glTexEnv…, glMaterial…, glLight… with loading appropriate shaders and setting their uniforms.
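As a rough illustration of what that rewrite looks like in practice, here is a minimal sketch of a glBegin/glVertex/glColor triangle moved to a vertex buffer and a shader. The GLEW loader, the `program` shader object, the attribute locations and the `mvp` matrix are hypothetical, and error handling is omitted:

```cpp
#include <GL/glew.h>
#include <vector>

struct Vertex { float px, py, pz, r, g, b; };

void uploadAndDrawTriangle(GLuint program, const float* mvp /* 4x4 matrix, column-major */)
{
    // The data that glColor3f/glVertex3f used to emit call-by-call now goes into a buffer once.
    std::vector<Vertex> vertices = {
        { 0.f,  1.f, 0.f,  1.f, 0.f, 0.f},
        {-1.f, -1.f, 0.f,  0.f, 1.f, 0.f},
        { 1.f, -1.f, 0.f,  0.f, 0.f, 1.f},
    };

    GLuint vao = 0, vbo = 0;
    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertices.size() * sizeof(Vertex), vertices.data(), GL_STATIC_DRAW);

    // Position replaces glVertex (attribute 0), color replaces glColor (attribute 1).
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)0);
    glEnableVertexAttribArray(1);
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)(3 * sizeof(float)));

    // glPushMatrix/glTranslate and the fixed-function state become uniforms on your own shader.
    glUseProgram(program);
    glUniformMatrix4fv(glGetUniformLocation(program, "uModelViewProj"), 1, GL_FALSE, mvp);

    glDrawArrays(GL_TRIANGLES, 0, (GLsizei)vertices.size());
}
```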
Or, if you want a quick and dirty method: just create two contexts, a modern one and a legacy one, and switch between them; often you can establish "list" sharing between them.

Vulkan: How to record command buffers in separate thread?

I don't properly understand how to parallelize work on separate threads in Vulkan.
In order to begin issuing vkCmd*s, you need to begin a render pass. The call to begin render pass needs a reference to a framebuffer. However, vkAcquireNextImageKHR() is not guaranteed to return image indexes in a round robin way. So, in a triple-buffering setup, if the current image index is 0, I can't just bind framebuffer 1 and start issuing draw calls for the next frame, because the next call to vkAcquireNextImageKHR() might return image index 2.
What is a proper way to record commands without having to specify the framebuffer to use ahead of time?
You have one or more render passes that you want to execute per-frame. And each one has one or more subpasses, into which you want to pour work. So your main rendering thread will generate one or more secondary command buffers for those subpasses, and it will pass that sequence of secondary CBs off to the submission thread.
The submission thread will create the primary CB that gets rendered. It begins/ends render passes, and into each subpass, it executes the secondary CB(s) created on the rendering thread for that particular subpass.
So each thread is creating its own command buffers. The submission thread is the one that deals with the VkFramebuffer object, since it begins the render passes. It also is the one that acquires the swapchain images and so forth. The render thread is the one making the secondary CBs that do all of the real work.
Yes, you'll still be doing some CB building on the submission thread, but it ought to be pretty minimalistic overall. This also serves to abstract away the details of the render targets from your rendering thread, so that code dealing with the swapchain can be localized to the submission thread. This gives you more flexibility.
For example, if you want to triple buffer, and the swapchain doesn't actually allow that, then your submission thread can create its own extra images, then copy from its internal images into the real swapchain. The rendering thread's code does not have to be disturbed at all to allow this.
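A minimal sketch of that split, assuming hypothetical handle names; real code also needs one command pool per thread and the usual error handling:

```cpp
#include <vulkan/vulkan.h>

// Rendering thread: record a secondary command buffer for one subpass.
// It only needs the render pass and subpass index, not the framebuffer.
void recordSceneSecondary(VkCommandBuffer secondary, VkRenderPass renderPass, uint32_t subpassIndex)
{
    VkCommandBufferInheritanceInfo inheritance{VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO};
    inheritance.renderPass  = renderPass;
    inheritance.subpass     = subpassIndex;
    inheritance.framebuffer = VK_NULL_HANDLE;   // may be left null; the submission thread owns it

    VkCommandBufferBeginInfo begin{VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO};
    begin.flags = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT;
    begin.pInheritanceInfo = &inheritance;

    vkBeginCommandBuffer(secondary, &begin);
    // ... vkCmdBindPipeline / vkCmdDraw etc. for the actual scene work ...
    vkEndCommandBuffer(secondary);
}

// Submission thread: build the primary CB once the framebuffer (swapchain image) is known.
// rpInfo.framebuffer is filled in here, on this thread.
void recordPrimary(VkCommandBuffer primary, const VkRenderPassBeginInfo& rpInfo,
                   const VkCommandBuffer* secondaries, uint32_t count)
{
    VkCommandBufferBeginInfo begin{VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO};
    vkBeginCommandBuffer(primary, &begin);

    // The subpass contents must be declared as secondary command buffers.
    vkCmdBeginRenderPass(primary, &rpInfo, VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);
    vkCmdExecuteCommands(primary, count, secondaries);
    vkCmdEndRenderPass(primary);

    vkEndCommandBuffer(primary);
}
```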
You can use multiple threads to generate draw commands for the same renderpass using secondary command buffers. And you can generate work for different renderpasses in the same frame in parallel -- only the very last pass (usually a postprocess pass) depends on the specific swapchain image, all your shadow passes, gbuffer/shading/lighting passes, and all but the last postprocess pass don't. It's not required, but it's often a good idea to not even call vkAcquireNextImageKHR until you're ready to start generating the final renderpass, after you've already generated many of the prior passes.
First, to be clear:
In order to begin issuing vkCmd*s, you need to begin a render pass.
That is not necessarily true. In command buffers You can record many different commands, all of which begin with vkCmd. Only some of these commands need to be recorded inside a render pass - the ones that are connected with drawing. There are some commands which cannot be called inside a render pass (for example, dispatching compute shaders). But this is just a side note to sort things out.
Next thing - the mentioned triple buffering. In Vulkan, the way images are displayed depends on the supported present mode. Different hardware vendors, or even different driver versions, may offer different present modes, so on one piece of hardware You may get the present mode that is most similar to triple buffering (MAILBOX), but on another You may not. And the present mode impacts the way the presentation engine allows You to acquire images from a swapchain and then displays them on screen. But as You noted, You cannot depend on the order of returned images, so You shouldn't design Your application as if You always get the same behavior on all platforms.
But to answer Your question - the easiest, naive way is to call vkAcquireNextImageKHR() at the beginning of a frame, record command buffers that use the image returned by it, submit those command buffers and present the image. You can create framebuffers on demand, just before You need to use one inside a command buffer: You create a framebuffer that uses the appropriate image (the one associated with the index returned by the vkAcquireNextImageKHR() function), and after the command buffers are submitted and stop using it, You destroy it. Such behavior is presented in the Vulkan Cookbook: here and here.
A more appropriate way would be to prepare framebuffers for all available swapchain images and take the appropriate one during a frame. But You need to remember to recreate them when You recreate the swapchain.
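A rough sketch of that second option. The `device`, `renderPass`, image views, extent, swapchain and semaphore are assumed to come from your swapchain setup; a depth attachment would be added to the framebuffer the same way:

```cpp
#include <vulkan/vulkan.h>
#include <vector>

// Create one framebuffer per swapchain image up front; recreate them together with the swapchain.
std::vector<VkFramebuffer> createFramebuffers(VkDevice device, VkRenderPass renderPass,
                                              const std::vector<VkImageView>& swapchainImageViews,
                                              VkExtent2D extent)
{
    std::vector<VkFramebuffer> framebuffers(swapchainImageViews.size());
    for (size_t i = 0; i < swapchainImageViews.size(); ++i) {
        VkImageView attachments[] = { swapchainImageViews[i] };

        VkFramebufferCreateInfo fbInfo{VK_STRUCTURE_TYPE_FRAMEBUFFER_CREATE_INFO};
        fbInfo.renderPass      = renderPass;
        fbInfo.attachmentCount = 1;
        fbInfo.pAttachments    = attachments;
        fbInfo.width           = extent.width;
        fbInfo.height          = extent.height;
        fbInfo.layers          = 1;
        vkCreateFramebuffer(device, &fbInfo, nullptr, &framebuffers[i]);
    }
    return framebuffers;
}

// Per frame: whatever index the acquire returns, the matching framebuffer already exists.
VkFramebuffer framebufferForThisFrame(VkDevice device, VkSwapchainKHR swapchain,
                                      VkSemaphore imageAvailable,
                                      const std::vector<VkFramebuffer>& framebuffers,
                                      uint32_t* imageIndex)
{
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, imageAvailable, VK_NULL_HANDLE, imageIndex);
    return framebuffers[*imageIndex];
}
```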
More advanced scenarios would postpone acquiring a swapchain image until it is really needed. The vkAcquireNextImageKHR() function call may block Your application (waiting until an image is available), so it should be called as late as possible when You prepare a frame. That's why You should first record command buffers that don't need to reference swapchain images (for example, those that render geometry into a G-buffer in deferred shading algorithms). After that, when You want to display the image on screen (for example after some postprocessing technique), You just take the approach described above: acquire an image, prepare the appropriate command buffer(s) and present the image.
You can also pre-record command buffers that reference particular swapchain images. If You know that the source of Your images will always be the same (like the mentioned G-buffer), You can have a set of command buffers that always perform some postprocess/copy-like operations from this data to all swapchain images - one command buffer per swapchain image. Then, during the frame, if all of Your data is set, You acquire an image, check which pre-recorded command buffer is appropriate and submit the one associated with the acquired image.
There are multiple ways to achieve what You want, all of them depend on many factors - performance, platform, specific goal You want to achieve, type of operations You perform in Your application, synchronization mechanisms You implemented and many other things. You need to figure out what best suits You. But in the end - You need to reference a swapchain image in command buffers if You want to display image on screen. I'd suggest starting with the easiest option first and then, when You get used to it, You can improve Your implementation for higher performance, flexibility, easier code maintenance etc.
You can call vkAcquireNextImageKHR in any thread, as long as you make sure that access to the swapchain, semaphore and fence you pass to it is synchronized.
There is nothing else restricting you from calling it in any thread, including the recording thread.
You are also allowed to have multiple images acquired at a time, assuming you have created enough of them. In other words, acquiring the next image before you present the current one is allowed.

How to isolate my own OpenGL calls inside a third-party process?

I am writing a small tool that draws an OpenGL overlay on top of a game which is closed source. The game uses SDL, so I am just hooking into SDL_GL_SwapWindow and doing my own stuff. However, this kind of hooking results in some side effects in the game itself. I found a solution that basically wraps my own calls with the deprecated glPushAttrib/glPopAttrib. But this solves only half of the problems. I am still getting random texture flickering in the game (I mean game textures; mine are showing fine). What could be the reason for this flickering? Can my own textures interfere with game textures? Do I need to isolate my own calls, and how can I do it?
What could be the reason of this flickering?
If the game uses shaders, then glPushAttrib / glPopAttrib will not take care of all the state you may be clobbering. The attribute stack has been deprecated, and the program may use state that is either not covered by it, or where certain attribute bits in the compatibility profile have been reused or expanded to cover further state. I recommend not using the attribute stack at all, because it's hard to get right.
Can my own textures interfere with game textures?
Yes. Say you left a 2D texture active in a texture unit that's later being used for a 1D texture. If the host program does not use shaders, then the GL_TEXTURE_2D will take precedence over the GL_TEXTURE_1D. It's an (IMHO poor) design choice of OpenGL that you can have multiple texture targets bound to the same texture unit at the same time, and which one is used to deliver texels depends on the individual targets' precedence.
Do I need to isolate my own calls
Yes.
and how can I do it?
Two possible solutions:
Create a separate OpenGL context for just your own stuff. Use {wgl,glX}GetCurrentContext and {wglGetCurrentDC,glXGetCurrentDrawable} to retrieve the OpenGL context and drawable active at the moment you're "jumping" in. If you don't have a context already, you can use the drawable just retrieved to create a matching OpenGL context, optionally with object namespace sharing. Switch to your context, draw your stuff and switch back to the host program's context. – Major drawback: switching OpenGL contexts is quite expensive.
Before switching state around, use glGet… to retrieve the currently active state, and restore that old state before returning to the host program.
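A sketch of that second approach, saving and restoring only a few illustrative pieces of state; a real overlay has to cover everything it touches, and `drawOverlay()` is a stand-in for your own rendering:

```cpp
#include <GL/glew.h>

void drawOverlay();  // hypothetical: your own program/texture/VAO bindings happen in here

void hookedSwap()
{
    // Save the pieces of GL state we are about to change.
    GLint prevProgram = 0, prevTexture2D = 0, prevVAO = 0, prevActiveTexture = 0;
    GLboolean prevBlend = glIsEnabled(GL_BLEND);
    glGetIntegerv(GL_CURRENT_PROGRAM, &prevProgram);
    glGetIntegerv(GL_ACTIVE_TEXTURE, &prevActiveTexture);
    glGetIntegerv(GL_TEXTURE_BINDING_2D, &prevTexture2D);
    glGetIntegerv(GL_VERTEX_ARRAY_BINDING, &prevVAO);

    drawOverlay();

    // Put everything back exactly as the host program left it.
    glUseProgram(prevProgram);
    glActiveTexture((GLenum)prevActiveTexture);
    glBindTexture(GL_TEXTURE_2D, prevTexture2D);
    glBindVertexArray(prevVAO);
    if (prevBlend) glEnable(GL_BLEND); else glDisable(GL_BLEND);
}
```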

Compute Shader basics in dx11

I am about to add compute shader support to my codebase and having problems finding answers to some pretty basic questions:
All documentation out there says that the compute shader pipeline runs independently from the rendering pipeline; however, all dx11 sample code uses the device context interface to set the shader itself and the resource views and to call the Dispatch() method. So do these get queued up in the command buffer with the rest of the rendering commands, or do they get executed independently?
Following up on question 1, can I invoke compute shaders from multiple threads or do I need to buffer all compute shader commands and issue them on the thread that the immediate device context was created on?
Synchronization. Most articles use the CopyResource command, which will automatically synchronize compute shader completion and give CPU access to the results, but it seems like that would block the GPU as well. Is there a more efficient way to synchronize?
I know I could find answers to this by experimenting, but any help that saves me time would be appreciated.
The Compute Shader pipeline runs independently from the Rendering pipeline, i.e. vertex shaders, pixel shaders, blend states, etc. have no effect on what happens when you call Dispatch(). However, they do go into the same queue, so ordering between calls to Draw and Dispatch is preserved.
All calls to the immediate context must be done from a single thread.
One common approach is to use two buffers. While one is being operated on with the compute shader, the other is being copied back and read by the CPU. Most GPUs will be able to parallelize this.
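A sketch of that double-buffering idea, assuming two result buffers and matching staging buffers created elsewhere with the usual DEFAULT/STAGING usage flags; the names and the dispatch size are placeholders:

```cpp
#include <d3d11.h>

// Ping-pong between two result buffers so the GPU can keep computing while the CPU reads
// back last frame's results.
void runFrame(ID3D11DeviceContext* context, ID3D11ComputeShader* computeShader,
              ID3D11UnorderedAccessView* uav[2], ID3D11Buffer* gpuBuf[2],
              ID3D11Buffer* stagingBuf[2], int frame)
{
    int write = frame & 1;   // buffer the compute shader writes this frame
    int read  = write ^ 1;   // buffer whose previous results the CPU reads back

    // Queue the compute work; it goes into the same command stream as Draw calls.
    context->CSSetShader(computeShader, nullptr, 0);
    context->CSSetUnorderedAccessViews(0, 1, &uav[write], nullptr);
    context->Dispatch(64, 1, 1);

    // Copy the other buffer (written last frame) into a staging resource and map it.
    // Because the GPU finished that work a frame ago, the Map should barely stall.
    context->CopyResource(stagingBuf[read], gpuBuf[read]);
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(context->Map(stagingBuf[read], 0, D3D11_MAP_READ, 0, &mapped)))
    {
        // ... consume mapped.pData (results produced one frame ago) ...
        context->Unmap(stagingBuf[read], 0);
    }
}
```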

How can you draw primitives in OpenGL interactively?

I'm having a rough time trying to set up this behavior in my program.
Basically, I want it so that when the user presses the "a" key, a new sphere is displayed on the screen.
How can you do that?
I would probably do it by simply having some kind of data structure (array, linked list, whatever) holding the current "scene". Initially this is empty. Then when the event occurs, you create some kind of representation of the new desired geometry, and add that to the list.
On each frame, you clear the screen and go through the data structure, mapping each representation into a suitable set of OpenGL commands. This is really standard.
The data structure is often referred to as a scene graph, it is often in the form of a tree or graph, where geometry can have child-geometries and so on.
If you're using the GLuT library (which is pretty standard), you can take advantage of its automatic primitive generation functions, like glutSolidSphere. You can find the API docs here. Take a look at section 11, 'Geometric Object Rendering'.
As unwind suggested, your program could keep some sort of list, but of the parameters for each primitive, rather than the actual geometry. In the case of the sphere, this would be position/radius/slices. You can then use the GLuT functions to easily draw the objects. Obviously this limits you to what GLuT can draw, but that's usually fine for simple cases.
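Putting the two answers together, a minimal GLUT sketch might look like this; the sphere position/radius and tessellation are arbitrary placeholders, and there is no camera or lighting setup:

```cpp
#include <GL/glut.h>
#include <vector>

// Keep only the parameters per sphere; the geometry is generated by glutSolidSphere each frame.
struct Sphere { float x, y, z, radius; };
std::vector<Sphere> scene;

void display()
{
    glClear(GL_COLOR_BUFFER_BIT);
    for (const Sphere& s : scene) {
        glPushMatrix();
        glTranslatef(s.x, s.y, s.z);
        glutSolidSphere(s.radius, 32, 32);   // slices/stacks chosen arbitrarily
        glPopMatrix();
    }
    glutSwapBuffers();
}

void keyboard(unsigned char key, int, int)
{
    if (key == 'a') {
        scene.push_back({0.0f, 0.0f, 0.0f, 0.5f});  // placeholder: every sphere at the origin
        glutPostRedisplay();                        // ask GLUT to call display() again
    }
}

int main(int argc, char** argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
    glutCreateWindow("spheres");
    glutDisplayFunc(display);
    glutKeyboardFunc(keyboard);
    glutMainLoop();
}
```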
Without some more details of what environment you are using it's difficult to be specific, but here are a few pointers to things that can easily go wrong when setting up OpenGL:
Make sure you have the camera set up to look at the point where you are drawing the sphere. This can be surprisingly hard, and the simplest approach is to use gluLookAt from the OpenGL Utility Library. Make sure your front and back (near and far) planes are set to sensible values.
Turn off backface culling, at least to start with. Sure, in production code backface culling gives you a quick performance gain, but it's remarkably easy to get the winding order or normals wrong on an object and not see it because you're looking at the invisible face.
Remember to call glFlush to make sure that all commands are executed. Drawing to the back buffer and then failing to swap buffers (glutSwapBuffers with GLUT, or the platform's equivalent) is also a common mistake.
Occasionally you can run into issues with buffer formats - although if you copy from sample code that works on your system this is less likely to be a problem.
Graphics coding tends to be quite straightforward to debug once you have the basic environment correct, because the output is visual, but setting up the rendering environment on a new system can always be a bit tricky until you have that first cube or sphere rendered. I would recommend obtaining a sample or template and modifying that to start with, rather than trying to set up the rendering window from scratch. Using GLUT to check out first drafts of OpenGL calls is a good technique too.