The OpenGL specification on sync objects leaves me unclear whether to use glGetSynciv or glClientWaitSync when I want to check whether a sync object has been signaled without waiting. How do the following two commands compare in terms of behavior and performance:
GLint syncStatus;
glGetSynciv(*sync, GL_SYNC_STATUS, sizeof(GLint), NULL, &syncStatus);
bool finished = syncStatus == GL_SIGNALED;
vs
bool finished = glClientWaitSync(*sync, 0 /*flags*/, 0 /*timeout*/) == GL_ALREADY_SIGNALED;
Some details to the questions:
Does glGetSynciv perform a roundtrip to the GL server?
Is any method preferred in terms of driver support / bugs?
Could either method deadlock or not return immediately?
Some context:
This is for a video player, which is streaming images from a physical source to the GPU for rendering.
One thread continuously streams / uploads textures and another thread renders them once they have finished uploading. Each render frame we check whether the next texture has finished uploading. If it has, we start rendering this new texture; otherwise we continue using the old one.
The decision is client side only and I do not want to wait at all, but quickly continue to render the correct texture.
Both methods have examples of people using them for the purpose of not waiting, but none seem to discuss the merits of using one or the other.
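For illustration, the per-frame decision I mean looks roughly like this (a sketch with made-up names; IsFenceSignaled stands for whichever of the two calls above):
// Render thread, once per frame (hypothetical names).
// next_frame holds the texture being uploaded and the fence created after its upload.
if (next_frame.fence != nullptr && IsFenceSignaled(next_frame.fence)) {
    glDeleteSync(next_frame.fence);        // done with the fence
    next_frame.fence = nullptr;
    current_texture = next_frame.texture;  // switch to the freshly uploaded texture
}
render(current_texture);                   // otherwise keep rendering the old texture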
Quoting the Red Book,
void glGetSynciv(GLsync sync, GLenum pname, GLsizei bufSize, GLsizei *length, GLint *values);
Retrieves the properties of a sync object. sync specifies a handle to the sync object from which to read the property specified by pname. bufSize is the size in bytes of the buffer whose address is given in values. length is the address of an integer variable that will receive the number of bytes written into values.
While for glClientWaitSync:
GLenum glClientWaitSync(GLsync sync, GLbitfield flags, GLuint64 timeout);
Causes the client to wait for the sync object to become signaled.
glClientWaitSync() will wait at most timeout nanoseconds for the object to become signaled before generating a timeout. The flags parameter may be used to control the flushing behavior of the command. Specifying GL_SYNC_FLUSH_COMMANDS_BIT is equivalent to calling glFlush() before executing the wait.
So, basically, glGetSynciv() is used to find out whether the fence object has become signaled, while glClientWaitSync() is used to wait until the fence object becomes signaled.
If you only want to know if a fence object has become signaled, I would suggest using glGetSynciv().
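For instance (a minimal sketch; the helper names are mine, with the glClientWaitSync variant included for comparison):
// Non-blocking poll via glGetSynciv: read the sync object's status property.
bool IsSignaledViaGet(GLsync sync)
{
    GLint status = GL_UNSIGNALED;
    glGetSynciv(sync, GL_SYNC_STATUS, sizeof(status), nullptr, &status);
    return status == GL_SIGNALED;
}

// Non-blocking poll via glClientWaitSync: zero timeout, no flush flag.
bool IsSignaledViaWait(GLsync sync)
{
    GLenum result = glClientWaitSync(sync, 0, 0);
    return result == GL_ALREADY_SIGNALED || result == GL_CONDITION_SATISFIED;
}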
Obviously glClientWaitSync() should take longer to execute than glGetSynciv(), but I'm only guessing.
Hope I helped you.
Related
I heard of PBOs in OpenGL; they are pretty neat for texture loading / uploading. Using PBOs involves synchronisation fences, as in this PBO tutorial. I've tried this technique and it's a great replacement for glTexImage and glGetTexImage in the case of big images. Now I want to apply the same approach to other loading / uploading routines (and possibly some others).
If I understand the OpenGL client-server model correctly, it works as follows:
(Italics are the things I am not sure of)
The client (my program) 'asks' the OGL context to place new commands into the OGL command queue. It does so simply by calling the gl* functions in the order it wants them executed;
If command flushing is enabled (it is by default), commands are immediately flushed to the GPU (e.g. via PCI). Otherwise they are placed in some buffer and flushed later when needed (a call to glFlush does this);
The GPU (server) receives commands from the OGL context, executes them in the desired order, and changes object / context state;
The GPU sends back any reply the context (client) asked for;
The context replies to the client with the data received from the server.
Fences may be used to indicate whether the GPU is done executing previous commands.
(2) implies that command execution is not necessarily done instantly. One can, for example, block command flushing by calling glWaitSync, place new commands into the queue and then call glFlush. The commands will be flushed to the GPU and executed asynchronously (independently of the client). While the GPU is busy executing the given commands, the CPU can focus on other things (e.g. sending data to a remote TCP server, receiving input from the user, or pretty much anything else). When the CPU needs to work with the OGL context again, it can wait until the GPU is done with the previous job by calling glClientWaitSync, place new commands in the queue, and the cycle repeats.
Based on all of the above, in the case of that PBO tutorial, the OGL context receives data from the program, buffers it and then sends it to the GPU. Sending large amounts of data takes time, hence a fence is used to know when the transfer is complete.
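To make the pattern concrete, here is roughly what I mean (my own simplified sketch of the tutorial's approach; pbo, pixels, width and height are assumed to exist, and a texture is assumed to be bound):
// Stage the pixel data in the PBO, start the texture upload from it,
// then fence the transfer so completion can be polled later.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, width * height * 4, pixels, GL_STREAM_DRAW);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, nullptr);  // sources from the bound PBO

GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();  // make sure the commands actually reach the GPU

// ... other CPU work ...

// Poll: once the fence is signaled, the upload has finished on the GPU.
GLint status = GL_UNSIGNALED;
glGetSynciv(fence, GL_SYNC_STATUS, sizeof(status), nullptr, &status);
bool upload_done = (status == GL_SIGNALED);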
However, the Khronos wiki says that only rendering functions are asynchronous. I understand that: rendering also takes time. But then why does the PBO example above work? The image upload to the GPU is not instant, and the fence is not signaled instantly. Indeed, the time it takes to finish uploading depends on how big the image is (I tested it with different image sizes).
Another example: I send source code for a shader with glShaderSource and then call glCompileShader. Then I immediately check GL_COMPILE_STATUS with glGetShaderiv. If the shader is not yet compiled when I call glGetShaderiv (it simply did not have enough time to compile), is it possible that GL_COMPILE_STATUS will state that the shader is not compiled? Or is it guaranteed that GL_COMPILE_STATUS is only returned after the compilation? Or is the compilation performed on the CPU without needing to communicate with the GPU (i.e. compilation does not place any commands in the GPU queue)? It has never actually happened to me that shader compilation failed due to time limits; it has only failed because of bad shader code.
The questions are:
Is my understanding of the OGL client-server model correct or does it need some adjustments?
If not all functions can be executed asynchronously, what are those functions exactly?
Why does the wiki say that only render actions may be performed asynchronously?
If command execution may indeed be 'deferred' (not really the right word for it...) with glWaitSync for example, how can I upload and compile shaders the same way images are uploaded in the PBO example? Or how can I perform VBO / EBO uploads the same way? Or UBO? Or TBO? Non-buffer objects? Is it just uploading and then waiting for the fence to be signaled?
In case it matters, I use OGL with latest GLFW github release, latest GLAD (configuration) and C++ (MinGWx64 11.2.0).
UPD: I found this answer that touches on the subject of my question. However, I must specify that my question is not about where and how OGL functions are executed; it's about how to control the flow of them, i.e. control when they are executed, to perform asynchronous work on the GPU and CPU if that is even possible (it seems to be, if I understand the wiki page correctly).
I have an application that issues about 100 draw calls per frame, each with an individual VBO. The VBOs are uploaded via glBufferData in a separate thread that has GL context resource sharing. The render thread tests the buffer upload state via glClientWaitSync.
Now my question:
According to the documentation, glClientWaitSync with GL_SYNC_FLUSH_COMMANDS_BIT causes a flush at every call, right? This would mean that for every not-yet-finished glBufferData in the upload thread, I would get dozens of flushes in the render thread. What impact on performance would it have if, in the worst case, I thus practically issue a flush before every draw call?
The behavior of GL_SYNC_FLUSH_COMMANDS_BIT has changed from its original specification.
In the original, the use of that bit was the equivalent of issuing a glFlush before doing the wait.
However, GL 4.5 changed the wording. Now, it is the equivalent of having performed a flush immediately after you submitted that sync object. That is, instead of doing a flush relative to the current stream, it works as if you had flushed after submitting the sync. Thus, repeated use does not mean repeatedly flushing.
You can of course get the equivalent behavior by manually issuing a flush after you submit the sync object, then not using GL_SYNC_FLUSH_COMMANDS_BIT when you wait for it.
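In code, that alternative looks something like this (a sketch; the two-thread split is taken from the question):
// Upload thread: fence the upload, then flush once, right after the fence.
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();  // replaces GL_SYNC_FLUSH_COMMANDS_BIT on the waiting side

// Render thread: poll without the flush bit; no repeated flushing occurs.
GLenum result = glClientWaitSync(fence, 0 /* no flags */, 0 /* timeout */);
bool buffer_ready = (result == GL_ALREADY_SIGNALED || result == GL_CONDITION_SATISFIED);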
I am using exactly 3 images for the swapchain and one VkCommandBuffer (CB) per swapchain image. GPU synchronization is done with two semaphores, one for presentation_finished and one for rendering_finished. The present mode is VK_PRESENT_MODE_MAILBOX_KHR (quick overview).
Now when I am running my example without waiting for any CB fences, the validation layers report this error as soon as any swapchain image is used for the second time:
Calling vkBeginCommandBuffer() on active CB before it has completed. You must check CB fence before this call.
At first sight this seems reasonable to me, as processing the commands from the CB might simply not be finished yet. But the more I think about it, the more I conclude that this should not happen at all.
My current understanding is that when vkAcquireNextImageKHR returns a specific image index, it implies that the returned image must be finished with rendering.
That is because I'm passing the rendering_finished semaphore to vkQueueSubmit to be signaled when rendering finishes, and to vkQueuePresentKHR to wait until it becomes signaled before presenting the image.
The specification of VkQueuePresentInfoKHR says:
pWaitSemaphores, if not VK_NULL_HANDLE, is an array of VkSemaphore objects with waitSemaphoreCount entries, and specifies the semaphores to wait for before issuing the present request
Meaning: I will never present any image which hasn't finished rendering, and thus the associated CB cannot be in use any more as soon as the image is presented.
The second semaphore, presentation_finished, is signaled by vkAcquireNextImageKHR and passed to the same vkQueueSubmit (to start the rendering). This means the rendering of any image will start no earlier than allowed by the presentation engine.
To conclude: a present request from vkQueuePresentKHR is not issued before the rendering of the image is finished, and vkAcquireNextImageKHR blocks until an image is available and will never return an image that is currently acquired.
What am I missing that makes a fence necessary?
I have included a minimal code example containing only the conceptually important parts to illustrate the problem.
VkImage swapchain_images[3];
VkCommandBuffer command_buffers[3];
VkSemaphore rendering_finished;
VkSemaphore presentation_finished;
void RenderLoop()
{
/* Acquire an image from the swapchain. Block until one is available.
Signal presentation_finished when we are allowed to render into the image */
uint32_t index;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, presentation_finished, VK_NULL_HANDLE, &index);
/* (...) Framebuffer creation, etc. */
/* Begin CB: The command pool is flagged to reset the command buffer on reuse */
VkCommandBuffer cb = command_buffers[index];
vkBeginCommandBuffer(cb, ...);
/* (...) Trivial rendering of a single color image */
/* End CB */
vkEndCommandBuffer(cb);
/* Queue the rendering and wait for presentation_finished.
When rendering is finished, signal rendering_finished.
The VkSubmitInfo will have these important members set among others:
.pWaitSemaphores = &presentation_finished;
.pSignalSemaphores = &rendering_finished;
*/
vkQueueSubmit(render_queue, 1, &submit_info, VK_NULL_HANDLE);
/* Submit the presentation request as soon as the rendering_finished
semaphore gets signalled
The VkPresentInfoKHR will have these important members set among others:
.pWaitSemaphores = &rendering_finished;
*/
vkQueuePresentKHR(present_queue, &present_info);
}
Inserting a fence when submitting the CB to the rendering queue and waiting on it before using that CB again obviously fixes the issue, but - as explained - seems redundant.
vkAcquireNextImageKHR is allowed to return an image that is still the destination and/or source of ongoing asynchronous operations. This means you have no guarantee the command buffer is available at time of reuse. It would be correct to enqueue additional, distinct command buffers to write to the acquired image, as long as those commands are configured to wait on the presentation_finished semaphore; but to safely reuse that command buffer you must wait on the fence passed to vkQueueSubmit.
See section 29.6. WSI Swapchain in the Vulkan spec with KHR extensions:
An application can acquire use of a presentable image with vkAcquireNextImageKHR. After acquiring a presentable image and before modifying it, the application must use a synchronization primitive to ensure that the presentation engine has finished reading from the image. The application can then transition the image’s layout, queue rendering commands to it, etc. Finally, the application presents the image with vkQueuePresentKHR, which releases the acquisition of the image.
See also these notes for vkAcquireNextImageKHR
When successful, vkAcquireNextImageKHR acquires a presentable image that the application can use, and sets pImageIndex to the index of that image. The presentation engine may not have finished reading from the image at the time it is acquired, so the application must use semaphore and/or fence to ensure that the image layout and contents are not modified until the presentation engine reads have completed.
[...]
As mentioned above, the presentation engine may be asynchronous with respect to the application and/or logical device. vkAcquireNextImageKHR may return as soon as it can identify which image will be acquired, and can guarantee that semaphore and fence will be signaled by the presentation engine; and may not successfully return sooner. The application uses timeout to specify how long vkAcquireNextImageKHR waits for an image to become acquired.
This shows that vkAcquireNextImageKHR is not required to block on the presentation operation, and transitively is not required to block on the graphics command that the presentation operation is itself waiting on.
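A minimal sketch of the fence-based fix (hypothetical names; one fence per swapchain image, created signaled so the first use does not stall):
// Creation (once): one fence per command buffer, initially signaled.
VkFenceCreateInfo fence_info = {};
fence_info.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
fence_info.flags = VK_FENCE_CREATE_SIGNALED_BIT;
vkCreateFence(device, &fence_info, nullptr, &cb_fences[i]);

// Per frame: wait until the GPU is done with this command buffer,
// reset the fence, re-record, and hand the fence back to the submit.
vkWaitForFences(device, 1, &cb_fences[index], VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &cb_fences[index]);
/* vkBeginCommandBuffer / record / vkEndCommandBuffer as before */
vkQueueSubmit(render_queue, 1, &submit_info, cb_fences[index]);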
I have some big performance issues here, so I would like to take some measurements on the GPU side.
After reading this thread, I wrote the following code around my draw functions, including the GL error check and the swapBuffers() call (auto-swapping is indeed disabled):
gl4.glBeginQuery(GL4.GL_TIME_ELAPSED, queryId[0]);
{
draw(gl4);
checkGlError(gl4);
glad.swapBuffers();
}
gl4.glEndQuery(GL4.GL_TIME_ELAPSED);
gl4.glGetQueryObjectiv(queryId[0], GL4.GL_QUERY_RESULT, frameGpuTime, 0);
And since OpenGL rendering commands are supposed to be asynchronous (the driver can buffer up to X commands before sending them all together in one batch), my questions are essentially whether:
the code above is correct
I am right in assuming that at the beginning of a new frame all the previous GL commands (from the previous frame) have been sent, executed and completed on the GPU
I am right in assuming that when I fetch the query result with glGetQueryObjectiv and GL_QUERY_RESULT, all the GL commands so far have completed? That is, will OpenGL wait until the result becomes available?
Yes, when you query the timer it will block until the data is available, i.e. until the GPU is finished with everything that happened between beginning and ending the query. To avoid synchronising with the GPU, you can use GL_QUERY_RESULT_AVAILABLE to check if the results are already available and only read them then. That might require less straightforward code to keep tabs on open queries and periodically check them, but it will have the least performance impact. Waiting for the value every time is a sure way to kill your performance.
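A non-blocking readback would look roughly like this (sketched against the C API; the JOGL calls map one-to-one):
// Ask whether the query result is ready before requesting it.
GLint available = 0;
glGetQueryObjectiv(queryId, GL_QUERY_RESULT_AVAILABLE, &available);
if (available) {
    GLuint64 gpu_time_ns = 0;
    glGetQueryObjectui64v(queryId, GL_QUERY_RESULT, &gpu_time_ns);
    // record gpu_time_ns; the query object can now be reused
}
// else: leave the query pending and poll again next frame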
Edit: To address your second question, swapping the buffers doesn't necessarily mean it will block until the operation succeeds. You may see that behaviour, but it's just as likely that it is just an implicit glFlush and the command buffer is not empty yet. That is also the more desirable behaviour, because ideally you want to start on your next frame right away and keep the command buffer filled. Check the implementation's documentation for more info though, as this is implementation defined.
Edit 2: Checking for errors might end up being an implicit synchronization, by the way, so you will probably see the command buffer emptying when you wait for error checking in the command stream.
I am trying to find scenarios where sync objects can be used in OpenGL. My understanding is that a sync object, once put in the GL command stream (using glFenceSync()), will be signaled after all the preceding GL commands are executed and realized.
If sync objects are synchronization primitives, why can't we MANUALLY signal them? Where exactly can this functionality help a GL programmer?
Is the following scenario a correct one ?
Thread 1 :
Load model
Draw()
glFenceSync()
Thread 2 :
glWaitSync();
ReadPixels
Use data for subsequent operation.
Does this mean that I can't launch thread 2 until glFenceSync() has been called in Thread 1?
Fences are not so much meant to synchronize threads, but to know when asynchronous operations have finished. For example, if you do a glReadPixels into a pixel buffer object (PBO), you might want to know that the read has completed before you attempt to read from the PBO or map it into client address space.
If you do a glReadPixels with a PBO as the target, the call returns immediately, but the data transfer may indeed take some time. That's where fences come in handy.
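For example (a rough sketch; pbo, width and height are assumed):
// Kick off an asynchronous readback into the PBO; this call returns immediately.
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

// The fence is signaled once the transfer has actually completed on the GPU.
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();

// Later: only map the PBO once the fence is signaled, so the map won't stall.
if (glClientWaitSync(fence, 0, 0) == GL_ALREADY_SIGNALED) {
    void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    // ... use the pixel data ...
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    glDeleteSync(fence);
}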