I was wondering if I could assume that all buffer related GPU operations such as:
glDrawElements
glBufferData
glSubBufferData
glUnmapBuffer
are guaranteed to be completed after swap buffer is performed (i.e. frame is finished) assuming vsync is on.
I'm confused, as I've come across implementations of vertex streaming techniques, such as round-robin VBOs, which imply that a VBO could still be in use during the next frame.
What I basically want to do is stream vertices through glMapBufferRange with GL_UNSYNCHRONIZED_BIT, managing the correct ranges myself so that writes and reads never overlap. This would work very well if I could just assume synchronization and reset the stream range index at the end of the frame.
In other words, does swapping buffers with vsync guarantee synchronization?
glDrawElements glBufferData glSubBufferData glUnmapBuffer are guaranteed to be completed after swap buffer is performed (i.e. frame is finished) assuming vsync is on.
No; that would be terrible for performance. That would basically impose a full GPU/CPU synchronization simply because someone wants to display a new image. Even though both the production of that image and its display are GPU processes (or at least, not necessarily synchronous with your CPU thread/process).
The point of vsync is to ensure that the new image doesn't get swapped in until the vertical synchronization period, to avoid visual tearing of the image, where half of the display comes from the old and half from the new. This is not about ensuring that anything has actually completed on the GPU relative to CPU execution.
If you are streaming data into buffer objects via persistent mapping (which should be preferred over older "unsynchronized" shenanigans), then you need to perform the synchronization yourself. Set a fence sync object after you have issued the rendering commands that will use data from the buffer region you wrote to. Then, when it comes time to write to that buffer region again, check the fence sync and wait until it's available. This also gives you the freedom to expand the number of such buffer regions if rendering is consistently delayed.
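In code, that per-region fencing might look roughly like the following sketch. It assumes an active OpenGL 3.2+ context and a persistently mapped buffer split into regions; all names here (`NUM_REGIONS`, `regionFence`, the function names) are hypothetical, not from the question's code:

```c
/* One fence per streaming region; names are illustrative only. */
GLsync regionFence[NUM_REGIONS]; /* zero-initialized at startup */

void wait_for_region(int i)
{
    if (!regionFence[i])
        return; /* region has never been used yet */
    /* Block until the GPU has finished the draws that read region i. */
    while (glClientWaitSync(regionFence[i], GL_SYNC_FLUSH_COMMANDS_BIT,
                            1000000 /* 1 ms per wait */) == GL_TIMEOUT_EXPIRED)
        ; /* could do useful CPU work here instead of spinning */
    glDeleteSync(regionFence[i]);
    regionFence[i] = 0;
}

void fence_region(int i)
{
    /* Call this right after issuing the draws that read region i. */
    regionFence[i] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
```

Each frame you wait on the region you are about to overwrite, write into the persistent mapping at that region's offset, draw, and fence it again; if the wait blocks regularly, that is the signal to add more regions.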
Related
Given a single OpenGL context (and therefore can only be accessed by a single CPU thread at a time), if I execute two OpenGL commands, is there a guarantee that the second command will see the results of the first?
In the vast majority of cases, this is true. OpenGL commands largely behave as if all prior commands had fully completed. Notable places where this matters include:
Blending. Blending is often very sensitive to order. Blending not only works correctly between rendering commands, it works correctly within a rendering command: triangles in a draw call are explicitly ordered, and blending will blend things in the order that the primitives appear in the draw call.
Reading from a previously rendered framebuffer image. If you render to an image, you can unbind that framebuffer and bind the image as a texture and read from it, without doing anything special.
Reading data from a buffer that was used in a transform feedback operation. Nothing special needs to go between the command that generates the feedback data and the command reading it (outside of unbinding the buffer from the TF operation and binding it in the proper target for reading).
Obviously, waiting for the GPU to complete its commands before letting the CPU send more sounds incredibly slow. This is why OpenGL works under the "as if" rule: implementations must behave "as if" they were synchronous. So implementations spend a lot of time tracking which operations will produce which data, so that if you do something that will require something to wait on the GPU to produce that data, it can do so.
So you should try to avoid immediately trying to read data generated by some command. Put some distance between the generator and the consumer.
Now, I said above that this is true for "the vast majority of cases". So there are some back-doors. In no particular order:
Attempting to read from an image that you are currently using as a render target is normally forbidden. But under specific circumstances, it can be allowed, typically through the use of the glTextureBarrier command. This command ensures the execution and visibility of previously submitted rendering commands to subsequent commands. Failure to do this correctly results in undefined behavior.
The contents of buffers or images that are subject to writes (atomic or otherwise) from what we can call incoherent memory access operations. These include image store/atomic operations, SSBO store/atomic operations, and atomic counter operations. Unless you employ various tools, specific to the particulars of who is reading the data and their relationship to the writer, you will get undefined behavior.
Sync objects. By their nature, sync objects bypass the in-order execution model because... that's their point: to allow the user to be exposed directly to how the GPU executes stuff.
Asynchronous pixel transfers are an odd case. They don't actually break the in-order nature of the OpenGL memory model. But because you are reading into/writing from storage that you don't have direct access to, the implementation can hide the fact that it will take some time to read the data. So if you invoke a pixel transfer to a buffer, and then immediately try to read from the buffer, the system has to put a wait between those two commands. But if you issue a bunch of commands between the pixel transfer and the consumer of it, and those commands don't use the range being consumed, then the cost of the pixel transfer can appear to be negligible. Sync objects can be employed to know when the transfer is actually over.
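A sketch of that asynchronous readback pattern, using a fence to learn when the transfer is done (assumes an active context and a pixel pack buffer already allocated; `pbo`, `width`, and `height` are hypothetical names):

```c
/* Start an asynchronous readback into the bound pixel pack buffer. */
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE,
             (void *)0);                 /* offset into pbo, not a pointer */
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

/* ... submit plenty of unrelated work here ... */

/* Only map once the transfer is known to be complete. */
while (glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                        1000000) == GL_TIMEOUT_EXPIRED)
    ;
glDeleteSync(fence);
void *pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
/* ... consume pixels ... */
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
```

The more real work you put between the glReadPixels call and the map, the cheaper the transfer appears.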
I whipped up a simple C program (on github) that uses OpenGL to draw a bunch of triangles from a buffer that was allocated with glBufferStorage like so:
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
GLbitfield bufferStorageFlags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_ARRAY_BUFFER, vboSize, 0, bufferStorageFlags);
vert *triData = glMapBufferRange(GL_ARRAY_BUFFER, 0, vboSize, bufferStorageFlags);
I am aware that synchronization is my responsibility when using glBufferStorage with MAP_PERSISTENT_BIT, but I'm not sure exactly what I need to protect against.
The only time I touch triData is before calling glDrawArrays on it, and after calling SDL_GL_SwapWindow, so I know drawing of the last frame is done, and I haven't called for drawing of this frame to begin yet.
This appears to work perfectly, even with vsync disabled.
The wiki says:
Swapping the back and front buffers on the Default Framebuffer may
cause some form of synchronization ... if there are still commands
affecting the default framebuffer that have not yet completed.
Swapping buffers only technically needs to sync to the last command
that affects the default framebuffer, but it may perform a full
glFinish.
But every article I've read on the subject makes extensive use of GLsync
pointers, though maybe they were just assuming I might want to use the buffer in more complex ways?
For now, am I right to believe SDL_GL_SwapWindow is providing sufficient synchronization?
The previous answers are correct in saying that you do need synchronization even after you use a swap. But I wanted to make even clearer that this is more than just a theoretical concern.
Swap operations are typically not synchronous. It's very common to let the rendering get 1-2 frames ahead of the display. This is done to reduce "bubbles" where the GPU temporarily goes into an idle state. If your swap call were synchronous, the GPU would unavoidably be idle at the time it returns, since all previously submitted work would have completed. Even if you immediately started rendering again, it would take a little time for that work to actually reach the GPU for execution. So you have times where the GPU does nothing, which hurts performance at least as long as your rendering is entirely GPU limited.
Now, you obviously don't want the rendering to get too far ahead of the display. Undesired side effects of that would be increased latency in responding to user input (which is a big deal for games) and excessive memory usage for queued-up rendering commands. Therefore, there needs to be throttling before this happens. This throttling is often applied as part of swap operations, but it can potentially happen almost anywhere.
So if you measure the wall-clock time taken for a swap call to return, it's fairly common for it to be long enough to suggest that it's blocking. But this does not mean that the call itself is synchronous. It may just be blocking until a previous frame completes, to prevent the rendering from getting too far ahead of the display.
Here's my favorite advice about any multithreaded/asynchronous code:
If multithreaded code isn't immediately, obviously, provably correct then it is almost certainly wrong.
You cannot prove that OpenGL will not read from a value you are writing to. Therefore, it is wrong, even if no problems are apparent.
Yes, you need to do explicit synchronization. Even though you coherently mapped the buffer, you still cannot change the values in it while OpenGL might be reading from them. You must wait until after the last call that reads from that data before writing to it again. And the only ways OpenGL has to wait for that to finish are glFinish and glClientWaitSync.
I am aware that synchronization is my responsibility when using glBufferStorage,
No, not necessarily. A buffer created with glBufferStorage is no different from one created with glBufferData, except for the fact that you can't re-specify its storage.
You only need to do manual synchronization when mapping with the MAP_PERSISTENT_BIT (which was included in the same extension that glBufferStorage was, ARB_buffer_storage).
Through Visual Studio profiling, I see that nearly 50% of my program's execution time is spent in KernelBase.dll. What's it doing in there? I don't know, but the main thing calling it is nvoglv64.dll. To my understanding, nvoglv64.dll is the OpenGL driver. And the main thing that is calling nvoglv64.dll is one of my functions. This is the function:
void draw() {
    if (mapped)
    {
        mapped = false;
        glBindBuffer(GL_ARRAY_BUFFER, triangleBuffer);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
    glBindVertexArray(trianglesVAO);
    program.bind();
    glDrawElements(GL_TRIANGLES, ...);
}
The way I use this function is as follows: I asynchronously map a gl buffer to client memory, and fill it up with a large amount of triangles. Then I draw the buffer using the above function. Except, I use two buffers, each frame I swap which one is being drawn with and which one is being written to.
It's all supposed to be asynchronous. Even glUnmapBuffer and glDrawElements are supposed to be asynchronous; they should just get put in a command queue. What is causing the slowdown?
Through Visual Studio profiling, I see that nearly 50% of my program's execution time is spent in KernelBase.dll. What's it doing in there?
Mapping, unmapping and checking the status of memory as you would expect.
If you want everything to truly be asynchronous, and run the risk of clobbering data that has not been rendered yet, you can map unsynchronized buffers (see glMapBufferRange (...)).
Otherwise, there is some internal driver synchronization to prevent you from modifying memory that has not been drawn yet and that is what you are seeing here. You cannot keep everything asynchronous unless you have enough buffers to accommodate every pending frame.
Now, the thing here is that what you just described (2 buffers) is not adequate. You need multiple levels of buffering to prevent pipeline stalls - the driver is (usually) allowed to queue up more than 1 frame's worth of commands, and the CPU will not be allowed to change memory used by any pending frame (the driver will block when you attempt to call glMapBuffer (...)) until said frame is finished.
3 buffers is a good compromise; you might still incur synchronization overhead every once in a while if the CPU's > 2 frames ahead of the GPU. This situation (> 2 pre-rendered frames) indicates that you are GPU-bound, and CPU/driver synchronization for 1 frame will not actually change that. So you can probably afford to block the render thread.
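The rotation itself is just modular indexing. A minimal sketch of the bookkeeping (the buffer count of 3 follows the suggestion above; the function name is made up for illustration):

```c
#define STREAM_BUFFER_COUNT 3 /* covers ~2 queued frames plus the one being written */

/* Which buffer in the ring to write (and then draw) on a given frame. */
int stream_buffer_index(unsigned long frame)
{
    return (int)(frame % STREAM_BUFFER_COUNT);
}
```

With three buffers, the buffer reused on frame N was last submitted on frame N - 3, so by the time you map it again the driver has usually finished with it and glMapBuffer won't block.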
In OpenGL wiki on Performance, it says:
"OpenGL implementations are almost always pipelined - that is to say,
things are not necessarily drawn when you tell OpenGL to draw them -
and the fact that an OpenGL call returned doesn't mean it finished
rendering."
Since it says "almost", that means there are some implementations that are not pipelined.
Here I find one:
OpenGL Pixel Buffer Object (PBO)
"Conventional glReadPixels() blocks the pipeline and waits until all
pixel data are transferred. Then, it returns control to the
application. On the contrary, glReadPixels() with PBO can schedule
asynchronous DMA transfer and returns immediately without stall.
Therefore, the application (CPU) can execute other process right away,
while transferring data with DMA by OpenGL (GPU)."
So this means conventional glReadPixels() (not with PBO) blocks the pipeline.
But I cannot tell that from the OpenGL reference pages for glReadPixels.
Then I am wondering:
which OpenGL implementations are not pipelined?
How about glDrawArrays?
The OpenGL specification itself does not define the term "pipeline"; it speaks of a "command stream". The runtime behavior of command stream execution is deliberately left open, to give implementors maximal flexibility.
The important term is "OpenGL synchronization point": https://www.opengl.org/wiki/Synchronization
Here I find one: (Link to songho article)
Note that this is not an official OpenGL specification resource. The wording "blocks the OpenGL pipeline" is a bit unfortunate, because it gets the actual blocking and bottleneck turned "upside down". Essentially it means, that glReadPixels can only return once all the commands leading up to the image it will fetch have been executed.
So this means conventional glReadPixels() (not with PBO) blocks the pipeline. But I cannot tell that from the OpenGL reference pages for glReadPixels.
Actually it's not the OpenGL pipeline that gets blocked, but the execution of the program on the CPU. It means, that the GPU sees no further commands coming from the CPU. So the pipeline doesn't get "blocked" but in fact drained. When a pipeline drains, or needs to be restarted one says the pipeline has been stalled (i.e. the flow in the pipeline came to a halt).
From the GPU's point of view, everything happens with maximum throughput: render the stuff until the point glReadPixels got called, do a DMA transfer; unfortunately, no further commands are available after initiating the transfer.
How about glDrawArrays?
glDrawArrays returns immediately after the data has been queued and the necessary preparations have been made.
Actually, it means that this specific operation can't be pipelined, because all data needs to be transferred before the function returns; it doesn't mean other things can't be.
Operations like that are said to stall the pipeline. One function that will always stall the pipeline is glFinish.
Usually when the function returns a value like getting the contents of a buffer, it will induce a stall.
Depending on the driver implementation creating programs and buffers and such can be done without stalling.
Then I am wondering: which OpenGL implementations are not pipelined?
I could imagine that a pure software implementation might not be pipelined. Not much reason to queue up work if you end up executing it on the same CPU. Unless you wanted to take advantage of multi-threading.
But it's probably safe to say that any OpenGL implementation that uses dedicated hardware (commonly called GPU) will be pipelined. This allows the CPU and GPU to work in parallel, which is critical to get good system performance. Also, submitting work to the GPU incurs a certain amount of overhead, so it's beneficial to queue up work, and then submit it in larger batches.
But I cannot tell that from the OpenGL reference pages for glReadPixels.
True. The man pages don't directly specify which calls cause a synchronization. In general, anything that returns values/data produced by the GPU causes synchronization. Examples that come to mind:
glFinish(). Explicitly requires a full synchronization, which is actually its only purpose.
glReadPixels(), in the non PBO case. The GPU has to finish rendering before you can read back the result.
glGetQueryObjectiv(id, GL_QUERY_RESULT, ...). Blocks until the GPU reaches the point where the query was submitted.
glClientWaitSync(). Waits until the GPU reaches the point where the corresponding glFenceSync() was submitted.
Note that there can be different types of synchronization that are not directly tied to specific OpenGL calls. For example, in the case where the whole workload is GPU limited, the CPU would queue up an infinite amount of work unless there is some throttling. So the driver will block the CPU at more or less arbitrary points to let the GPU catch up to a certain point. This could happen at frame boundaries, but it does not have to. Similar synchronization can be necessary if memory runs low, or if internal driver resources are exhausted.
I wanted to render multiple video streams using OpenGL. Currently I am doing this using glTexImage2D, provided by JOGL, and rendering in a Swing window.
For updating texture content for each video frame I am calling glTexImage2D. I want to know is there any faster method to update texture without calling glTexImage2D for each frame.
You will always be using glTexImage2D, but with the difference that the data comes from a buffer object rather than from a pointer.
What's slow in updating a texture is not updating the texture, but synchronizing (blocking) with the current draw operation, and the PCIe transfer. When you call glTexImage, OpenGL must wait until it is done drawing the last frame during which it is still reading from the texture. During that time, your application is blocked and does nothing (this is necessary because otherwise you could modify or free the memory pointed to before OpenGL can copy it!). Then it must copy the data and transfer it to the graphics card, and only then your application continues to run.
While one can't make that process much faster, one can make it run asynchronously, so this latency pretty much disappears.
The easiest way of doing this for video frames is to create a buffer name, bind it, and reserve-initialize it once.
Then, on each subsequent frame, discard-initialize it by calling glBufferData with a null data pointer, and fill it either with a non-reserving call, or by mapping the buffer's complete range.
The reason why you want to do this strange dance instead of simply overwriting the buffer is that this will not block. OpenGL will synchronize access to buffer objects so you do not overwrite data while it is still reading from it. glBufferData with a null data pointer is a way of telling OpenGL that you don't really care about the old contents and don't necessarily need the same buffer. So it will just allocate another one and give you that one, keep reading from the old one, and secretly swap them when it's done.
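Put together, the per-frame update described above might look like this sketch (an active context is assumed; `pbo`, `tex`, `frameSize`, `framePixels`, `width`, and `height` are hypothetical names, and the per-frame upload here uses glTexSubImage2D sourcing from the bound unpack buffer):

```c
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
/* Discard-initialize (orphan): GL may hand back fresh storage while it
   keeps reading the old contents. */
glBufferData(GL_PIXEL_UNPACK_BUFFER, frameSize, NULL, GL_STREAM_DRAW);
void *dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(dst, framePixels, frameSize);   /* fill with the new video frame */
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

/* Upload from the unpack buffer; the last argument is now an offset. */
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, (const void *)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
```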
Since the word "synchronization" was used already, I shall explain my choice of glMapBufferRange in the link above, when in fact you want to map the whole buffer, not some range. Why would one want that?
Even if OpenGL can mostly avoid synchronizing when using the discard technique above, it may still have to, sometimes.
Also, it still has to run some kind of memory allocation algorithm to manage the buffers, which takes driver time. glMapBufferRange lets you specify additional flags, in particular (in later OpenGL versions) a flag that says "don't synchronize". This allows for a more complicated but still faster approach in which you create a single buffer twice the size you need once, and then keep mapping/writing either the lower or upper half, telling OpenGL not to synchronize at all. It is then your responsibility to know when it's safe (presumably by using a fence object), but you avoid all overhead as much as possible.
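That double-size, no-synchronization approach might be sketched like this (assuming the buffer was created at 2 * halfSize, and using a fence per half, as suggested, to know when each half is safe to reuse; `halfSize`, `halfFence`, `vertexData`, and `frame` are hypothetical names):

```c
int half = (int)(frame & 1);             /* ping-pong between the two halves */
GLintptr offset = half ? (GLintptr)halfSize : 0;

/* Before overwriting this half, make sure the GPU is done with it. */
if (halfFence[half]) {
    while (glClientWaitSync(halfFence[half], GL_SYNC_FLUSH_COMMANDS_BIT,
                            1000000) == GL_TIMEOUT_EXPIRED)
        ;
    glDeleteSync(halfFence[half]);
    halfFence[half] = 0;
}

void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, offset, halfSize,
                             GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(ptr, vertexData, halfSize);
glUnmapBuffer(GL_ARRAY_BUFFER);

/* ... draw using vertices at [offset, offset + halfSize) ... */
halfFence[half] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
```

The fence wait should almost never block in practice; if it does, the GPU is more than a frame behind and a third region would help.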
You can't update the texture without updating the texture.
Also, I don't think that one call to glTexImage can be a real performance problem. If you are that concerned about it though, create two textures and update one of them while using the other for drawing, then swap (just like double buffering works).
If you could move processing to GPU you wouldn't have to call the function at all, which is about 100% speedup.