OpenGL: which OpenGL implementations are not pipelined? - opengl

In OpenGL wiki on Performance, it says:
"OpenGL implementations are almost always pipelined - that is to say,
things are not necessarily drawn when you tell OpenGL to draw them -
and the fact that an OpenGL call returned doesn't mean it finished
rendering."
Since it says "almost", that means there are some implementations that are not pipelined.
Here I find one:
OpenGL Pixel Buffer Object (PBO)
"Conventional glReadPixels() blocks the pipeline and waits until all
pixel data are transferred. Then, it returns control to the
application. On the contrary, glReadPixels() with PBO can schedule
asynchronous DMA transfer and returns immediately without stall.
Therefore, the application (CPU) can execute other process right away,
while transferring data with DMA by OpenGL (GPU)."
So this means conventional glReadPixels() (not with PBO) blocks the pipeline.
But actually in OpenGL reference of glReadPixels I cannot tell the fact.
Then I am wondering:
which OpenGL implementations are not pipelined?
How about glDrawArrays?

The OpenGL specification itself does not specify the term "pipeline" but rather "command stream". The runtime behavior of command stream execution is deliberately left open, to give implementors maximal flexibility.
The important term is "OpenGL synchronization point": https://www.opengl.org/wiki/Synchronization
Here I find one: (Link to songho article)
Note that this is not an official OpenGL specification resource. The wording "blocks the OpenGL pipeline" is a bit unfortunate, because it gets the actual blocking and bottleneck turned "upside down". Essentially it means, that glReadPixels can only return once all the commands leading up to the image it will fetch have been executed.
So this means conventional glReadPixels() (not with PBO) blocks the pipeline. But actually in OpenGL reference of glReadPixels I cannot tell the fact.
Actually it's not the OpenGL pipeline that gets blocked, but the execution of the program on the CPU. It means, that the GPU sees no further commands coming from the CPU. So the pipeline doesn't get "blocked" but in fact drained. When a pipeline drains, or needs to be restarted one says the pipeline has been stalled (i.e. the flow in the pipeline came to a halt).
From the GPU's point of view everything happens at maximum throughput: render everything up to the point where glReadPixels got called, then do a DMA transfer; unfortunately, no further commands are available after initiating the transfer.
How about glDrawArrays?
glDrawArrays returns as soon as the data has been queued and the necessary internal arrangements have been made.

Actually it means that this specific operation can't be pipelined, because all data needs to be transferred before the function returns; it doesn't mean other operations can't be.
Operations like that are said to stall the pipeline. One function that will always stall the pipeline is glFinish.
Usually, when a function returns a value - such as fetching the contents of a buffer - it will induce a stall.
Depending on the driver implementation creating programs and buffers and such can be done without stalling.

Then I am wondering: which OpenGL implementations are not pipelined?
I could imagine that a pure software implementation might not be pipelined. Not much reason to queue up work if you end up executing it on the same CPU. Unless you wanted to take advantage of multi-threading.
But it's probably safe to say that any OpenGL implementation that uses dedicated hardware (commonly called GPU) will be pipelined. This allows the CPU and GPU to work in parallel, which is critical to get good system performance. Also, submitting work to the GPU incurs a certain amount of overhead, so it's beneficial to queue up work, and then submit it in larger batches.
But actually in OpenGL reference of glReadPixels I cannot tell the fact.
True. The man pages don't directly specify which calls cause a synchronization. In general, anything that returns values/data produced by the GPU causes synchronization. Examples that come to mind:
glFinish(). Explicitly requires a full synchronization, which is actually its only purpose.
glReadPixels(), in the non PBO case. The GPU has to finish rendering before you can read back the result.
glGetQueryObjectiv(id, GL_QUERY_RESULT, ...). Blocks until the GPU reaches the point where the query was submitted.
glClientWaitSync(). Waits until the GPU reaches the point where the corresponding glFenceSync() was submitted.
Note that there can be different types of synchronizations that are not directly tied to specific OpenGL calls. For example, in the case where the whole workload is GPU limited, the CPU would queue up an infinite amount of work unless there is some throttling. So the driver will block the CPU at more or less arbitrary points to let the GPU catch up to a certain point. This could happen at frame boundaries, but it does not have to. Similar synchronization can be necessary if memory runs low, or if internal driver resources are exhausted.
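The glFenceSync()/glClientWaitSync() pair from the list above can be sketched like this. This is a minimal illustration, assuming a current OpenGL context and a loader that exposes GL 3.2+ (ARB_sync); error handling is omitted:

```c
/* Sketch: an explicit CPU-side synchronization point using a fence.
   Requires a current OpenGL 3.2+ context (or ARB_sync). */
void wait_for_gpu_to_reach_this_point(void)
{
    /* Insert a fence into the command stream. */
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

    /* Block the CPU until the GPU has executed everything submitted
       before the fence. GL_SYNC_FLUSH_COMMANDS_BIT ensures the fence
       is actually submitted to the GPU on the first wait. */
    GLenum result;
    do {
        result = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                                  1000000000ull /* 1 s timeout, in ns */);
    } while (result == GL_TIMEOUT_EXPIRED);

    glDeleteSync(fence);
}
```

Note that glClientWaitSync takes a timeout in nanoseconds; looping on GL_TIMEOUT_EXPIRED avoids waiting forever on a lost context.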

Related

Asynchronous rendering model in vulkan

Recently I'm reading https://github.com/ARM-software/vulkan_best_practice_for_mobile_developers/blob/master/samples/vulkan_basics.md, and it said:
OpenGL ES uses a synchronous rendering model, which means that an API call must behave as if all earlier API calls have already been processed. In reality no modern GPU works this way, rendering workloads are processed asynchronously and the synchronous model is an elaborate illusion maintained by the device driver. To maintain this illusion the driver must track which resources are read or written by each rendering operation in the queue, ensure that workloads run in a legal order to avoid rendering corruption, and ensure that API calls which need a data resource block and wait until that resource is safely available.
Vulkan uses an asynchronous rendering model, reflecting how the modern GPUs work. Applications queue rendering commands into a queue, use explicit scheduling dependencies to control workload execution order, and use explicit synchronization primitives to align dependent CPU and GPU processing.
The impact of these changes is to significantly reduce the CPU overhead of the graphics drivers, at the expense of requiring the application to handle dependency management and synchronization.
Could someone help explain why asynchronous rendering model could reduce CPU overhead? Since in Vulkan you still have to track state yourself.
Could someone help explain why asynchronous rendering model could
reduce CPU overhead?
First of all, let's get back to the original statement you are referring to, emphasis mine:
The impact of these changes is to significantly reduce the CPU
overhead of the graphics drivers, [...]
So the claim here is that the driver itself will need to consume less CPU time, and this is easy to see, as the driver can forward your requests more directly, "as-is".
However, one overall goal of a low-level rendering API like Vulkan is also a potentially reduced CPU overhead in general, not only in the driver.
Consider the following example: You have a draw call which renders to a texture. And then you have another draw call which samples from this texture.
To get the implicit synchronization right, the driver has to track the usage of this texture, both as render target and as source for texture sampling operations.
It doesn't know in advance whether the next draw call will need any resources that are still to be written to by previous draw calls. It has to track every such potential conflict, no matter whether it can actually occur in your application or not. And it also must be extremely conservative in its decisions. You might have a texture attached to the framebuffer of a draw call, and you may know that, with the actual uniform values you set for the shaders, the texture is not modified. But the GPU driver can't know that. If it can't rule out - with absolute certainty - that a resource is modified, it has to assume it is.
However, your application is much more likely to know such details. If you have several render passes, and the second pass depends on the texture rendered to in the first, you can (and must) add the proper synchronization primitives - but the GPU driver doesn't need to care why any synchronization is necessary at all, and it doesn't need to track any resource usage to find out - it can just do as it is told. And in many cases your application doesn't need to track its own resource usage either: the need for synchronization at some point is inherent in the usage as you coded it. There might still be cases where you need to track your own resource usage, though - especially if you write some intermediate layer, like a more high-level graphics library, where you know less and less about the structure of the rendering. Then you get into a position similar to that of a GL driver (unless you want to push all the burden of synchronization onto the users of your library, like Vulkan does).
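The render-to-texture dependency from the example above, stated explicitly by the application rather than inferred by the driver, might look roughly like this in Vulkan. This is a sketch only; `cmd` (a recording command buffer) and `image` (the render target) are assumed to exist:

```c
/* Sketch (Vulkan): the application tells the driver explicitly that the
   first pass's color writes must complete before the second pass samples
   the image. The driver records this as-is, without tracking anything. */
VkImageMemoryBarrier barrier = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
    .oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
    .newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
    .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .image = image,  /* assumed: the texture rendered to in pass 1 */
    .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
};
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, /* after pass 1 writes  */
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         /* before pass 2 samples */
    0, 0, NULL, 0, NULL, 1, &barrier);
```

The point is that the barrier encodes exactly one dependency the application knows about, instead of the driver conservatively tracking every resource bound anywhere.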

Is OpenGL's execution model synchronous, within a single context?

Given a single OpenGL context (and therefore can only be accessed by a single CPU thread at a time), if I execute two OpenGL commands, is there a guarantee that the second command will see the results of the first?
In the vast majority of cases, this is true. OpenGL commands largely behave as if all prior commands have completed fully and completely. Notable places where this matters include:
Blending. Blending is often an operation that is very sensitive to order. Blending not only works correctly between rendering commands; it works correctly within a rendering command. Triangles in a draw call are explicitly ordered, and blending will blend things in the order in which the primitives appear in the draw call.
Reading from a previously rendered framebuffer image. If you render to an image, you can unbind that framebuffer and bind the image as a texture and read from it, without doing anything special.
Reading data from a buffer that was used in a transform feedback operation. Nothing special needs to go between the command that generates the feedback data and the command reading it (outside of unbinding the buffer from the TF operation and binding it in the proper target for reading).
Obviously, waiting for the GPU to complete its commands before letting the CPU send more sounds incredibly slow. This is why OpenGL works under the "as if" rule: implementations must behave "as if" they were synchronous. So implementations spend a lot of time tracking which operations will produce which data, so that if you do something that will require something to wait on the GPU to produce that data, it can do so.
So you should try to avoid immediately trying to read data generated by some command. Put some distance between the generator and the consumer.
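"Putting distance between generator and consumer" can be sketched with a query object, polling for availability instead of blocking. Assumes a current context; draw_scene() and do_other_cpu_work() are hypothetical helpers:

```c
/* Sketch: avoid an immediate read-back by polling GL_QUERY_RESULT_AVAILABLE. */
GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_SAMPLES_PASSED, query);
draw_scene();                     /* assumed helper: issues the draw calls */
glEndQuery(GL_SAMPLES_PASSED);

do_other_cpu_work();              /* give the GPU time to catch up */

GLuint available = 0;
glGetQueryObjectuiv(query, GL_QUERY_RESULT_AVAILABLE, &available);
if (available) {
    GLuint samples;
    /* The result is ready, so this read will not stall. */
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samples);
}
/* If not available, keep working and poll again later;
   reading GL_QUERY_RESULT now would block instead. */
```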
Now, I said above that this is true for "the vast majority of cases". So there are some back-doors. In no particular order:
Attempting to read from an image that you are currently using as a render target is normally forbidden. But under specific circumstances, it can be allowed, typically through the use of the glTextureBarrier command. This command ensures the execution and visibility of previously submitted rendering commands to subsequent commands. Failure to do this correctly results in undefined behavior.
The contents of buffers or images that are subject to writes (atomic or otherwise) from what we can call incoherent memory access operations. These include image store/atomic operations, SSBO store/atomic operations, and atomic counter operations. Unless you employ various tools, specific to the particulars of who is reading the data and their relationship to the writer, you will get undefined behavior.
Sync objects. By their nature, sync objects bypass the in-order execution model because... that's their point: to allow the user to be exposed directly to how the GPU executes stuff.
Asynchronous pixel transfers are an odd case. They don't actually break the in-order nature of the OpenGL memory model. But because you are reading into/writing from storage that you don't have direct access to, the implementation can hide the fact that it will take some time to read the data. So if you invoke a pixel transfer to a buffer, and then immediately try to read from the buffer, the system has to put a wait between those two commands. But if you issue a bunch of commands between the pixel transfer and the consumer of it, and those commands don't use the range being consumed, then the cost of the pixel transfer can appear to be negligible. Sync objects can be employed to know when the transfer is actually over.

Making glReadPixel() run faster

I want a really fast way to capture the content of the openGL framebuffer for my application. Generally, glReadPixels() is used for reading the content of framebuffer into a buffer. But this is slow.
I was trying to parallelise the process of reading the framebuffer content by creating 4 threads to read the framebuffer from 4 different regions using glReadPixels(). But the application is exiting due to a segmentation fault. If I remove the glReadPixels() call from the threads, then the application runs properly.
Threads do not work, abstain from that approach.
Creating several threads fails, as you have noticed, because only one thread has a current OpenGL context. In principle, you could make the context current in each worker thread before calling glReadPixels, but this will require extra synchronization from your side (otherwise, a thread could be preempted in between making the context current and reading back!), and (wgl|glx)MakeCurrent is a terribly slow function that will seriously stall OpenGL. In the end, you'll be doing more work to get something much slower.
There is no way to make glReadPixels any faster[1], but you can decouple the time it takes (i.e. the readback runs asynchronously), so it does not block your application and effectively appears to run "faster".
You want to use a Pixel buffer object for that. Be sure to get the buffer flags correct.
Note that mapping the buffer to access its contents will still block if the complete contents hasn't finished transferring, so it will still not be any faster. To account for that, you either have to read the previous frame, or use a fence object which you can query to be sure that it's done.
Or, simpler but less reliable, you can insert "some other work" in between glReadPixels and accessing the data. This will not guarantee that the transfer has finished by the time you access the data, so it may still block. However, it may just work, and it will likely block for a shorter time (thus run "faster").
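The "read the previous frame" approach can be sketched with two pixel buffer objects, alternating between them each frame. This is an illustrative sketch, not a drop-in implementation: `width`, `height`, and the setup/per-frame split are assumed, and it requires GL 2.1+ (or ARB_pixel_buffer_object) with a current context:

```c
/* Sketch: asynchronous readback with two PBOs. Frame N's glReadPixels
   targets one PBO (async DMA), while the PBO filled during frame N-1
   is mapped - by then the transfer has had a whole frame to finish. */
static GLuint pbo[2];
static int pbo_index = 0;
const size_t size = width * height * 4;   /* RGBA8; width/height assumed */

/* one-time setup */
glGenBuffers(2, pbo);
for (int i = 0; i < 2; ++i) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
    glBufferData(GL_PIXEL_PACK_BUFFER, size, NULL, GL_STREAM_READ);
}

/* per frame */
pbo_index = 1 - pbo_index;
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[pbo_index]);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0); /* async */

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[1 - pbo_index]); /* last frame's PBO */
void *data = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (data) {
    /* process last frame's pixels here */
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
```

Mapping can still block if the transfer isn't done; a fence object (glFenceSync) can be queried first if you need a guarantee.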
[1] There are plenty of ways of making it slower, e.g. if you ask OpenGL to do some weird conversions or if you use wrong buffer flags. However, generally, there's no way to make it faster since its speed depends on all previous draw commands having finished before the transfer can even start, and the data being transferred over the PCIe bus (which has a fixed time overhead plus a finite bandwidth).
The only viable way of making readbacks "faster" is hiding this latency. It's of course still not faster, but you don't get to feel it.

OpenGL when can I start issuing commands again

The standards allude to rendering starting upon my first gl command and continuing in parallel to further commands. Certain functions, like glBufferSubData indicate loading can happen during rendering so long as the object is not currently in use. This introduces a logical concept of a "frame", though never explicitly mentioned in the standard.
So my question is: what defines this logical frame? That is, which calls demarcate the frame, such that I can start making gl calls again without interfering with the previous frame?
For example, using EGL you eventually call eglSwapBuffers (most implementations have some kind of swap command). Logically this is the boundary between one frame and the next. However, this call blocks to support v-sync, meaning you can't issue new commands until it returns. Yet the documentation implies you can start issuing new commands prior to its return in another thread (provided you don't touch any in-use buffers).
How can I start issuing commands to the next buffer even while the swap command is still blocking on the previous buffer? I would like to start streaming data for the next frame while the GPU is working on the old frame (in particular, I will have two vertex buffers which would be swapped each frame specifically for this purpose, and alluded to in the OpenGL documentation).
OpenGL has no concept of "frame", logical or otherwise.
OpenGL is really very simple: every command executes as if all prior commands had completed before hand.
Note the key phrase "as if". Let's say you render from a buffer object, then modify its data immediately afterwards. Like this:
glBindVertexArray(someVaoThatUsesBufferX);
glDrawArrays(...);
glBindBuffer(GL_ARRAY_BUFFER, BufferX);
glBufferSubData(GL_ARRAY_BUFFER, ...);
This is 100% legal in OpenGL. There are no caveats, questions, concerns, etc about exactly how this will function. That glBufferSubData call will execute as though the glDrawArrays command has finished.
The only thing you have to consider is the one thing the specification does not specify: performance.
An implementation is well within its rights to detect that you're modifying a buffer that may be in use, and therefore stall the CPU in glBufferSubData until the rendering from that buffer is complete. The OpenGL implementation is required to do either this or something else that prevents the actual source buffer from being modified while it is in use.
So OpenGL implementations execute commands asynchronously where possible, according to the specification. As long as the outside world cannot tell that glDrawArrays didn't finish drawing anything yet, the implementation can do whatever it wants. If you issue a glReadPixels right after the drawing command, the pipeline would have to stall. You can do it, but there is no guarantee of performance.
This is why OpenGL is defined as a closed box the way it is. This gives implementations lots of freedom to be asynchronous wherever possible. Every access of OpenGL data requires an OpenGL function call, which allows the implementation to check to see if that data is actually available yet. If not, it stalls.
Getting rid of stalls is one reason why buffer object invalidation is possible; it effectively tells OpenGL that you want to orphan the buffer's data storage. It's the reason why buffer objects can be used for pixel transfers; it allows the transfer to happen asynchronously. It's the reason why fence sync objects exist, so that you can tell whether a resource is still in use (perhaps for GL_UNSYNCHRONIZED_BIT buffer mapping). And so forth.
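Buffer orphaning, as mentioned above, can be sketched like this. A sketch under assumed names (`vbo`, `size`, `newVertices`); requires a current context:

```c
/* Sketch: orphaning a buffer to avoid a stall. Instead of waiting for the
   GPU to finish with the old storage, ask GL for fresh storage of the same
   size; the old storage is released once the GPU is done with it. */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);  /* orphan */
glBufferSubData(GL_ARRAY_BUFFER, 0, size, newVertices);     /* no wait */

/* Or, equivalently, with an invalidating map (GL 3.0+): */
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(ptr, newVertices, size);   /* needs <string.h> */
glUnmapBuffer(GL_ARRAY_BUFFER);
```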
However, this calls blocks to support v-sync, meaning you can't issue new commands until it returns.
Says who? The buffer swapping command may stall. It may not. It's implementation-defined, and it can be changed with certain commands. The documentation for eglSwapBuffers only says that it performs a flush, which could stall the CPU but does not have to.

opengl: glFlush() vs. glFinish()

I'm having trouble distinguishing the practical difference between calling glFlush() and glFinish().
The docs say that glFlush() and glFinish() will push all buffered operations to OpenGL so that one can be assured they will all be executed, the difference being that glFlush() returns immediately where as glFinish() blocks until all the operations are complete.
Having read the definitions, I figured that if I were to use glFlush() that I would probably run into the problem of submitting more operations to OpenGL than it could execute. So, just to try, I swapped out my glFinish() for a glFlush() and lo and behold, my program ran (as far as I could tell), the exact same; frame rates, resource usage, everything was the same.
So I'm wondering if there's much difference between the two calls, or if my code makes them run no different. Or where one should be used vs. the other.
I also figured that OpenGL would have some call like glIsDone() to check whether or not all the buffered commands for a glFlush() are complete or not (so one doesn't send operations to OpenGL faster than they can be executed), but I could find no such function.
My code is the typical game loop:
while (running) {
process_stuff();
render_stuff();
}
Mind that these commands have existed since the early days of OpenGL. glFlush ensures that previous OpenGL commands must complete in finite time (OpenGL 2.1 specs, page 245). If you draw directly to the front buffer, this ensures that the OpenGL driver starts drawing without too much delay. Think of a complex scene whose objects appear one after another on the screen as you call glFlush after each object. However, when using double buffering, glFlush has practically no effect at all, since the changes won't be visible until you swap the buffers.
glFinish does not return until all effects from previously issued commands [...] are fully realized. This means that the execution of your program waits here until every last pixel is drawn and OpenGL has nothing more to do. If you render directly to the front buffer, glFinish is the call to make before using the operating system calls to take screenshots. It is far less useful for double buffering, because you don't see the changes you forced to complete.
So if you use double buffering, you probably need neither glFlush nor glFinish. SwapBuffers implicitly directs the OpenGL calls to the correct buffer; there's no need to call glFlush first. And don't worry about stressing the OpenGL driver: glFlush will not choke on too many commands. It is not guaranteed that this call returns immediately (whatever that means), so it can take any time it needs to process your commands.
As the other answers have hinted, there really is no good answer as per the spec. The general intent of glFlush() is that after calling it, the host CPU will have no OpenGL-related work to do -- the commands will have been pushed to the graphics hardware. The general intent of glFinish() is that after it returns, no remaining work is left, and the results should be available to all appropriate non-OpenGL APIs (e.g. reads from the framebuffer, screenshots, etc...). Whether that is really what happens is driver-dependent. The specification allows a ton of latitude as to what is legal.
I was always confused about those two commands too, but this image made it all clear to me:
Apparently some GPU drivers don't send the issued commands to the hardware unless a certain number of commands has been accumulated. In this example that number is 5.
The image shows various OpenGL commands (A, B, C, D, E...) that have been issued. As we can see at the top, the commands don't get issued yet, because the queue isn't full yet.
In the middle we see how glFlush() affects the queued up commands. It tells the driver to send all queued up commands to the hardware (even if the queue isn't full yet). This doesn't block the calling thread. It merely signals the driver that we might not be sending any additional commands. Therefore waiting for the queue to fill up would be a waste of time.
At the bottom we see an example using glFinish(). It almost does the same thing as glFlush(), except that it makes the calling thread wait till all commands have been processed by the hardware.
Image taken from the book "Advanced Graphics Programming Using OpenGL".
If you did not see any performance difference, it means you're doing something wrong. As some others mentioned, you don't need to call either, but if you do call glFinish, then you're automatically losing the parallelism that the GPU and CPU can achieve. Let me dive deeper:
In practice, all the work you submit to the driver is batched, and sent to the hardware potentially way later (e.g. at SwapBuffer time).
So, if you're calling glFinish, you're essentially forcing the driver to push the commands to the GPU (that it batched till then, and never asked the GPU to work on), and stall the CPU until the commands pushed are completely executed. So during the whole time the GPU works, the CPU does not (at least on this thread). And all the time the CPU does its work (mostly batching commands), the GPU does not do anything. So yeah, glFinish should hurt your performance. (This is an approximation, as drivers may start having the GPU work on some commands if a lot of them were already batched. It's not typical though, as the command buffers tend to be big enough to hold quite a lot of commands).
Now, why would you call glFinish at all, then? The only times I've used it were when I had driver bugs. Indeed, if one of the commands you send down to the hardware crashes the GPU, then your simplest option to identify which command is the culprit is to call glFinish after each draw. That way, you can narrow down what exactly triggers the crash.
As a side note, APIs like Direct3D don't support a Finish concept at all.
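The debugging and measurement uses of glFinish described above can be sketched as follows; `now_seconds()` and `draw_everything()` are assumed helpers, not real GL API:

```c
/* Sketch: bracketing GPU work with glFinish for a crude CPU-side timing.
   Without the glFinish, the measurement would only cover command
   *submission*, not execution. */
double t0 = now_seconds();        /* assumed high-resolution clock helper */
draw_everything();                /* assumed: issues all GL commands      */
glFinish();                       /* CPU blocks until the GPU is done     */
double elapsed = now_seconds() - t0;
/* `elapsed` now includes GPU execution time, at the cost of a full stall.
   For production profiling, GL_TIME_ELAPSED timer queries measure GPU
   time without stalling the CPU. */
```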
Have a look here. In short, it says:
glFinish() has the same effect as glFlush(), with the addition that glFinish() will block until all commands submitted have been executed.
Another article describes other differences:
Swap functions (used in double-buffered applications) automatically flush the commands, so no need to call glFlush
glFinish forces OpenGL to perform outstanding commands, which is a bad idea (e.g. with VSync)
To sum up, this means that you don't even need these functions when using double buffering, except if your swap-buffers implementation doesn't automatically flush the commands.
glFlush really dates back to a client server model. You send all gl commands through a pipe to a gl server. That pipe might buffer. Just like any file or network i/o might buffer. glFlush only says "send the buffer now, even if it is not full yet!". On a local system this is almost never needed because a local OpenGL API is unlikely to buffer itself and just issues commands directly. Also all commands that cause actual rendering will do an implicit flush.
glFinish on the other hand was made for performance measurement. Kind of a PING to the GL server. It roundtrips a command and waits until the server responds "I am idle".
Nowadays, modern local drivers have quite creative ideas about what it means to be idle, though. Is it "all pixels are drawn" or "my command queue has space"? Also, because many old programs sprinkled glFlush and glFinish throughout their code without reason, as a kind of voodoo coding, many modern drivers just ignore them as an "optimization". Can't blame them for that, really.
So in summary: Treat both glFinish and glFlush as no ops in practice unless you are coding for an ancient remote SGI OpenGL server.
There doesn't seem to be a way of querying the status of the buffer. There is this Apple extension which could serve the same purpose, but it doesn't seem cross-platform (I haven't tried it). At a quick glance, it seems that, prior to flushing, you'd push the fence command in; you can then query the status of that fence as it moves through the buffer.
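A cross-platform version of that fence idea exists in core OpenGL today: sync objects (ARB_sync, core since GL 3.2) give you exactly the glIsDone()-style check the question asks for. A minimal sketch, assuming a current 3.2+ context:

```c
/* Sketch: a non-blocking "is the GPU done yet?" check via a sync object. */
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();   /* make sure the fence is actually submitted to the GPU */

/* later, without blocking the CPU: */
GLint status = GL_UNSIGNALED;
glGetSynciv(fence, GL_SYNC_STATUS, sizeof(status), NULL, &status);
if (status == GL_SIGNALED) {
    /* everything submitted before the fence has finished executing */
    glDeleteSync(fence);
} else {
    /* not done yet - keep doing other work and poll again */
}
```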
I wonder if you could use flush prior to buffering up commands, but prior to beginning to render the next frame you call finish. This would allow you to begin processing the next frame as the GPU works, but if it's not done by the time you get back, finish will block to make sure everything's in a fresh state.
I haven't tried this, but I will shortly.
I have tried it on an old application that has pretty even CPU & GPU use. (It originally used finish.)
When I changed it to flush at end and finish at begin, there were no immediate problems. (Everything looked fine!) The responsiveness of the program increased, probably because the CPU wasn't stalled waiting on the GPU. Definitely a better method.
For comparison, I removed finish from the start of the frame, leaving flush, and it performed the same.
So I would say use flush and finish, because when the buffer is empty at the call to finish, there is no performance hit. And I'm guessing if the buffer were full you should want to finish anyway.
The question is: do you want your code to continue running while the OpenGL commands are being executed, or to continue only after your OpenGL commands have been executed?
This can matter in some cases - for example, with network delays - when you want certain console output to appear only after the images have been drawn.