Is OpenGL's execution model synchronous, within a single context? - opengl

Given a single OpenGL context (and therefore can only be accessed by a single CPU thread at a time), if I execute two OpenGL commands, is there a guarantee that the second command will see the results of the first?

In the vast majority of cases, this is true. OpenGL commands largely behave as if all prior commands have completed fully and completely. Notable places where this matters include:
Blending. Blending is often an operation that is very sensitive to order. Blending not only works correctly between rendering commands, it works correctly within a rendering command. Triangles in a draw call are explicitly ordered, and blending will blend things in the order that the primitives appear in the draw call
Reading from a previously rendered framebuffer image. If you render to an image, you can unbind that framebuffer and bind the image as a texture and read from it, without doing anything special.
Reading data from a buffer that was used in a transform feedback operation. Nothing special needs to go between the command that generates the feedback data and the command reading it (outside of unbinding the buffer from the TF operation and binding it in the proper target for reading).
Obviously, waiting for the GPU to complete its commands before letting the CPU send more sounds incredibly slow. This is why OpenGL works under the "as if" rule: implementations must behave "as if" they were synchronous. So implementations spend a lot of time tracking which operations will produce which data, so that if you do something that will require something to wait on the GPU to produce that data, it can do so.
So you should try to avoid immediately trying to read data generated by some command. Put some distance between the generator and the consumer.
Now, I said above that this is true for "the vast majority of cases". So there are some back-doors. In no particular order:
Attempting to read from an image that you are currently using as a render target is normally forbidden. But under specific circumstances, it can be allowed, typically through the use of the glTextureBarrier command. This command ensures the execution and visibility of previously submitted rendering commands to subsequent commands. Failure to do this correctly results in undefined behavior.
The contents of buffers or images that are subject to writes (atomic or otherwise) from what we can call incoherent memory access operations. These include image store/atomic operations, SSBO store/atomic operations, and atomic counter operations. Unless you employ various tools, specific to the particulars of who is reading the data and their relationship to the writer, you will get undefined behavior.
Sync objects. By their nature, sync objects bypass the in-order execution model because... that's their point: to allow the user to be exposed directly to how the GPU executes stuff.
Asynchronous pixel transfers are an odd case. They don't actually break the in-order nature of the OpenGL memory model. But because you are reading into/writing from storage that you don't have direct access to, the implementation can hide the fact that it will take some time to read the data. So if you invoke a pixel transfer to a buffer, and then immediately try to read from the buffer, the system has to put a wait between those two commands. But if you issue a bunch of commands between the pixel transfer and the consumer of it, and those commands don't use the range being consumed, then the cost of the pixel transfer can appear to be negligible. Sync objects can be employed to know when the transfer is actually over.

Related

How to synchronize data for OpenGL compute shader soft body simulation

I'm trying to make a soft body physics simulation, using OpenGL compute shaders. I'm using a spring/mass model, where objects are modeled as being made out of a mesh of particles, connected by springs (Wikipedia link with more details). My plan is to have a big SSBO that stores the positions, velocities, and net forces for each particle. I'll have a compute shader that, for each spring, calculates the force between the particles on both ends of that spring (using Hook's law) and adds that to the net forces for those two particles. Then I'll have another compute shader that, for each particle, does some sort of Euler integration using the data from the SSBO, and then zeros the net forces for the next frame.
My problem is with memory synchronization. Each particle is attached to more than one spring, so in the first compute shader, different invocations will be adding to the same location in memory (the one holding the net force). The spring calculations don't use data from that variable, so the writes can take place in whatever order, but I'm unfamiliar with how OpenGL memory works, and I'm not sure how to avoid race conditions. In addition, from what I've read it seems like I'll need glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) between the calls to the spring and particle compute shaders, so that the data written by the former is visible to latter. Is this necessary/sufficient?
I'm afraid to experiment, because I'm not sure what protections OpenGL gives against undefined behavior and accidentally screwing up your computer.
different invocations will be adding to the same location in memory (the one holding the net force). The spring calculations don't use data from that variable, so the writes can take place in whatever order
In this case you would need to use atomicAdd() in GLSL to make sure two separate threads don't get into a race condition.
In your case I don't think this will be a performance issue, but you should be aware that atomicAdd() can cause a big slowdown in cases where many threads are hitting the same location in memory at the same time (they have to serialize and wait for eachother). This performance issue is called "contention", and depending on the problem, you can usually improve it a lot by using warp-level primitives to make sure only 1 thread within each warp needs to actually commit the atomicAdd() (or other atomic operation).
Also "warps" are Nvidia terminology, AMD calls them "wavefronts", and there are different names still on other hardware vendors and API's.
In addition, from what I've read it seems like I'll need glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT)
This is correct. Conceptually the way I think about it is that OpenGL compute shaders are async by default. This means, when you launch a compute shader, there's no guarantee when it will execute relative to subsequent commands. glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) will basically create a wait() between any draw/compute commands accessing that type of resource.

OpenGL: which OpenGL implementations are not pipelined?

In OpenGL wiki on Performance, it says:
"OpenGL implementations are almost always pipelined - that is to say,
things are not necessarily drawn when you tell OpenGL to draw them -
and the fact that an OpenGL call returned doesn't mean it finished
rendering."
Since it says "almost", that means there are some mplementations are not pipelined.
Here I find one:
OpenGL Pixel Buffer Object (PBO)
"Conventional glReadPixels() blocks the pipeline and waits until all
pixel data are transferred. Then, it returns control to the
application. On the contrary, glReadPixels() with PBO can schedule
asynchronous DMA transfer and returns immediately without stall.
Therefore, the application (CPU) can execute other process right away,
while transferring data with DMA by OpenGL (GPU)."
So this means conventional glReadPixels() (not with PBO) blocks the pipeline.
But actually in OpenGL reference of glReadPixels I cannot tell the fact.
Then I am wondering:
which OpenGL implementations are not pipelined?
How about glDrawArrays?
The OpenGL specification itself does not specify the term "pipeline" but rather "command stream". The runtime behavior of command stream execution is deliberately left open, to give implementors maximal flexibility.
The important term is "OpenGL sychronization point": https://www.opengl.org/wiki/Synchronization
Here I find one: (Link to songho article)
Note that this is not an official OpenGL specification resource. The wording "blocks the OpenGL pipeline" is a bit unfortunate, because it gets the actual blocking and bottleneck turned "upside down". Essentially it means, that glReadPixels can only return once all the commands leading up to the image it will fetch have been executed.
So this means conventional glReadPixels() (not with PBO) blocks the pipeline. But actually in OpenGL reference of glReadPixels I cannot tell the fact.
Actually it's not the OpenGL pipeline that gets blocked, but the execution of the program on the CPU. It means, that the GPU sees no further commands coming from the CPU. So the pipeline doesn't get "blocked" but in fact drained. When a pipeline drains, or needs to be restarted one says the pipeline has been stalled (i.e. the flow in the pipeline came to a halt).
From the GPUs point of view everything happens with maximum throughput: Render the stuff until the point glReadPixels got called, do a DMA transfer, unfortunately no further commands are available after initiating the transfer.
How about glDrawArrays?
glDrawArrays returns immediately after the data has been queued and necessary been made.
Actually it means that this specific operation can't be pipelined because all data needs to be transfered before the function returns, it doesn't mean other things can't be.
Operations like that are said to stall the pipeline. One function that will always stall the pipeline is glFinish.
Usually when the function returns a value like getting the contents of a buffer, it will induce a stall.
Depending on the driver implementation creating programs and buffers and such can be done without stalling.
Then I am wondering: which OpenGL implementations are not pipelined?
I could imagine that a pure software implementation might not be pipelined. Not much reason to queue up work if you end up executing it on the same CPU. Unless you wanted to take advantage of multi-threading.
But it's probably safe to say that any OpenGL implementation that uses dedicated hardware (commonly called GPU) will be pipelined. This allows the CPU and GPU to work in parallel, which is critical to get good system performance. Also, submitting work to the GPU incurs a certain amount of overhead, so it's beneficial to queue up work, and then submit it in larger batches.
But actually in OpenGL reference of glReadPixels I cannot tell the fact.
True. The man pages don't directly specify which calls cause a synchronization. In general, anything that returns values/data produced by the GPU causes synchronization. Examples that come to mind:
glFinish(). Explicitly requires a full synchronization, which is actually its only purpose.
glReadPixels(), in the non PBO case. The GPU has to finish rendering before you can read back the result.
glGetQueryObjectiv(id, GL_QUERY_RESULT, ...). Blocks until the GPU reaches the point where the query was submitted.
glClientWaitSync(). Waits until the GPU reaches the point where the corresponding glFenceSync() was submitted.
Note that there can be different types of synchronizations that are not directly tied to specific OpenGL calls. For example, in the case where the whole workload is GPU limited, the CPU would queue up an infinite about of work unless there is some throttling. So the driver will block the CPU at more or less arbitrary points to let the GPU catch up to a certain point. This could happen at frame boundaries, but it does not have to be. Similar synchronization can be necessary if memory runs low, or if internal driver resources are exhausted.

Making glReadPixel() run faster

I want a really fast way to capture the content of the openGL framebuffer for my application. Generally, glReadPixels() is used for reading the content of framebuffer into a buffer. But this is slow.
I was trying to parallelise the procees of reading the framebuffer content by creating 4 threads to read framebuffer from 4 different regions using glReadPixels(). But the application is exiting due to segmentation fault. If I remove the glReadPixels() call from threads then application is running properly.
Threads do not work, abstain from that approach.
Creating several threads fails, as you have noticed, because only one thread has a current OpenGL context. In principle, you could make the context current in each worker thread before calling glReadPixels, but this will require extra synchronization from your side (otherwise, a thread could be preempted in between making the context current and reading back!), and (wgl|glx)MakeCurrent is a terribly slow function that will seriously stall OpenGL. In the end, you'll be doing more work to get something much slower.
There is no way to make glReadPixels any faster1, but you can decouple the time it takes (i.e. the readback runs asynchronously), so it does not block your application and effectively appears to run "faster".
You want to use a Pixel buffer object for that. Be sure to get the buffer flags correct.
Note that mapping the buffer to access its contents will still block if the complete contents hasn't finished transferring, so it will still not be any faster. To account for that, you either have to read the previous frame, or use a fence object which you can query to be sure that it's done.
Or, simpler but less reliable, you can insert "some other work" in between glReadPixels and accessing the data. This will not guarantee that the transfer has finished by the time you access the data, so it may still block. However, it may just work, and it will likely block for a shorter time (thus run "faster").
1 There are plenty of ways of making it slower, e.g. if you ask OpenGL to do some weird conversions or if you use wrong buffer flags. However, generally, there's no way to make it faster since its speed depends on all previous draw commands having finished before the transfer can even start, and the data being transferred over the PCIe bus (which has a fixed time overhead plus a finite bandwidth).
The only viable way of making readbacks "faster" is hiding this latency. It's of course still not faster, but you don't get to feel it.

OpenGL when can I start issuing commands again

The standards allude to rendering starting upon my first gl command and continuing in parallel to further commands. Certain functions, like glBufferSubData indicate loading can happen during rendering so long as the object is not currently in use. This introduces a logical concept of a "frame", though never explicitly mentioned in the standard.
So my question is what defines this logical frame? That is, which calls demarcate the game, such that I can start making gl calls again without interefering with the previous frame?
For example, using EGL you eventually call eglSwapBuffers (most implementations have some kind of swap command). Logically this is the boundary between one frame and the next. However, this calls blocks to support v-sync, meaning you can't issue new commands until it returns. Yet the documentation implies you can start issuing new commands prior to its return in another thread (provided you don't touch any in-use buffers).
How can I start issuing commands to the next buffer even while the swap command is still blocking on the previous buffer? I would like to start streaming data for the next frame while the GPU is working on the old frame (in particular, I will have two vertex buffers which would be swapped each frame specifically for this purpose, and alluded to in the OpenGL documentation).
OpenGL has no concept of "frame", logical or otherwise.
OpenGL is really very simple: every command executes as if all prior commands had completed before hand.
Note the key phrase "as if". Let's say you render from a buffer object, then modify its data immediately afterwards. Like this:
glBindVertexArray(someVaoThatUsesBufferX);
glDrawArrays(...);
glBindBuffer(GL_ARRAY_BUFFER, BufferX);
glBufferSubData(GL_ARRAY_BUFFER, ...);
This is 100% legal in OpenGL. There are no caveats, questions, concerns, etc about exactly how this will function. That glBufferSubData call will execute as though the glDrawArrays command has finished.
The only thing you have to consider is the one thing the specification does not specify: performance.
An implementation is well within its rights to detect that you're modifying a buffer that may be in use, and therefore stall the CPU in glBufferSubData until the rendering from that buffer is complete. The OpenGL implementation is required to do either this or something else that prevents the actual source buffer from being modified while it is in use.
So OpenGL implementations execute commands asynchronously where possible, according to the specification. As long as the outside world cannot tell that glDrawArrays didn't finish drawing anything yet, the implementation can do whatever it wants. If you issue a glReadPixels right after the drawing command, the pipeline would have to stall. You can do it, but there is no guarantee of performance.
This is why OpenGL is defined as a closed box the way it is. This gives implementations lots of freedom to be asynchronous wherever possible. Every access of OpenGL data requires an OpenGL function call, which allows the implementation to check to see if that data is actually available yet. If not, it stalls.
Getting rid of stalls is one reason why buffer object invalidation is possible; it effectively tells OpenGL that you want to orphan the buffer's data storage. It's the reason why buffer objects can be used for pixel transfers; it allows the transfer to happen asynchronously. It's the reason why fence sync objects exist, so that you can tell whether a resource is still in use (perhaps for GL_UNSYNCHRONIZED_BIT buffer mapping). And so forth.
However, this calls blocks to support v-sync, meaning you can't issue new commands until it returns.
Says who? The buffer swapping command may stall. It may not. It's implementation-defined, and it can be changed with certain commands. The documentation for eglSwapBuffers only says that it performs a flush, which could stall the CPU but does not have to.

How exactly is GLSL's "coherent" memory qualifier interpreted by GPU drivers for multi-pass rendering?

The GLSL specification states, for the "coherent" memory qualifier: "memory variable where reads and writes are coherent with reads and writes from other shader invocations".
In practice, I'm unsure how this is interpreted by modern-day GPU drivers with regards to multiple rendering passes. When the GLSL spec states "other shader invocations", does that refer to shader invocations running only during the current pass, or any possible shader invocations in past or future passes? For my purposes, I define a pass as a "glBindFramebuffer-glViewPort-glUseProgram-glDrawABC-glDrawXYZ-glDraw123" cycle; where I'm currently executing 2 such passes per "render loop iteration" but may have more per iteration later on.
When the GLSL spec states "other shader invocations", does that refer to shader invocations running only during the current pass, or any possible shader invocations in past or future passes?
It means exactly what it says: "other shader invocations". It could be the same program code. It could be different code. It doesn't matter: shader invocations that aren't this one.
Normally, OpenGL handles synchronization for you, because OpenGL can track this fairly easily. If you map a range of a buffer object, modify it, and unmap it, OpenGL knows how much stuff you've (potentially) changed. If you use glTexSubImage2D, it knows how much stuff you changed. If you do transform feedback, it can know exactly how much data was written to the buffer.
If you do transform feedback into a buffer, then bind it as a source for vertex data, OpenGL knows that this will stall the pipeline. That it must wait until the transform feedback has completed, and then clear some caches in order to use the buffer for vertex data.
When you're dealing with image load/store, you lose all of this. Because so much could be written in a completely random, unknown, and unknowable fashion, OpenGL generally plays really loose with the rules in order to allow you flexibility to get the maximum possible performance. This triggers a lot of undefined behavior.
In general, the only rules you can follow are those outlined in section 2.11.13 of the OpenGL 4.2 specification. The biggest one (for shader-to-shader talk) is the rule on stages. If you're in a fragment shader, it is safe to assume that the vertex shader(s) that specifically computed the point/line/triangle for your triangle have completed. Therefore, you can freely load values that were stored by them. But only from the ones that made you.
Your shaders cannot make assumptions that shaders executed in previous rendering commands have completed (I know that sounds odd, given what was just said, but remember: "only from the ones that made you"). Your shaders cannot make assumptions that other invocations of the same shader, using the same uniforms, images, textures, etc, in the same rendering command have completed, except where the above applies.
The only thing you can assume is that writes your shader instance itself made are visible... to itself. So if you do an imageStore and do an imageLoad to the same memory location through the same variable, then you are guaranteed to get the same value back.
Well, unless someone else wrote to it in the meantime.
Your shaders cannot assume that a later rendering command will certainly fetch values written (via image store or atomic updates) by a previous one. No matter how much later! It doesn't matter what you've bound to the context. It doesn't matter what you've uploaded or downloaded (technically. Odds are you'll get correct behavior in some cases, but undefined behavior is still undefined).
If you need that guarantee, if you need to issue a rendering command that will fetch values written by image store/atomic updates, you must explicitly ask synchronize memory sometime after issuing the writing call and before issuing the reading call. This is done with glMemoryBarrier.
Therefore, if you render something that does image storing, you cannot render something that uses the stored data until an appropriate barrier has been sent (either explicitly in the shader or explicitly in OpenGL code). This could be an image load operation. But it could be rendering from a buffer object written by shader code. It could be a texture fetch. It could be doing blending to an image attached to the FBO. It doesn't matter; you can't do it.
Note that this applies for all operations that deal with image load/store/atomic, not just shader operations. So if you use image store to write to an image, you won't necessarily read the right data unless you use a GL_TEXTURE_UPDATE_BARRIER_BIT​ barrier.