Semantics of barrier() in OpenGL compute shaders

Let's say I have an OpenGL compute shader written in GLSL, executing on an NVIDIA GeForce 970.
At the start of the shader, a single invocation writes to a "Shader Storage Buffer Object" (SSBO).
I then issue a suitable barrier, like memoryBarrier() in my GLSL.
I then read from the memory written in the first step, in each invocation.
Will that first write be visible to all invocations in the current compute operation?
At https://www.khronos.org/opengl/wiki/Memory_Model#Ensuring_visibility , Khronos say:
"Use coherent and an appropriate memoryBarrier* or groupMemoryBarrier call if you use a mechanism like barrier to synchronize between invocations."
I'm pretty sure it's possible to synchronize this way within a work group. But does it work for all invocations in every work group, in the entire compute operation?
I'm unsure how an entire set of work groups is scheduled. I would expect them to possibly run sequentially, making the kind of synchronization I'm asking about impossible?

But does it work for all invocations in every work group, in the entire compute operation?
No. The scope of barrier is explicitly within a work group. And you cannot have visibility of operations that you haven't ensured have happened yet. The order of execution of work groups with respect to one another is undefined, so you don't know if one work group has executed yet.
What you want isn't really possible. You need instead to change how your shaders work so that work groups are not dependent on each other. In this case, you can have every work group perform this computation. And instead of storing it in global memory via an SSBO, store the result in a shared variable.
Yes, you'll be computing the same value in each group. But that will yield better performance than having all of those work groups wait on one work group. Especially since that's not something you can actually do.
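A minimal sketch of that restructuring, assuming hypothetical buffer names and a made-up computation: one invocation per work group writes a shared variable, and barrier() makes the result available to the rest of that group (and only that group).

#version 430
layout(local_size_x = 64) in;

// Hypothetical output buffer; the binding index is an assumption.
layout(std430, binding = 0) buffer Result { float results[]; };

shared float groupValue;   // visible only within this work group

void main() {
    // One invocation per work group computes the value...
    if (gl_LocalInvocationIndex == 0u) {
        groupValue = 42.0;   // stand-in for the real computation
    }
    // ...and barrier() ensures every invocation in this group sees it before
    // reading (in compute shaders it also orders shared memory accesses).
    barrier();
    results[gl_GlobalInvocationID.x] = groupValue * 2.0;
}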

Related

How to synchronize data for OpenGL compute shader soft body simulation

I'm trying to make a soft body physics simulation, using OpenGL compute shaders. I'm using a spring/mass model, where objects are modeled as being made out of a mesh of particles connected by springs (Wikipedia link with more details). My plan is to have a big SSBO that stores the positions, velocities, and net forces for each particle. I'll have a compute shader that, for each spring, calculates the force between the particles on both ends of that spring (using Hooke's law) and adds that to the net forces for those two particles. Then I'll have another compute shader that, for each particle, does some sort of Euler integration using the data from the SSBO, and then zeros the net forces for the next frame.
My problem is with memory synchronization. Each particle is attached to more than one spring, so in the first compute shader, different invocations will be adding to the same location in memory (the one holding the net force). The spring calculations don't use data from that variable, so the writes can take place in whatever order, but I'm unfamiliar with how OpenGL memory works, and I'm not sure how to avoid race conditions. In addition, from what I've read it seems like I'll need glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) between the calls to the spring and particle compute shaders, so that the data written by the former is visible to the latter. Is this necessary/sufficient?
I'm afraid to experiment, because I'm not sure what protections OpenGL gives against undefined behavior and accidentally screwing up your computer.
different invocations will be adding to the same location in memory (the one holding the net force). The spring calculations don't use data from that variable, so the writes can take place in whatever order
In this case you would need to use atomicAdd() in GLSL to make sure two separate threads don't get into a race condition.
In your case I don't think this will be a performance issue, but you should be aware that atomicAdd() can cause a big slowdown in cases where many threads are hitting the same location in memory at the same time (they have to serialize and wait for each other). This performance issue is called "contention", and depending on the problem, you can usually improve it a lot by using warp-level primitives to make sure only one thread within each warp needs to actually commit the atomicAdd() (or other atomic operation).
Also, "warps" are NVIDIA terminology; AMD calls them "wavefronts", and other hardware vendors and APIs use still other names.
In addition, from what I've read it seems like I'll need glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT)
This is correct. Conceptually, the way I think about it is that shader writes through SSBOs (incoherent memory accesses) are not automatically made visible to subsequent commands. glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) basically creates a wait between the dispatch that writes the buffer and any later draw/compute commands that access that type of resource, so the second pass sees what the first pass wrote.
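On the host side the pattern would look roughly like this (the program objects and the local size of 64 are assumptions; an OpenGL 4.3+ context is presumed):

// Run the spring pass, make its SSBO writes visible, then run the particle pass.
glUseProgram(springProgram);                        // hypothetical spring compute program
glDispatchCompute((numSprings + 63) / 64, 1, 1);    // 64 = local_size_x assumed in the shader

// Make the spring pass's shader storage writes visible to subsequent commands.
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

glUseProgram(particleProgram);                      // hypothetical particle compute program
glDispatchCompute((numParticles + 63) / 64, 1, 1);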

OpenGL: Do compute shader work groups execute in parallel?

It is clearly stated here that compute shader invocations are executed in parallel within a single work group. And we can synchronize their accesses to memory via barrier() + memoryBarrier() functions within that single group.
But may actually compute shader work groups within a single dispatch command be executed in parallel?
If so, am I right that it is impossible to synchronize their memory accesses using any GLSL functions? barrier() works only within a single work group, so the only option is external synchronization via glMemoryBarrier, but in that case we have to split the compute shader into multiple shaders and execute them with separate dispatch commands.
Invocations from different work groups may be executed in parallel, yes. This is true of pretty much any shader stage's invocations.
And yes, you cannot perform inter-workgroup synchronization. This is also true of pretty much any shader stage; with one other exception, you cannot synchronize between any invocations of the same stage.
Well, not unless you have access to the fragment shader interlock extension, which as the name suggests is limited to fragment shaders.

How to use GL_MAP_UNSYNCHRONIZED_BIT with GL_MAP_PERSISTENT_BIT?

I have been working with GL_MAP_PERSISTENT_BIT and glBufferStorage/glMapBufferRange. I am curious if there is an improvement in performance possible using GL_MAP_UNSYNCHRONIZED_BIT.
I already found Opengl Unsynchronized/Non-blocking Map
But the answer seems a bit contradictory to me. It says there that you need to sync or block when using this flag. What is the point of setting it unsynchronized if I have to sync it later anyway? Also, I tried this combination and was not able to see any performance difference. Does it even make sense together with persistently mapped buffers? I have found literally no examples of such usage.
The mentioned topic also says that you can
issue a barrier or flush that region of the buffer explicitly
But every attempt I made so far using these only resulted in garbage.
I am currently using triple buffering, but I sometimes have to deal with very small chunks of data that I can hardly batch, and I found that glBufferData is often faster in those cases; persistent buffers are only of (huge) benefit if I can batch and also reduce the number of draw calls. Using GL_MAP_UNSYNCHRONIZED_BIT could be the key here.
Can anyone give me a working example, in case it even makes sense in this combination?
What is the point of setting it unsynchronized if I have to sync it later anyway?
The point, as stated by that answer, is that OpenGL isn't doing the synchronization for you. You control when the synchronization happens. This means that you can ensure that it doesn't happen at an inappropriate time. By using your own synchronization, you can also ask the question, "are you finished using the buffer?" which is not a question you could ask without your own sync system.
By using unsynchronized mapping, you stop the implementation from having to check its own internal sync in addition to your synchronization.
However, that answer you linked to applies primarily to non-persistent mapping (since that's what the question was about). Unsynchronized mapping only applies to the map call itself. It prevents GL from issuing internal synchronization due to you calling glMapBufferRange.
But unsynchronized mapping doesn't really affect persistent mapping because... well, it's persistent. The whole point of the feature is that you keep the buffer mapped, so you're only going to call glMapBufferRange once. And the unsynchronized bit only applies at the moment you call glMapBufferRange.
So whether you use unsynchronized or not with your persistent mapping call is essentially irrelevant.
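For reference, here is a sketch of the usual persistently mapped pattern, where the synchronization happens through fence objects rather than through the map call itself (the region size, the triple-buffer index i, the source pointer cpuData, and the use of the GL 4.5 DSA entry points are all assumptions):

#include <cstring>       // memcpy
#include <glad/glad.h>   // or whichever GL function loader you use

// Create the buffer once and keep it mapped for its whole lifetime.
GLuint buf;
glCreateBuffers(1, &buf);
const GLsizeiptr regionSize = 64 * 1024;                    // made-up region size
const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glNamedBufferStorage(buf, 3 * regionSize, nullptr, flags);  // three regions for triple buffering
char* ptr = static_cast<char*>(glMapNamedBufferRange(buf, 0, 3 * regionSize, flags));
GLsync fence[3] = {};                                        // one fence per region

// Each frame, for region i (i = frameIndex % 3):
if (fence[i]) {                                              // wait until the GPU finished reading region i
    glClientWaitSync(fence[i], GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000);  // timeout in nanoseconds
    glDeleteSync(fence[i]);
}
std::memcpy(ptr + i * regionSize, cpuData, regionSize);      // fill region i with new data
// ... issue the draw calls that read from region i ...
fence[i] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);    // fence the commands issued so far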

Compute shaders optimal data division on invocations (threads) and workgroups

As far as I understand from the OpenGL documentation on compute shader execution spaces, I can divide the data space into local invocations (threads), which execute in parallel, and into work groups, each containing some number of local invocations, which are executed not necessarily in parallel (?) and in an unspecified order; do I understand that correctly? My main question is: what is the best strategy for dividing the data? Should I always try to maximize the local invocation count and minimize the number of work groups to get better parallel execution, or is some other strategy better? For example, if I have 10,000 elements in a data buffer (velocity in the x direction, say) and every element can be computed independently, how do I determine the best number of invocations (threads) and work groups?
P.S. For everyone who stumbles upon this question, here is an interesting article to read, which might answer your questions https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-usage-large-thread-groups/
https://www.opengl.org/registry/doc/glspec45.core.pdf
Chapter 19:
A work group is a collection of shader invocations that execute the same code, potentially in parallel. While the individual shader invocations within a work group are executed as a unit, work groups are executed completely independently and in unspecified order.
After reading these sections quite a few times over, I find the "best" solution is to maximize local invocation size and minimize the number of work groups, because you then tell the driver to omit the requirement that invocation sets be independent. Fewer requirements mean fewer rules for the platform when it turns your intent into an execution, which universally yields better (or the same) results.
An invocation within a work group may share data with other members of the same work group through shared variables (see section 4.3.8, "Shared Variables", of the OpenGL Shading Language Specification) and issue memory and control barriers to synchronize with other members of the same work group.
Independence between invocations can be derived by the platform when compiling the shader code.
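For the concrete 10,000-element case, a common starting point is a local size such as 256 (a tunable assumption, not a rule), enough work groups to cover the data, and a bounds check because 10,000 is not a multiple of 256:

#version 430
layout(local_size_x = 256) in;                     // 256 invocations per work group (assumed value)

layout(std430, binding = 0) buffer Velocities { float velX[]; };

uniform uint elementCount;                         // 10000 in the example

void main() {
    uint i = gl_GlobalInvocationID.x;
    if (i >= elementCount) return;                 // the last work group is only partially used
    velX[i] += 1.0;                                // stand-in for the real per-element update
}

// Host side: glDispatchCompute((elementCount + 255) / 256, 1, 1);
// i.e. ceil(10000 / 256) = 40 work groups of 256 invocations each.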

Storing OpenGL state

Suppose I'm trying to make some kind of small OpenGL graphics engine in C++. I've read that accessing OpenGL state via the glGet* functions can be quite expensive (while reading OpenGL state tends to be a frequent operation), and that it's strongly recommended to store a copy of the OpenGL state somewhere with fast read/write access.
I'm currently thinking of storing the OpenGL state as a global thread_local variable of some appropriate type. How bad is that design? Are there any pitfalls?
If you want to stick with OpenGL's design (where your context pointer could be considered "thread_local") I guess it's a valid option... Obviously, you will need to have full control over all OpenGL calls in order to keep your state copy in sync with the current context's state.
I personally prefer to wrap the OpenGL state of interest using an "OpenGLState" class with a bunch of settable/gettable properties each mapping to some part of the state. You can then also avoid setting the same state twice. You could make it thread_local, but I couldn't (Visual C++ only supports thread_local for POD types).
You will need to be very careful, as some OpenGL calls indirectly change seemingly unrelated parts of the context's state. For example, glDeleteTextures will reset any binding of the deleted texture(s) to 0. Also, some toolkits are very "helpful" in changing OpenGL state behind your back (for example, QtOpenGLContext on OSX changes your viewport for you when made current).
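A minimal sketch of that kind of wrapper (the class and member names are made up), caching one piece of state and skipping redundant GL calls:

#include <glad/glad.h>                             // or whichever GL function loader you use

// Hypothetical example: cache the currently bound program object.
class OpenGLState {
public:
    void useProgram(GLuint program) {
        if (program != boundProgram_) {            // skip redundant state changes
            glUseProgram(program);
            boundProgram_ = program;
        }
    }
    // Call this if other code (a toolkit, for example) touched GL directly.
    void invalidateProgram() { boundProgram_ = 0; }

private:
    GLuint boundProgram_ = 0;                      // GL's initial state: no program bound
};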
Since you can only (reasonably) use a GL context with one thread, why do you need thread local? Yes, you can make a context current in different threads at different times, but this is not a wise design.
You will usually have one context and one thread accessing it. In rare cases, you will have two contexts (often shared) with two threads. In that case, you can simply put any additional state you wish to save into your context class, of which each instance is owned by exactly one thread.
But most of the time, you need not explicitly "remember" states anyway. All states have well-documented initial states, and they only change when you change them (exception being changes made by a "super smart" toolkit, but storing a wrong state doesn't help in that case either).
You will usually try to batch states together and issue many "similar" draw calls with one set of states, the reason being that state changes stall the pipeline and require expensive validation before the next draw calls.
So, start off with the defaults, and set everything that needs to be non-default before drawing a batch. Then change what needs to be different for the next batch.
If you can't be bothered to dig through the specs for default values and keep track, you can redundantly set everything all the time. Then run your application in GDebugger, which will tell you what state changes are redundant, so you can eliminate them.