OpenGL: Do compute shader work groups execute in parallel?

It is clearly stated here that compute shader invocations are executed in parallel within a single work group, and that we can synchronize their memory accesses within that group via the barrier() and memoryBarrier() functions.
But can the work groups within a single dispatch command actually be executed in parallel?
If so, am I right that it is impossible to synchronize their memory accesses with any GLSL function? barrier() works only within a single work group, so all that remains is external synchronization via glMemoryBarrier() to order memory accesses from all work groups; but in that case we have to split the compute shader into multiple shaders and execute them with separate dispatch commands.

Invocations from different work groups may be executed in parallel, yes. This is true of pretty much any shader stage's invocations.
And yes, you cannot perform inter-workgroup synchronization. This too is true of pretty much every shader stage: with one exception, you cannot synchronize between any invocations of the same stage.
That one exception is the fragment shader interlock extension, which, as the name suggests, is limited to fragment shaders.
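To make the conclusion concrete, here is a minimal host-side sketch of that split-into-two-dispatches pattern. The program names passA and passB and the group count groupsX are hypothetical; both programs are assumed to access the same SSBO.

    /* Pass 1: writes results into an SSBO. */
    glUseProgram(passA);
    glDispatchCompute(groupsX, 1, 1);

    /* Make pass 1's SSBO writes visible to shader reads in
       subsequent commands. */
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

    /* Pass 2: reads what pass 1 wrote. */
    glUseProgram(passB);
    glDispatchCompute(groupsX, 1, 1);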

Related

How to synchronize data for OpenGL compute shader soft body simulation

I'm trying to make a soft body physics simulation, using OpenGL compute shaders. I'm using a spring/mass model, where objects are modeled as being made out of a mesh of particles, connected by springs (Wikipedia link with more details). My plan is to have a big SSBO that stores the positions, velocities, and net forces for each particle. I'll have a compute shader that, for each spring, calculates the force between the particles on both ends of that spring (using Hooke's law) and adds that to the net forces for those two particles. Then I'll have another compute shader that, for each particle, does some sort of Euler integration using the data from the SSBO, and then zeros the net forces for the next frame.
My problem is with memory synchronization. Each particle is attached to more than one spring, so in the first compute shader, different invocations will be adding to the same location in memory (the one holding the net force). The spring calculations don't use data from that variable, so the writes can take place in whatever order, but I'm unfamiliar with how OpenGL memory works, and I'm not sure how to avoid race conditions. In addition, from what I've read it seems like I'll need glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) between the calls to the spring and particle compute shaders, so that the data written by the former is visible to the latter. Is this necessary/sufficient?
I'm afraid to experiment, because I'm not sure what protections OpenGL gives against undefined behavior and accidentally screwing up your computer.
different invocations will be adding to the same location in memory (the one holding the net force). The spring calculations don't use data from that variable, so the writes can take place in whatever order
In this case you would need to use atomicAdd() in GLSL to make sure two separate threads don't get into a race condition.
In your case I don't think this will be a performance issue, but you should be aware that atomicAdd() can cause a big slowdown when many threads are hitting the same location in memory at the same time (they have to serialize and wait for each other). This performance issue is called "contention", and depending on the problem, you can often improve it substantially by using warp-level primitives so that only one thread within each warp needs to actually commit the atomicAdd() (or other atomic operation).
Also, "warp" is Nvidia terminology; AMD calls them "wavefronts", and other hardware vendors and APIs use different names still.
In addition, from what I've read it seems like I'll need glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT)
This is correct. Conceptually the way I think about it is that OpenGL compute shaders are async by default. This means, when you launch a compute shader, there's no guarantee when it will execute relative to subsequent commands. glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) will basically create a wait() between any draw/compute commands accessing that type of resource.

Semantics of barrier() in OpenGL compute shader

Let's say I have an OpenGL compute shader written in GLSL, executing on an NVidia GeForce 970.
At the start of the shader, a single invocation writes to a "Shader Storage Buffer Object" (SSBO).
I then issue a suitable barrier, like memoryBarrier() in my GLSL.
I then read from the memory written in the first step, in each invocation.
Will that first write be visible to all invocations in the current compute operation?
At https://www.khronos.org/opengl/wiki/Memory_Model#Ensuring_visibility , Khronos say:
"Use coherent and an appropriate memoryBarrier* or groupMemoryBarrier call if you use a mechanism like barrier to synchronize between invocations."
I'm pretty sure it's possible to synchronize this way within a work group. But does it work for all invocations in every work group, in the entire compute operation?
I'm unsure how an entire set of work groups is scheduled. I would expect that they might run sequentially, which would make the kind of synchronization I'm asking about impossible.
But does it work for all invocations in every work group, in the entire compute operation?
No. The scope of barrier is explicitly within a work group. And you cannot have visibility of operations that you haven't ensured have happened yet. The order of execution of work groups with respect to one another is undefined, so you don't know if one work group has executed yet.
What you want isn't really possible. You need instead to change how your shaders work so that work groups are not dependent on each other. In this case, you can have every work group perform this computation. And instead of storing it in global memory via an SSBO, store the result in a shared variable.
Yes, you'll be computing the same value in each group. But that will yield better performance than having all of those work groups wait on one work group. Especially since that's not something you can actually do.
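A minimal GLSL sketch of that per-group approach, with a placeholder for the actual computation:

    #version 430
    layout(local_size_x = 128) in;

    shared vec4 groupValue; // each work group gets its own copy

    void main() {
        // One invocation per work group performs the computation...
        if (gl_LocalInvocationIndex == 0u) {
            groupValue = vec4(1.0); // stand-in for the real work
        }

        // ...and barrier() makes it visible to the rest of the group
        // (in compute shaders, barrier() synchronizes accesses to
        // shared variables).
        barrier();

        vec4 v = groupValue; // every invocation in the group reads it
        // ... use v ...
    }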

When does it make sense to turn off the rasterization step?

In Vulkan there is a struct required for pipeline creation, named VkPipelineRasterizationStateCreateInfo. In this struct there is a member named rasterizerDiscardEnable. If this member is set to VK_TRUE, then all primitives are discarded before the rasterization step. This disables any output to the framebuffer.
I cannot think of a scenario where this might make any sense. In which cases could it be useful?
It would be for any case where you're executing the rendering pipeline solely for the side effects of the vertex processing stage(s). For example, you could use a GS to feed data into a buffer, which you later render from.
Now in many cases you could use a compute shader to do something similar. But you can't use a CS to efficiently implement tessellation; that's best done by the hardware tessellator. So if you want to capture data generated by tessellation (presumably because you'll be rendering with it multiple times), you have to use a rendering process.
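In OpenGL, the equivalent switch is glEnable(GL_RASTERIZER_DISCARD), and the capture would use transform feedback. A sketch of the capture pass, with hypothetical buffer and count names:

    /* Pass 1: run vertex processing (including tessellation) only,
       capturing the outputs; nothing reaches the rasterizer. */
    glEnable(GL_RASTERIZER_DISCARD);
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, captureBuf);
    glBeginTransformFeedback(GL_TRIANGLES);
    glDrawArrays(GL_PATCHES, 0, patchVertexCount); /* tessellator runs */
    glEndTransformFeedback();
    glDisable(GL_RASTERIZER_DISCARD);

    /* Later passes: render repeatedly from captureBuf. */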
A useful side effect (though not necessarily the intended use case) of this parameter is for benchmarking, i.e. determining the bottleneck of your Vulkan application: if discarding all primitives before the rasterization stage (and thus before any fragment shaders are ever executed) does not improve your frame rate, then you can rule out that your application's performance is fragment-stage-bound.
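For reference, the member in question sits in the rasterization state; in this C sketch everything other than rasterizerDiscardEnable is just reasonable filler:

    /* Primitive discard enabled: vertex processing still runs, but
       nothing reaches the rasterizer or the framebuffer. */
    VkPipelineRasterizationStateCreateInfo rasterState = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_CREATE_INFO,
        .rasterizerDiscardEnable = VK_TRUE,
        .polygonMode = VK_POLYGON_MODE_FILL,
        .cullMode = VK_CULL_MODE_NONE,
        .frontFace = VK_FRONT_FACE_COUNTER_CLOCKWISE,
        .lineWidth = 1.0f,
    };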

Compute shaders optimal data division on invocations (threads) and workgroups

As far as I understand from the OpenGL documentation about compute shader compute spaces, I can divide the data space into local invocations (threads), which execute in parallel, and into workgroups, each containing some number of local invocations, which are executed in an unspecified order and not necessarily in parallel. Is that understanding correct? The main question is: what is the best strategy for dividing the data? Should I always try to maximize the local invocation size and minimize the number of workgroups to get better parallel execution, or is some other strategy better? For example, if I have 10000 elements in a data buffer (velocity in the x direction, say) and every element can be computed independently, how do I determine the best number of invocations (threads) and workgroups?
P.S. For everyone who stumbles upon this question, here is an interesting article to read, which might answer your questions https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-usage-large-thread-groups/
https://www.opengl.org/registry/doc/glspec45.core.pdf
Chapter 19:
A work group is a collection of shader invocations that execute the same code, potentially in parallel.
While the individual shader invocations within a work group are executed as a unit, work groups are executed completely independently and in unspecified order.
After reading this section quite a few times over, I find the "best" solution is to maximize local invocation size and minimize the number of work groups, because you then tell the driver to drop the requirement that invocation sets be independent. Fewer requirements mean fewer rules for the platform when it translates your intent into an execution, which universally yields a better (or equal) result.
An invocation within a work group may share data with other members of the same workgroup through shared variables (see section 4.3.8, "Shared Variables", of the OpenGL Shading Language Specification) and issue memory and control barriers to synchronize with other members of the same work group.
Independence between invocations can be derived by the platform when compiling the shader code.
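Applied to the question's example of 10000 independent elements, that strategy might look like this GLSL sketch; the local size of 256 is an assumption, since good values are hardware-dependent:

    #version 430
    layout(local_size_x = 256) in;
    layout(std430, binding = 0) buffer Velocities { float velX[]; };

    void main() {
        uint i = gl_GlobalInvocationID.x;
        if (i < 10000u) {
            velX[i] *= 0.99; // stand-in for the real per-element work
        }
    }

On the host side, glDispatchCompute((10000 + 255) / 256, 1, 1) rounds the group count up so every element is covered, which is why the shader needs the bounds check.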

Correct usage / purpose of OpenGL Program Pipeline Objects

With OpenGL 4.1 and ARB_separate_shader_objects, we are able to store different stages of the shading pipeline in separate shader programs. As we know, to use these, we need to attach them to a Program Pipeline Object, which is then bound.
My question is, why do we need program pipeline objects at all? In my renderer, I have only one of these, and I change its attachments to change shaders. I can't think of any case where you'd actually want more than one. If you store many pipeline objects, each containing a different combination of shader programs, then things end up even messier than not using separate shaders at all.
So, what is the purpose of the pipeline object? Is changing attachments (much) more expensive than binding a different pipeline object? What's the reason that the spec has this, rather than, say, having glUseProgramStages operate in the same way as glUseProgram?
The principal reason pipeline objects exist is that linking stages together in a program object did have certain advantages. For example, there are a lot of inter-shader-stage validation rules, and if a sequence of separate programs isn't valid, people need to know.
With a program that links all stages together, you can detect these validation failures at link time. All of these tests are done precisely once and no more.
If you made "glUseProgramStages operate in the same way as glUseProgram", then every single time you render with a new set of shaders, the system will have to do inter-stage validation tests. Pipelines represent a convenient way to cache such tests. If you set their programs once and never modify them afterwards, then the result of validation for a pipeline will never change. Thus validation happens exactly once, just as it did for multi-stage programs.
Another issue is that implementations may need to do some minor shader fixup work when associating certain programs with each other. Pipeline objects represent a convenient place to cache such fixup work. Without them, it would have to be done every single time you change shaders.
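A minimal C sketch of that caching pattern with the separate-shader-objects API; the shader source strings are assumed to exist:

    /* Build single-stage programs once. */
    GLuint vs  = glCreateShaderProgramv(GL_VERTEX_SHADER,   1, &vsSource);
    GLuint fsA = glCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &fsSourceA);
    GLuint fsB = glCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &fsSourceB);

    /* One pipeline per combination; each is validated and fixed up once. */
    GLuint pipes[2];
    glGenProgramPipelines(2, pipes);
    glUseProgramStages(pipes[0], GL_VERTEX_SHADER_BIT,   vs);
    glUseProgramStages(pipes[0], GL_FRAGMENT_SHADER_BIT, fsA);
    glUseProgramStages(pipes[1], GL_VERTEX_SHADER_BIT,   vs);
    glUseProgramStages(pipes[1], GL_FRAGMENT_SHADER_BIT, fsB);

    /* Per draw: just bind the cached combination. */
    glBindProgramPipeline(pipes[0]);
    /* ... draw ... */
    glBindProgramPipeline(pipes[1]);
    /* ... draw ... */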
Why do we need the program pipeline objects?
We don't need program pipeline objects; they are purely optional. Using one program object for every shader combination that is in use is the easiest and most common way to do it.
So, what is the purpose of the pipeline object?
From https://www.opengl.org/registry/specs/ARB/separate_shader_objects.txt:
[...] Many developers build their shader content around the mix-and-match approach where they can use a single vertex shader with multiple fragment shaders (or vice versa). This extension adopts a "mix-and-match" shader stage model for GLSL, allowing multiple different GLSL program objects to be bound at once, each to an individual rendering pipeline stage independently of other stage bindings. This allows program objects to contain only the shader stages that best suit the application's needs. [...]