GPU hangs on spin-lock mechanism (GLSL + Vulkan) [duplicate]

Example 11.13 of the OpenGL red book, 9th edition (OpenGL 4.5), is "Simple Per-Pixel Mutex". It uses imageAtomicCompSwap in a do {} while() loop to take a per-pixel lock, preventing simultaneous access to a shared resource by fragment shader invocations that map to the same pixel coordinate.
layout (binding = 0, r32ui) uniform volatile coherent uimage2D lock_image;

void main(void)
{
    ivec2 pos = ivec2(gl_FragCoord.xy);

    // spinlock - acquire
    uint lock_available;
    do {
        lock_available = imageAtomicCompSwap(lock_image, pos, 0, 1);
    } while (lock_available != 0);

    // do some operations protected by the lock
    do_something();

    // spinlock - release
    imageStore(lock_image, pos, uvec4(0));
}
This example results in APPCRASH on both Nvidia and AMD GPUs. I know that on these two platforms fragment shader invocations cannot progress independently of each other: a sub-group of threads is executed in lockstep, sharing the control flow (a "warp" of 32 threads in Nvidia's terminology), which can result in deadlock.
However, nowhere does the OpenGL spec mention "threads executed in lockstep". It only says "The relative order of invocations of the same shader type are undefined." So, as in this example, why can we not use the atomic operation imageAtomicCompSwap to ensure exclusive access between different fragment shader invocations? Does this mean Nvidia and AMD GPUs do not conform to the OpenGL spec?

As in this example, why can we not use the atomic operation imageAtomicCompSwap to ensure exclusive access between different fragment shader invocations?
If you are using atomic operations to lock access to a pixel, you are relying on one aspect of relative order: that all threads will eventually make forward progress. That is, you assume that any thread spinning on a lock will not starve the thread that holds the lock of its execution resources, and that the thread holding the lock will eventually make forward progress and release it.
But since the relative order of execution is undefined, there is no guarantee of any of that. And therefore, your code cannot work. Any code which relies on any aspect of ordering between the invocations of a single shader stage cannot work (unless there are specific guarantees in place).
This is precisely why ARB_fragment_shader_interlock exists.
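That extension lets a fragment shader mark a critical section that is mutually exclusive (and optionally ordered) between overlapping fragments, so no lock image or spin loop is needed at all. A minimal sketch, assuming the protected data lives in an rgba8 image (the image name and the read-modify-write computation are made up for the example):

#version 450 core
#extension GL_ARB_fragment_shader_interlock : require

// critical sections of fragments covering the same pixel execute
// one at a time, in primitive order
layout (pixel_interlock_ordered) in;

layout (binding = 0, rgba8) uniform coherent image2D data_image;

void main(void)
{
    ivec2 pos = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockARB();
    // read-modify-write protected by the interlock, not by a lock
    vec4 v = imageLoad(data_image, pos);
    imageStore(data_image, pos, v * 0.5 + vec4(0.25));
    endInvocationInterlockARB();
}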
That being said, even if there were guarantees of forward progress, your code would still be broken.
You use a non-atomic operation to release the lock. You should be using an atomic set operation.
Plus, as others have pointed out, you need to keep spinning as long as the return value from the atomic compare/swap is not zero. Remember: all atomic functions return the original value from the image. So if the original value it atomically read was not 0, the comparison failed and you do not have the lock.
Now, your code will still be UB by the spec. But it's more likely to work.
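For the release in particular, the fix is an atomic write, e.g. (matching the lock_image declaration above):

// spinlock - release, done with an atomic write this time
imageAtomicExchange(lock_image, pos, 0u);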

However, nowhere does the OpenGL spec mention "threads executed in lockstep". It only says "The relative order of invocations of the same shader type are undefined."
You say this as if that wording of the GL spec did not cover the "lockstep" situation. But "The relative order of invocations of the same shader type are undefined." does cover exactly that. Given two shader invocations A and B, this statement means that you must not assume any of the following:
that A is executed before B
that B is executed before A
that A and B are executed in parallel
that A and B are not executed in parallel
that parts of A are executed before the same or other parts of B
that parts of B are executed before the same or other parts of A
that parts of A and B are executed in parallel
that parts of A and B are not executed in parallel
... (probably a lot more) ...
The undefined order means you can never wait on the result of another invocation, because there is no guarantee that the other invocation will even be executed before the wait, except in situations where the GL spec makes certain extra guarantees, e.g.:
when using explicit synchronization mechanisms like barrier() (see the compute shader sketch below)
there are some weak ordering guarantees between different shader stages
(E.g., it is allowed to assume that all vertex shader invocations for a primitive have already happened when processing a fragment of that very primitive.)
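To make the barrier() case concrete, here is a minimal compute shader sketch (the shared array and its size are made up for the example) where explicit synchronization is exactly what makes reading another invocation's write legal:

#version 450 core

layout (local_size_x = 64) in;

// hypothetical scratch array, shared within one work group
shared uint s_data[64];

void main(void)
{
    s_data[gl_LocalInvocationIndex] = gl_LocalInvocationIndex * 2u;

    // explicit guarantee: every invocation in the work group reaches
    // this point before any invocation proceeds past it, and shared
    // writes made before it are visible afterwards
    barrier();

    // safe only because of the barrier: read a neighbour's write
    uint neighbour = s_data[(gl_LocalInvocationIndex + 1u) % 64u];
    // ... use neighbour ...
}

No such guarantee exists between fragment shader invocations, which is why the spin lock cannot be rescued this way.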
For example, the GLSL Spec, Version 4.60 explains the concept of "invocation groups" in section 8.18:
Implementations of the OpenGL Shading Language may optionally group multiple shader invocations for a single shader stage into a single SIMD invocation group, where invocations are assigned to groups in an undefined implementation-dependent manner.
and the accompanying GL 4.6 core profile spec defines "invocation groups" in section 7.9 as
An invocation group [...] for a compute shader is the set of invocations in a single work group. For graphics shaders, an invocation group is an implementation-dependent subset of the set of shader invocations of a given shader stage which are produced by a single drawing command. For MultiDraw* commands with drawcount greater than one, invocations from separate draws are in distinct invocation groups.
So except for compute shaders, the GL gives you only draw-call granularity over the invocation groups. This section of the spec also has a footnote to make this absolutely clear:
Because the partitioning of invocations into invocation groups is implementation-dependent and not observable, applications generally need to assume the worst case of all invocations in a draw belong to a single invocation group.
So besides that stronger statement about undefined relative invocation order, the spec also covers "in-lockstep" SIMD processing, and makes it very clear that you do not have much control over it in the graphics pipeline.

If the execution order is the problem, reordering the code a bit might solve it. With the critical section and the release moved inside the loop, an invocation that acquires the lock does its protected work and releases the lock in the same iteration in which its lockstep neighbours are still spinning, so the lock holder is never parked waiting for a loop exit that can never come:
layout (binding = 0, r32ui) uniform volatile coherent uimage2D lock_image;

void main(void)
{
    ivec2 pos = ivec2(gl_FragCoord.xy);

    // spinlock - acquire
    uint lock_available;
    do {
        lock_available = imageAtomicCompSwap(lock_image, pos, 0, 1);
        if (lock_available == 0)
        {
            // do some operations protected by the lock
            do_something();

            // spinlock - release
            imageAtomicExchange(lock_image, pos, 0);
        }
    } while (lock_available != 0);
}

Related

GLSL: can't implement a spinlock [duplicate]


Issue with simple atomic counter test in OpenGL compute shader

I've been trying to wrap my head around memory synchronization and coherency by trying trivial examples.
In this, I'm dispatching a compute shader with 8x8x1 size work groups. The number of work groups is sufficient to cover the screen, which is 720x480.
Compute shader code:
#version 450 core

layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

layout (binding = 0, rgba8) uniform image2D u_fboImg;
layout (binding = 0, offset = 0) uniform atomic_uint u_counters[100];

void main() {
    ivec2 texCoord = ivec2(gl_GlobalInvocationID.xy);

    // Only use shader invocations within first 100x500 pixels
    if (texCoord.x >= 100 || texCoord.y >= 500) {
        return;
    }

    // Each counter should be incremented 400 times
    atomicCounterIncrement(u_counters[texCoord.x]);
    memoryBarrier();

    // Use only "bottom row" of invocations to draw results
    // Draw a white column as high as the counter at given x
    if (texCoord.y == 0) {
        int c = int(atomicCounter(u_counters[texCoord.x]));
        for (int y = 0; y < c; ++y) {
            imageStore(u_fboImg, ivec2(texCoord.x, y), vec4(1.0f));
        }
    }
}
This is what I get (the heights of the jagged bars differ from run to run, but are on average about that height): [screenshot omitted]
This is what I would expect, and is the result of hard-coding the for loop to go to 400: [screenshot omitted]
Strangely enough, if I decrease the number of work groups in the dispatch, say by halving the x value (which would now only cover half the screen), the bars get bigger: [screenshot omitted]
Finally, to prove there isn't some other nonsense going on, here I'm just coloring based on the local invocation ID: [screenshot omitted]
Edit: I forgot to mention that the dispatch is followed immediately by glMemoryBarrier(GL_ALL_BARRIER_BITS);
Unless otherwise stated, all shader invocations for a particular shader stage, including the compute shader stage, execute independently of one another, in an order which is undefined. Calling memoryBarrier does not change this fact; it makes your writes visible to other invocations, but it does not make other invocations execute. This means that, when the code after the memoryBarrier executes, there is no guarantee that the atomic counter has been incremented by all of the shader invocations that will eventually do so.
So what you're seeing is exactly what one would expect to see: the invocations writing somewhat random values, depending on the implementation-dependent order that the invocations just so happen to be executed in.
What you're wanting to do is execute all of the atomic increments for all invocations, then read those values and draw stuff based on what you read. Your code as written cannot do that.
While compute shaders do have some ability to manipulate the order of execution of invocations, this only works for invocations within the same work group (this is in fact why work groups exist). That is, you can have invocations ordered to a degree in a work group, but never between work groups.
The simple fix for this is to turn it into two compute shader dispatch operations. The first does all of the incrementing. The second reads the final values and writes the results to the image.
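A sketch of that two-pass structure, reusing the bindings from the question; the host code is assumed to issue a glMemoryBarrier(GL_ATOMIC_COUNTER_BARRIER_BIT) between the two dispatches:

// pass 1: do all of the incrementing, nothing else
#version 450 core
layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;
layout (binding = 0, offset = 0) uniform atomic_uint u_counters[100];

void main() {
    ivec2 texCoord = ivec2(gl_GlobalInvocationID.xy);
    if (texCoord.x >= 100 || texCoord.y >= 500) {
        return;
    }
    atomicCounterIncrement(u_counters[texCoord.x]);
}

// pass 2: the counters are final now; read them and draw
#version 450 core
layout (local_size_x = 100) in; // one invocation per counter
layout (binding = 0, rgba8) uniform image2D u_fboImg;
layout (binding = 0, offset = 0) uniform atomic_uint u_counters[100];

void main() {
    int x = int(gl_GlobalInvocationID.x);
    int c = int(atomicCounter(u_counters[x]));
    for (int y = 0; y < c; ++y) {
        imageStore(u_fboImg, ivec2(x, y), vec4(1.0f));
    }
}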
More clever solutions would involve employing work groups. That is, group your work so that every invocation which would have incremented the same atomic counter executes within the same work group. This way, you don't even need atomic counters; you just use shared variables (which can be the target of atomic operations). You call barrier() after all of the incrementing of the shared variable; that ensures that all invocations have executed at least that far before any invocation continues past that point. So all of the incrementing is done.
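A sketch of that approach, assuming the host dispatches one work group per counter, i.e. glDispatchCompute(100, 1, 1):

#version 450 core

// one work group per counter: the 400 invocations that would have
// hit the same atomic counter now run in a single work group
layout (local_size_x = 400) in;
layout (binding = 0, rgba8) uniform image2D u_fboImg;

shared uint s_count;

void main() {
    if (gl_LocalInvocationIndex == 0u) {
        s_count = 0u;
    }
    barrier(); // s_count is initialized before anyone increments it

    atomicAdd(s_count, 1u);
    barrier(); // every increment has happened before anyone reads

    if (gl_LocalInvocationIndex == 0u) {
        int x = int(gl_WorkGroupID.x);
        for (int y = 0; y < int(s_count); ++y) {
            imageStore(u_fboImg, ivec2(x, y), vec4(1.0f));
        }
    }
}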

GLSL per-pixel spinlock using imageAtomicCompSwap


Using only part of a varying variable

Let's say I have a varying variable between any two GLSL shader stages (e.g. the vertex and fragment stage) declared as a vec4:
in/out/varying vec4 texCoord;
What happens if I only use part of that variable (say, through swizzling) in both shaders, i.e. I only write to a part of it in the vertex shader and only read from that same part in the fragment shader?
// vertex shader
texCoord.st = ...
// fragment shader
... = texture2D(..., texCoord.st);
Is that guaranteed (i.e. by specification) to always produce sane results? It seems reasonable that it would, but I'm not too well-versed in the intricacies of GLSL language-lawyering and don't know whether that varying variable is treated as somehow "incomplete" by the compiler/linker because it isn't fully written in the preceding stage. I'm sure the values of texCoord.pq will be undefined anyway, but does that affect the validity of texCoord.st too, or does the whole varying system operate on a pure component level?
I haven't found anything to that effect in the GLSL specification on first glance, and I would prefer answers based either on the actual specification or on other "official" guarantees, rather than statements that it should work on reasonable hardware (unless of course this case simply is unspecified or implementation-defined). I would also be interested in any changes to this behavior throughout GLSL's history, including its application to deprecated built-in varying variables like gl_TexCoord[] in good old GLSL 1.10.
I'm trying to argue that your code will be fine, as per the specification. However, I'm not sure if you will find my reasoning 100% convincing, because I think that the spec seems somewhat imprecise about this. I'm going to refer to the OpenGL 4.5 Core Profile Specification and the OpenGL Shading language 4.50 specification.
Concerning input and output variables, the GLSL spec establishes the following in section 4.3.4
Shader input variables are declared with the storage qualifier in. They form the input interface between previous stages of the OpenGL pipeline and the declaring shader. [...] Values from the previous pipeline stage are copied into input variables at the beginning of shader execution.
and 4.3.6, respectively:
Shader output variables are declared with a storage qualifier using the storage qualifier out. They form the output interface between the declaring shader and the subsequent stages of the OpenGL pipeline. [...] During shader execution they will behave as normal unqualified global variables. Their values are copied out to the subsequent pipeline stage on shader exit. Only output variables that are read by the subsequent pipeline stage need to be written; it is allowed to have superfluous declarations of output variables.
Section 5.8 "Assignments" establishes that
Reading a variable before writing (or initializing) it is legal, however the value is undefined.
Since the assignment to the .st swizzle writes only that sub-vector, we can establish that the variable will contain two initialized and two uninitialized components at the end of the shader invocation, and the whole vector will be copied to the output.
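To make that concrete, a minimal vertex shader illustration (the actual values written are arbitrary):

#version 450 core

out vec4 texCoord;

void main(void)
{
    // initializes only components 0 and 1 (.s and .t)
    texCoord.st = vec2(0.25, 0.75);

    // at shader exit, texCoord == (0.25, 0.75, undefined, undefined),
    // and the whole vec4 is copied out to the next stage
    gl_Position = vec4(0.0, 0.0, 0.0, 1.0);
}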
Section 11.1.2.1 of the GL spec states:
If the output variables are passed directly to the vertex processing stages leading to rasterization, the values of all outputs are expected to be interpolated across the primitive being rendered, unless flatshaded. Otherwise the values of all outputs are collected by the primitive assembly stage and passed on to the subsequent pipeline stage once enough data for one primitive has been collected.
"The values of all outputs" are determined by the shader, and although some components have undefined values, they still have values, and there is no undefined or implementation-defined behavior here. The interpolation formulas for the line and polygon primitives (sections 14.5.1 and 14.6.1) also never mix between the components, so any defined component value will result in a defined value in the interpolated datum.
Section 11.1.2.1 also contains this statement about the vertex shader outputs:
When a program is linked, all components of any outputs written by a vertex shader will count against this limit. A program whose vertex shader writes more than the value of MAX_VERTEX_OUTPUT_COMPONENTS components worth of outputs may fail to link, unless device-dependent optimizations are able to make the program fit within available hardware resources.
Note that this language implies that the full 4 components of a vec4 are counted against the limit as soon as a single component is written to.
On output variables, the specification says:
Their values are copied out to the subsequent pipeline stage on shader exit.
So the question boils down to two things:
What is the value of such an output variable?
That is easily answered. The section on swizzling makes it clear that writing to a swizzle mask will not modify the components that are not part of the swizzle mask. Since you did not write to those components, their values are undefined. So undefined values will be copied out to the subsequent pipeline stage.
Will interpolation of undefined values affect the interpolation of defined values?
No. Interpolation is a component-wise operation. The result of one component's interpolation cannot affect another's.
So this is fine.

Setting gl_TessLevel only for one output vertex?

On the Internet I found some examples of TCS code where the gl_TessLevel* variables are set only for one output patch vertex:
// first code snippet
if (gl_InvocationID == 0) // set tessellation level, can do only for one vertex
{
    gl_TessLevelOuter[0] = foo;
    gl_TessLevelOuter[1] = bar;
}
instead of just
// second code snippet
gl_TessLevelOuter[0] = foo;
gl_TessLevelOuter[1] = bar;
It works the same with and without the condition check, but I didn't find anything about this usage on the OpenGL wiki.
Thinking about it logically, it should be OK to set these variables in only one TCS invocation, and it would be weird to set them to different values based on gl_InvocationID. So my questions are:
Is this way of setting gl_TessLevel* correct and may it cause errors or crashes on some platforms?
If it's correct, should it be used always? Is it idiomatic?
And finally, how do the two snippets affect performance? Might the first snippet slow things down due to branching? Might the second snippet cause redundant and/or idle invocations of subsequent pipeline stages, also hurting performance?
What you are seeing here is an attempt by the shader's author to establish a convention similar to provoking vertices used by other primitive types.
OpenGL Shading Language 4.50 - 2.2 Tessellation Control Processor - p. 7
Tessellation control shader invocations run mostly independently, with undefined relative execution order. However, the built-in function barrier() can be used to control execution order by synchronizing invocations, effectively dividing tessellation control shader execution into a set of phases. Tessellation control shaders will get undefined results if one invocation reads a per-vertex or per-patch attribute written by another invocation at any point during the same phase, or if two invocations attempt to write different values to the same per-patch output in a single phase.
It is unclear from the shader pseudo-code whether foo and bar are uniform across all TCS invocations. If they are not, the second code snippet invokes undefined behavior due to the undefined relative execution order.
Arbitrarily deciding that the first invocation is the only one allowed to write the per-patch attribute solves this problem, and is analogous to a first-vertex provoking convention. A last-vertex convention could just as easily be implemented, since the number of patch vertices is known to all invocations.
None of this is necessary if you know foo and bar are constant, however.
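For illustration, a last-vertex convention might look like this (the output patch size of 4 and the literal tessellation levels are made up for the example):

#version 450 core

layout (vertices = 4) out;

void main(void)
{
    // per-vertex work, done by every invocation
    gl_out[gl_InvocationID].gl_Position = gl_in[gl_InvocationID].gl_Position;

    // last-vertex convention: gl_out.length() is the output patch size,
    // a compile-time constant known to every invocation
    if (gl_InvocationID == gl_out.length() - 1)
    {
        gl_TessLevelOuter[0] = 2.0; // stand-in for foo
        gl_TessLevelOuter[1] = 4.0; // stand-in for bar
    }
}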