Why does barrier synchronize shared memory when memoryBarrier doesn't? - glsl

The following GLSL compute shader simply copies inImage to outImage. It is derived from a more complex post-processing pass.
In the first several lines of main(), a single thread loads 64 pixels of data into the shared array. Then, after synchronizing, each of the 64 threads writes one pixel to the output image.
Depending on how I synchronize, I get different results. I originally thought memoryBarrierShared() would be the correct call, but it produces a striped result,
which is the same result as having no synchronization at all or using memoryBarrier() instead.
If I use barrier(), I get the desired result (a correct copy of the input).
The striping is 32 pixels wide, and if I change the workgroup size to anything less than or equal to 32, I get correct results.
What's going on here? Am I misunderstanding the purpose of memoryBarrierShared()? Why should barrier() work?
#version 430
#define SIZE 64
layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;
layout(rgba32f) uniform readonly image2D inImage;
uniform writeonly image2D outImage;
shared vec4 shared_data[SIZE];
void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    ivec2 my_index = base + ivec2(gl_LocalInvocationID.x, 0);

    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i, 0));
        }
    }

    // with no synchronization: stripes
    // memoryBarrier();       // stripes
    // memoryBarrierShared(); // stripes
    // barrier();             // works

    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);
}

The problem with image load/store and friends is that the implementation can no longer be sure that a shader only changes its dedicated output values (e.g. the framebuffer after a fragment shader). This applies even more to compute shaders, which don't have a dedicated output at all and only produce results by writing into writable stores such as images, storage buffers, or atomic counters. This may require manual synchronization between individual passes, since otherwise a fragment shader trying to access a texture might not see the most recent data written into that texture with image store operations by a preceding pass, such as your compute shader.
So it may be that your compute shader works perfectly, and it is the synchronization with the following display pass (or whichever pass reads that image data) that fails. For this purpose there is the glMemoryBarrier function. Depending on how the pass after the compute shader reads the image data, you need to pass a different flag to this function: if it reads the image through a texture, use GL_TEXTURE_FETCH_BARRIER_BIT; if it uses an image load again, use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT; if it uses glBlitFramebuffer for display, use GL_FRAMEBUFFER_BARRIER_BIT...
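As a rough host-side sketch (not code from your question; the program, texture, and size names are placeholders), the ordering for a compute pass whose output image is later sampled as a texture would look something like this:

    glUseProgram(computeProgram);                     // your copy/post-processing pass
    glBindImageTexture(1, outputTex, 0, GL_FALSE, 0,  // outImage (binding unit is a placeholder)
                       GL_WRITE_ONLY, GL_RGBA32F);
    glDispatchCompute(imageWidth / 64, imageHeight, 1);

    // make the image stores visible to the texture fetches of the next pass;
    // use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT if that pass uses imageLoad instead,
    // or GL_FRAMEBUFFER_BARRIER_BIT if it blits the texture to the screen
    glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);

    glUseProgram(displayProgram);                     // pass that samples outputTex
    // ... bind outputTex as a texture and draw ...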
I don't have much experience with image load/store and manual memory synchronization, though, so this is only what I came up with theoretically; if anyone knows better, or you already use a proper glMemoryBarrier, feel free to correct me. Likewise, this need not be your only error (if there is one at all). But the last two points from the linked Wiki article actually address your use case and, IMHO, make it clear that you need some kind of glMemoryBarrier:
Data written to image variables in one rendering pass and read by the shader in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the SHADER_IMAGE_ACCESS_BARRIER_BIT set in barriers between passes is necessary.
Data written by the shader in one rendering pass and read by another mechanism (e.g., vertex or index buffer pulling) in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the appropriate bits set in barriers between passes is necessary.
EDIT: Actually the Wiki article on compute shaders says
Shared variable access uses the rules for incoherent memory access. This means that the user must perform certain synchronization in order to ensure that shared variables are visible.
Shared variables are all implicitly declared coherent, so you don't need to (and can't use) that qualifier. However, you still need to provide an appropriate memory barrier.
The usual set of memory barriers is available to compute shaders, but they also have access to memoryBarrierShared(); this barrier is specifically for shared variable ordering. groupMemoryBarrier() acts like memoryBarrier(), ordering memory writes for all kinds of variables, but it only orders read/writes for the current work group.
While all invocations within a work group are said to execute "in parallel", that doesn't mean that you can assume that all of them are executing in lock-step. If you need to ensure that an invocation has written to some variable so that you can read it, you need to synchronize execution with the invocations, not just issue a memory barrier (you still need the memory barrier though).
To synchronize reads and writes between invocations within a work group, you must employ the barrier() function. This forces an explicit synchronization between all invocations in the work group. Execution within the work group will not proceed until all other invocations have reached this barrier. Once past the barrier(), all shared variables previously written across all invocations in the group will be visible.
So this actually sounds like you need the barrier there, and memoryBarrierShared() alone is not enough (though you don't need both, as the last sentence of the quote says). The memory barrier only synchronizes memory; it doesn't stop the execution of the threads from crossing it. So the threads won't read stale cached data from shared memory if the first thread has already written something, but they can very well reach the point of reading before the first thread has tried to write anything at all.
This also fits perfectly with the fact that it works for work group sizes of 32 and below, and that the first 32 pixels come out right. On NVIDIA hardware, at least, 32 is the warp size and thus the number of threads that operate in lock-step. So the first 32 threads (well, every block of 32 threads) always work exactly in parallel (conceptually, at least) and therefore cannot introduce any race conditions. This is also why you don't actually need any synchronization if you know you work inside a single warp, which is a common optimization.
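Concretely, in your shader the synchronization point just has to be a real execution barrier between the shared writes and the shared reads; a minimal sketch of that spot (the rest of the shader stays exactly as posted):

    // invocation 0 fills shared_data[0..SIZE-1] ...

    barrier(); // execution sync: nobody proceeds until every invocation (including
               // the writer) has reached this point; for shared variables the
               // barrier also makes the previous writes visible

    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);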

Related

How to discard an invocation in GLSL compute shader

I'm trying to implement forward+ rendering using compute shaders in GLSL 4.6, but I don't know how to synchronize threads within a work group when working with off-screen pixels. For example, my window resolution is 1600x900 and I'm using a work group size of 16x16, where each thread or invocation corresponds to a single pixel on the screen. This means that size_x = 1600/16 = 100 and size_y = 900/16 = 56.25, so I need to call
glDispatchCompute(100, 57, 1);
As you can see, some threads in a work group may represent pixels that extend beyond the screen, so I want to return early or discard these off-screen pixels to skip the complex computation. However, my compute shader also contains a barrier() call in several places in order to synchronize local threads, and I don't know how to combine the two. The documentation says
For any given static instance of barrier in a compute shader, all invocations within a single work group must enter it before any are allowed to continue beyond it.
......
Barriers are also disallowed after a return statement
The only workaround I can think of is to fake computations for these threads, or to use if/else so that they finish early in an intermediate stage between two barrier() calls (sketched below). I guess this will introduce a small performance penalty. So, is there a better way to rule out invalid threads in a work group? I believe this problem is quite common for compute shaders, so there might be an idiomatic way of handling it.
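To make it concrete, the if/else workaround I have in mind looks roughly like this (a sketch only; the resources and names are made up):

    #version 460
    layout(local_size_x = 16, local_size_y = 16) in;

    // made-up resources, just to illustrate the control flow
    layout(binding = 0, rgba32f) uniform readonly image2D gBuffer;
    uniform ivec2 screenSize;           // e.g. ivec2(1600, 900)

    shared uint tileLightCount;

    void main()
    {
        ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
        bool onScreen = all(lessThan(pixel, screenSize));

        if (gl_LocalInvocationIndex == 0u)
            tileLightCount = 0u;
        barrier();                      // reached by every invocation, on-screen or not

        if (onScreen) {
            // expensive per-pixel work only for valid pixels
            vec4 data = imageLoad(gBuffer, pixel);
            // ... light culling etc. ...
        }
        barrier();                      // again reached by all invocations

        if (onScreen) {
            // final shading / output for valid pixels
        }
    }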

If a `buffer` is `coherent`, is there any difference between reading a field or doing `atomicAdd(field, 0)`?

This is with Vulkan semantics, if it makes any difference.
Assume the following:
layout(...) coherent buffer B
{
    uint field;
} b;
Say the field is being modified by other invocations of the same shader (or a derived shader) through atomic*() functions.
If a shader invocation wants to perform an atomic read from this field (with the same semantics as atomicCounter() in GLES, had this been an atomic_uint instead), is there any difference between the following two (other than obviously that one of them does a write as well as read)?
uint read_value = b.field;
uint read_value2 = atomicAdd(b.field, 0);
To directly answer the question, those two lines of code generate different instructions, with differing performance characteristics and hardware pipeline usage.
uint read_value = b.field; // generates a load instruction
uint read_value2 = atomicAdd(b.field, 0); // generates an atomic instruction
AMD disassembly can be seen in the online Shader Playground: buffer_load_dword versus buffer_atomic_add.
For NVIDIA, see "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking": LDG versus ATOM.
The GLSL spec, section 4.10 "Memory Qualifiers", makes the point that coherent only concerns the visibility of reads and writes across invocations (shader threads). It also comments on the performance implications:
When accessing memory using variables not declared as coherent, the memory accessed by a shader may be cached by the implementation to service future accesses to the same address. Memory stores may be cached in such a way that the values written might not be visible to other shader invocations accessing the same memory. The implementation may cache the values fetched by memory reads and return the same values to any shader invocation accessing the same memory, even if the underlying memory has been modified since the first memory read. While variables not declared as coherent might not be useful for communicating between shader invocations, using non-coherent accesses may result in higher performance.
The point-of-coherence in GPU memory systems is usually the last-level cache (L2 cache), meaning all coherent accesses must be performed by the L2 cache. This also means coherent buffers cannot be cached in L1 or other caches closer to the shader processors. Modern GPUs also have dedicated atomic hardware in the L2 caches; a plain load will not use those, but an atomicAdd(..., 0) will go through those. The atomic hardware usually has lower bandwidth than the full L2 cache.
SPIR-V has an OpAtomicLoad instruction. Presumably, there is at least one piece of hardware in which non-atomic loads cannot replace an atomic load no matter what qualifier the buffer descriptor has.
Unfortunately, there is no Vulkan GLSL construct that can translate to OpAtomicLoad that I'm aware of.

Occlusion Queries and Instanced Rendering

I'm facing a problem where the use of an occlusion query in combination with instanced rendering would be desirable.
As far as I understand, something like
glBeginQuery(GL_ANY_SAMPLES_PASSED, occlusionQuery);
glDrawArraysInstanced(mode, i, j, countInstances);
glEndQuery(GL_ANY_SAMPLES_PASSED);
will only tell me whether any of the instances were drawn.
What I would need to know is which instances have been drawn (i.e. the IDs of all visible instances). Drawing each instance in its own call is not an option for me.
An alternative would be to color-code the instances and detect the visible instances manually.
But is there really no way to solve this problem with a query command, and why would that not be possible?
It's not possible for several reasons.
Query objects only contain a single counter value. What you want would require a separate sample passed count for each instance.
Even if query objects stored arrays of sample counts, you can issue more than one draw call in the begin/end scope of a query. So how would OpenGL know which part of which draw call belonged to which query value in the array? You can even change other state within the query scope; uniform bindings, programs, pretty much anything.
The samples-passed count is determined entirely by the rasterizer hardware on the GPU. And the rasterizer neither knows nor cares which instance generated a triangle.
Instancing is a function of the vertex processing and/or vertex specification stages; by the time the rasterizer sees it, that information is gone. Notice that fragment shaders don't even get an instance ID as input, unless you explicitly create one by passing it from your vertex processing stage(s).
However, if you truly want to do this you could use image load/store and its atomic operations. That is, pass the fragment shader the instance in question (as an int data type, with flat interpolation). This FS also uses a uimageBuffer buffer texture, which uses the GL_R32UI format (or you can use an SSBO unbounded array). It then performs an imageAtomicAdd, using the instance value passed in as the index to the buffer. Oh, and you'll need to have the FS explicitly require early tests, so that samples which fail the fragment tests will not execute.
Then use a compute shader to build up a list of rendering commands for the instances which have non-zero values in the array. Then use an indirect rendering call to draw the results of this computation. Now obviously, you will need to properly synchronize access between these various operations. So you'll need to use appropriate glMemoryBarrier calls between each one.
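A rough sketch of the shader side of that approach (the buffer name, bindings, and formats below are my own assumptions, not something mandated by the API):

    // vertex shader: forward the instance ID to the fragment stage
    #version 430
    layout(location = 0) in vec3 position;
    uniform mat4 mvp;
    flat out int instanceID;

    void main()
    {
        instanceID = gl_InstanceID;
        gl_Position = mvp * vec4(position, 1.0);
    }

    // fragment shader: mark this instance as visible
    #version 430
    layout(early_fragment_tests) in;   // fragments failing depth/stencil never get here
    flat in int instanceID;

    // one uint per instance, cleared to zero before the draw call
    layout(r32ui) coherent uniform uimageBuffer visibleInstances;

    layout(location = 0) out vec4 color;

    void main()
    {
        imageAtomicAdd(visibleInstances, instanceID, 1u);
        color = vec4(0.0);             // or whatever the pass normally outputs
    }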
Even if queries worked the way you want them to, this approach would still be far preferable to using a query object. Unless you're reading the query into a buffer object, reading a query object requires some form of GPU/CPU synchronization. The approach above needs some synchronization and barrier operations, but they're all on-GPU operations rather than a synchronization with the CPU.

glsl compute shader - synchronization

I can define a shared data structure (for example an array):
shared float data[gl_WorkGroupSize.x];
for each workgroup. Execution order inside a workgroup is undefined, so at some point I may need to synchronize all threads that use the shared array; for example, all threads have to write some data to the shared array before the calculations start. I found two ways to achieve this:
OpenGL SuperBible:
barrier();
memoryBarrierShared();
OpenGL 4 Shading Language Cookbook:
barrier();
Should I call memoryBarrierShared() after barrier()? Could you give me some practical examples of when I can use memoryBarrierShared() or memoryBarrier() without using barrier()?
Memory barriers ensure visibility in otherwise incoherent memory access.
What this really means is that an invocation of your compute shader will not be allowed to attempt some sort of optimization that would read and/or write cached memory.
Writing to something like a Shader Storage Buffer is an example of ordinarily incoherent memory access: without a memory barrier, changes made in one invocation are only guaranteed to be visible within that invocation. Other invocations are allowed to maintain their own cached view of the memory unless you tell the GLSL compiler to enforce coherent memory access and where to do so (memoryBarrier*()).
There is a serious caveat here: visibility is only half of the equation. Forcing coherent memory access when the shader is compiled does nothing to solve actual execution-order issues across the threads of a work group. To make sure that all invocations in a work group have finished processing up to a certain point in your shader, you must use barrier().
Consider the following compute shader pseudo-code:
#version 450
layout (local_size_x = 128) in;

shared float foobar [128]; // shared implies coherent

void main (void)
{
    foobar [gl_LocalInvocationIndex] = 0.0;

    memoryBarrierShared (); // Ensure change to foobar is visible in other invocations
    barrier ();             // Stall until every thread is finished clearing foobar

    // At this point, _every_ index (0-127) of `foobar` will have the value **0.0**.
    // Without the barrier, and just the memory barrier, the contents of everything
    // but foobar [gl_LocalInvocationIndex] would be undefined at this point.
}
Outside of GLSL, there are also barriers at the GL command level (glMemoryBarrier (...)). You would use those in situations where you need a compute shader to finish executing before GL is allowed to do something that depends on its results.
In the traditional render pipeline GL can implicitly figure out which commands must wait for others to finish (e.g. glReadPixels (...) stalls until all commands finish writing to the framebuffer). However, with compute shaders and image load/store, implicit synchronization no longer works and you have to tell GL which pipeline memory operations must be finished and visible to the next command.
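For example (a sketch with made-up names), when a compute shader fills a buffer that a later draw call sources as vertex attributes:

    glUseProgram(particleUpdateProgram);                 // compute shader writing an SSBO
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, particleBuffer);
    glDispatchCompute(particleCount / 128, 1, 1);

    // the next draw reads particleBuffer through vertex attribute fetches,
    // so that is the kind of access the barrier has to cover
    glMemoryBarrier(GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);

    glUseProgram(renderProgram);
    glBindVertexArray(particleVAO);                      // sources particleBuffer as attributes
    glDrawArrays(GL_POINTS, 0, particleCount);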

GLSL coherent imageBuffer access in single-stage (fragment shader), single-pass scenario

I have a single fragment shader that performs processing on an imageBuffer using image load/store operations.
I am exclusively concerned about the following scenario:
I have a single fragment shader (no multi-stage considerations, e.g. vertex then fragment shaders, and no multi-pass rendering).
The imageBuffer variables are declared as coherent; I am exclusively interested in coherent imageBuffers.
To make things perfectly clear, my scenario is the following:
// Source code of my sole and unique fragment shader:
coherent layout(r32ui) uniform uimageBuffer data;

void main()
{
    ...
    various calls to imageLoad(data, ...);
    ...
    various calls to imageStore(data, ..., ...);
    ...
}
I have looked closely at the ARB_shader_image_load_store spec, especially this very paragraph:
"Using variables declared as "coherent" guarantees that the results of
stores will be immediately visible to shader invocations using
similarly-declared variables; calling MemoryBarrier is required to
ensure that the stores are visible to other operations."
Note: my "coherent uniform imageBuffer data;" declaration precisely is a "similarly-declared" variable. My scenario is single-pass, single-stage (fragment shader).
Now, I have looked at various web sites and stumbled (like most people I think) upon this thread on stackoverflow.com:
How exactly is GLSL's "coherent" memory qualifier interpreted by GPU drivers for multi-pass rendering?
and more specifically, this paragraph:
"Your shaders cannot even make the assumption that issuing a load
right after a store will get the memory that was just stored in this
very shader (yes really. You have to put a memoryBarrier in to pull
that one off)."
My question is the following:
With the coherent qualifier specified, in my single-shader, single-pass processing scenario, can I yes or no be sure that imageStore()'s will be immediately visible to ALL invocations of my fragment shader (eg. the current invocation as well as other concurrent invocations)?
By reading the ARB_shader_image_load_store spec, it seems to me that:
the answer to this question is yes,
I don't need any kind of memoryBarrier(),
the quoted sentence in the above referenced thread in stackoverflow may indeed be misleading and wrong.
Thanks for your insight.
Use that memory barrier.
For one thing, the GPU may optimize by fetching whole blocks of memory to read FROM while keeping separate memory to write TO.
In other words, if your shader always modifies a single location just once, then it's fine; but if it relies on its neighbours' values after some computation has been applied, then you need a memory barrier.
With the coherent qualifier specified, in my single-shader, single-pass processing scenario, can I yes or no be sure that imageStore()'s will be immediately visible to ALL invocations of my fragment shader (eg. the current invocation as well as other concurrent invocations)?
If each fragment shader writes to separate locations in the image, and each fragment shader only reads the locations that it wrote, then you don't even need coherent. However, if a fragment shader instance wants to read data written by other fragment shader instances, you're SOL. There's nothing you can do for that one.
If it were a compute shader, you could issue the barrier call to synchronize operations within a work group. That would ensure that the writes you want to read happen (you still need the memoryBarrier call to make them visible). But that would only ensure that writes from instances within this work group have happened. Writes from other instances are still undefined.
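For illustration, a minimal compute-shader sketch of that pattern (the binding and the r32ui format are assumptions):

    #version 430
    layout(local_size_x = 64) in;
    layout(r32ui) coherent uniform uimageBuffer data;

    void main()
    {
        int i = int(gl_GlobalInvocationID.x);

        imageStore(data, i, uvec4(uint(i)));  // each invocation writes its own texel

        memoryBarrier(); // make the image store visible to other invocations
        barrier();       // and wait until every invocation in the work group has stored

        // reading a neighbour's texel is now well-defined, but only if that neighbour
        // belongs to the same work group; writes from other work groups are still
        // not guaranteed to have happened
        uint left = imageLoad(data, max(i - 1, 0)).x;
        // ...
    }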
and more specifically, this paragraph:
BTW, that paragraph was wrong. Very very wrong. Too bad the person who wrote that paragraph will never ever be identified ;)