I use an atomic counter in a compute shader: an atomic_uint bound to a dynamic GL_ATOMIC_COUNTER_BUFFER (set up much like the Lighthouse3D opengl-atomic-counter tutorial).
I'm using the atomic counter in a particle system to check that a condition has been reached for all particles; I expect to see counter == numParticles when all of the particles are in the correct place.
I map the buffer each frame and check if the atomic counter has counted all of the particles:
GLuint *ptr = (GLuint *) glMapBuffer( GL_ATOMIC_COUNTER_BUFFER, GL_READ_ONLY );
GLuint particleCount = ptr[ 0 ];
glUnmapBuffer( GL_ATOMIC_COUNTER_BUFFER );
if( particleCount == numParticles() ){ // do stuff }
On a single-GPU host the code works fine and particleCount always reaches numParticles(), but on a multi-GPU host particleCount never reaches numParticles().
I can visually confirm that the condition has been reached, so the test should pass; however, particleCount changes from frame to frame, going up and down, yet never reaching numParticles().
I have tried an OpenGL memory barrier with GL_ATOMIC_COUNTER_BARRIER_BIT before mapping the buffer to read particleCount:
glMemoryBarrier(GL_ATOMIC_COUNTER_BARRIER_BIT);
GLuint *ptr = (GLuint *) glMapBuffer( GL_ATOMIC_COUNTER_BUFFER, GL_READ_ONLY );
GLuint particleCount = ptr[ 0 ];
glUnmapBuffer( GL_ATOMIC_COUNTER_BUFFER );
if( particleCount == m_particleSystem->numParticles() )
{ // do stuff }
and I've tried a GLSL barrier before incrementing the counter in the compute shader:
memoryBarrierAtomicCounter();
atomicCounterIncrement( particleCount );
but the atomic counter doesn't seem to synchronise across devices.
What is the correct way to synchronise so that the atomic counter works with multiple devices?
Your choice of memory barrier is actually inappropriate in this situation.
That barrier (GL_ATOMIC_COUNTER_BARRIER_BIT) would make changes to the atomic counter visible (e.g. flush caches and run shaders in a specific order), but what it does not do is make sure that any concurrent shaders are complete before you map, read and unmap your buffer.
Since your buffer is being mapped and read back, you do not need that barrier - that barrier is for coherency between shader passes. What you really need is to ensure all shaders that access your atomic counter are finished before you try to read data using a GL command, and for this you need GL_BUFFER_UPDATE_BARRIER_BIT.
GL_BUFFER_UPDATE_BARRIER_BIT:
Reads/writes via glBuffer(Sub)Data, glCopyBufferSubData, glProgramBufferParametersNV, and glGetBufferSubData, or to buffer object memory mapped by glMapBuffer(Range) after the barrier will reflect data written by shaders prior to the barrier.
Additionally, writes via these commands issued after the barrier will wait on the completion of any shader writes to the same memory initiated prior to the barrier.
You may be thinking about barriers from the wrong perspective. The barrier you need depends on which type of operation the memory read needs to be coherent to.
I would suggest brushing up on the incoherent memory access use cases:
(1) Shader write/read between rendering commands
One rendering command writes incoherently, and the other reads. There is no need for the coherent GLSL qualifier here at all. Just use glMemoryBarrier before issuing the reading rendering command, with the appropriate access bit.
(2) Shader writes, other OpenGL operations read
Again, coherent is not necessary. You must use a glMemoryBarrier before performing the read, using a bitfield that is appropriate to the reading operation of interest.
In case (1), the barrier you want is in fact GL_ATOMIC_COUNTER_BARRIER_BIT, because it will enforce strict memory and execution order rules between different shader passes that share the same atomic counter.
In case (2), the barrier you want is GL_BUFFER_UPDATE_BARRIER_BIT. The "reading operation of interest" is glMapBuffer (...) and as shown above, that is covered under GL_BUFFER_UPDATE_BARRIER_BIT.
In your situation, you are reading the buffer back using the GL API. You need GL commands to wait for all pending shaders to finish writing (this does not happen automatically for incoherent memory access - image load/store, atomic counters, etc.). That is textbook case (2).
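Applied to the code in the question, a minimal sketch of the corrected read-back might look like this (assuming the counter buffer is already bound to GL_ATOMIC_COUNTER_BUFFER):

glMemoryBarrier( GL_BUFFER_UPDATE_BARRIER_BIT ); // wait for all pending shader writes before mapping
GLuint *ptr = (GLuint *) glMapBuffer( GL_ATOMIC_COUNTER_BUFFER, GL_READ_ONLY );
GLuint particleCount = ptr[ 0 ];
glUnmapBuffer( GL_ATOMIC_COUNTER_BUFFER );
if( particleCount == numParticles() ){ // do stuff }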
Related
I have a piece of code that looks like this:
Core 0:
memset64(buffer, 0xFFFFFFFFFFFFFFFFULL, 4096);
position.atomic_set(1);
buffer is a pointer pointing to a 4KB buffer in DDR. position is an atomic int variable.
See https://elixir.bootlin.com/linux/v4.1/source/arch/arm64/include/asm/atomic.h
Core 1:
while (1) {
    if (position.atomic_read() == 1) {
        break;
    }
}
*buffer = "4KB data";
There are two threads working on two cores simultaneously.
Core 0 will memset the buffer first, then it will set the atomic variable position to 1.
Core 1 will atomic read the variable position in an infinite loop, if the value of position is 1, then it breaks and write something to the buffer.
I ran into a bug: after the program finished, the value of position was 1; however, the buffer's content was full of 0xFF.
It seems one possible root cause is that, in core 1's view, the atomic variable is set to 1 before the memset completes, so core 1's modification to the buffer was overwritten by core 0's memset.
I'm compiling the code with arm64 gcc 8.3 with -O2 optimization. The ELF will run on a 2-cluster, 8-core ARM Cortex-A53 CPU.
I wonder: on AArch64, is there any possibility that the atomic_set() is executed before the memset64() on the CPU? Or is it possible that another CPU core sees the atomic value change before it sees the memset?
I'm new to systems programming, so I'm not very familiar with concepts like memory models. I would be very grateful if anyone could give me some suggestions.
In the current thread (the one that invokes the two lines of code above), subsequent reads of values in buffer, or of position, will behave as expected. So if your program is single-threaded, you can stop now.
However, other threads that poll position and observe new_pos may then read from buffer without seeing the values the original thread wrote; they may read "old" values instead. This is sometimes referred to as a memory ordering or cache coherency issue. It is perfectly normal, especially in multicore and multiprocessor systems, and rarely an issue on single-core systems.
You need to introduce a memory barrier to force the synchronization. The easiest way is to just use a mutex for reading and writing those values.
That is, in the original thread:
{
    std::lock_guard<std::mutex> lock(mut); // mut is a shared instance of std::mutex
    memset64(buffer, 0xFFFFFFFFFFFFFFFFULL, 4096);
    position.atomic_set(new_pos);
}
Then "read" position and buffer from the other thread by taking the lock first.
{
    std::lock_guard<std::mutex> lock(mut); // same instance of mut as above
    auto pos = position.atomic_get();
    if (pos != last_pos)
    {
        last_pos = pos;
        // read values of buffer while inside the lock
    }
}
A side effect of using the mutex is that it may make the atomic primitives unnecessary.
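If you would rather avoid a lock, the same guarantee can be expressed with release/acquire ordering on the atomic itself. A minimal sketch using std::atomic (the names and the plain memset stand in for the poster's memset64 and atomic wrapper, so treat them as illustrative):

#include <atomic>
#include <cstring>

std::atomic<int> position{ 0 };
unsigned char buffer[4096];

void core0() // writer
{
    std::memset( buffer, 0xFF, sizeof buffer );
    // The release store guarantees the memset is visible to any
    // thread that observes position == 1 with an acquire load.
    position.store( 1, std::memory_order_release );
}

void core1() // reader
{
    // The acquire load pairs with the release store above.
    while( position.load( std::memory_order_acquire ) != 1 ) { /* spin */ }
    // Safe here: the memset'd contents of buffer are visible,
    // and this thread may now overwrite the buffer.
}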
Can I run this code in a loop without reading results from the SSBO, and only read the SSBO results after 100 iterations?
for (int i = 0; i < 100; i++) {
    glDispatchCompute(1, 200, 1);
    glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT); // I understand this is needed to ensure
    // the GPU is done running the GLSL code from the previous iteration
}
Also, will the GLSL code executed the second time within the loop (i == 1) see the results the first execution (i == 0) wrote to the SSBO?
Finally, do I really need the glMemoryBarrier call inside the loop, or can it go outside the loop? I am concerned that the GPU code, when executed the second time, will not see the changes the first iteration made to the SSBO.
1) Yes, you can run your shader multiple times without reading the contents of the buffer you are writing to, and read them at the end (this is a very common practice in iterative GPU sorting algorithms).
2) If you are reading/writing the same buffer, yes, the writes will be visible.
3) Yes, you need a barrier; otherwise the next compute shader dispatch will be launched without waiting for the previous one to finish, which will lead to wrong results (as you are concerned), if not crashes. However, the barrier type depends on what you are doing within your shader. Here is a full list of barriers:
https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glMemoryBarrier.xhtml
Most probably, if you are focusing on reading/writing an SSBO, you should use GL_SHADER_STORAGE_BARRIER_BIT, but if you are not sure, you can just use GL_ALL_BARRIER_BITS.
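Putting that together, a sketch of the loop from the question using the storage-buffer bit (and, per the read-back discussion above, a different bit before mapping the buffer with the GL API):

for (int i = 0; i < 100; i++) {
    glDispatchCompute(1, 200, 1);
    // make this dispatch's SSBO writes visible to the next dispatch
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}
// before reading the SSBO back via glMapBuffer/glGetBufferSubData
glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);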
I'm using an atomic counter which has its own buffer, and I want to clear that counter in some other pass. Is it good practice to bind the buffer as a shader storage buffer to clear it, and then use it as an atomic counter buffer in a second pass? I would also like to ask whether it is OK to use the same buffer as a shader storage buffer and as an atomic counter buffer at the same time in the same shader; say, the 4 bytes at the start are intended only for the atomic counter, while the rest of the buffer holds other data that is read/modified.
You can use the same buffer with different targets, but you have to manage the alignment requirements yourself (all the glGet parameters with ALIGNMENT in their names).
Then you can invalidate the range with glInvalidateBufferSubData as a performance hint (the GPU doesn't have to preserve content you are about to clear) and clear it with glClearBufferSubData.
For better performance I would advise double or triple buffering the atomic counters, or any data that is cleared or updated often.
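For example, clearing just the counter's 4 bytes might look like this (counterBuffer is a placeholder name for the buffer object):

// hint: the old counter value does not need to be preserved
glInvalidateBufferSubData(counterBuffer, 0, sizeof(GLuint));

// write zero into the first 4 bytes
GLuint zero = 0;
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, counterBuffer);
glClearBufferSubData(GL_ATOMIC_COUNTER_BUFFER, GL_R32UI, 0, sizeof(GLuint),
                     GL_RED_INTEGER, GL_UNSIGNED_INT, &zero);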
I can define a shared data structure (for example an array) for each workgroup:
shared float data[gl_WorkGroupSize.x];
Execution order inside a workgroup is undefined, so at some point I may need to synchronize all threads that use the shared array; for example, all threads have to write some data to the shared array before calculations start. I found two ways to achieve this:
OpenGL SuperBible:
barrier();
memoryBarrierShared();
OpenGL 4 Shading Language Cookbook:
barrier();
Should I call memoryBarrierShared() after barrier()? Could you give me some practical examples of when I can use memoryBarrierShared() or memoryBarrier() without using barrier()?
Memory barriers ensure visibility in otherwise incoherent memory access.
What this really means is that an invocation of your compute shader will not be allowed to attempt some sort of optimization that would read and/or write cached memory.
Writing to something like a shader storage buffer is an example of ordinarily incoherent memory access: without a memory barrier, changes made in one invocation are only guaranteed to be visible within that invocation. Other invocations are allowed to maintain their own cached view of the memory unless you tell the GLSL compiler to enforce coherent memory access and where to do so (memoryBarrier*()).
There is a serious caveat here: visibility is only half of the equation. Forcing coherent memory access when the shader is compiled does nothing to solve actual execution-order issues across threads in a workgroup. To make sure that all invocations in a workgroup have finished processing up to a certain point in your shader, you must use barrier().
Consider the following compute shader pseudo-code:
#version 450

layout (local_size_x = 128) in;

shared float foobar[128]; // shared implies coherent

void main (void)
{
    foobar[gl_LocalInvocationIndex] = 0.0;

    memoryBarrierShared(); // Ensure the change to foobar is visible in other invocations
    barrier();             // Stall until every thread is finished clearing foobar

    // At this point, _every_ index (0-127) of `foobar` will have the value 0.0.
    // Without the barrier, and just the memory barrier, the contents of everything
    // but foobar[gl_LocalInvocationIndex] would be undefined at this point.
}
Outside of GLSL, there are also barriers at the GL command level (glMemoryBarrier (...)). You would use those in situations where you need a compute shader to finish executing before GL is allowed to do something that depends on its results.
In the traditional render pipeline GL can implicitly figure out which commands must wait for others to finish (e.g. glReadPixels (...) stalls until all commands finish writing to the framebuffer). However, with compute shaders and image load/store, implicit synchronization no longer works and you have to tell GL which pipeline memory operations must be finished and visible to the next command.
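For example, a sketch of a compute pass whose image writes are later read through image load/store (the dispatch dimensions are illustrative):

glDispatchCompute(groupsX, groupsY, 1);
// block subsequent image load/store operations until the
// compute shader's image writes are complete and visible
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
// ... now issue the pass that reads the image via imageLoad() ...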
The following GLSL compute shader simply copies inImage to outImage. It is derived from a more complex post-processing pass.
In the first several lines of main(), a single thread loads 64 pixels of data into the shared array. Then, after synchronizing, each of the 64 threads writes one pixel to the output image.
Depending on how I synchronize, I get different results. I originally thought memoryBarrierShared() would be the correct call, but it produces a striped result, the same as having no synchronization at all or using memoryBarrier() instead. If I use barrier(), I get the desired (correctly copied) result.
The striping is 32 pixels wide, and if I change the workgroup size to anything less than or equal to 32, I get correct results.
What's going on here? Am I misunderstanding the purpose of memoryBarrierShared()? Why should barrier() work?
#version 430

#define SIZE 64

layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;

layout (rgba32f) uniform readonly image2D inImage;
uniform writeonly image2D outImage;

shared vec4 shared_data[SIZE];

void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    ivec2 my_index = base + ivec2(gl_LocalInvocationID.x, 0);

    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i, 0));
        }
    }

    // with no synchronization: stripes
    // memoryBarrier();       // stripes
    // memoryBarrierShared(); // stripes
    // barrier();             // works

    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);
}
The problem with image load/store and friends is that the implementation can no longer be sure that a shader only changes the data of its dedicated output values (e.g. the framebuffer after a fragment shader). This applies even more to compute shaders, which don't have a dedicated output and only produce results by writing data into writable stores, like images, storage buffers, or atomic counters. This may require manual synchronization between individual passes, as otherwise a fragment shader trying to access a texture might not see the most recent data written into that texture with image store operations by a preceding pass, like your compute shader.
So it may be that your compute shader works perfectly, but it is the synchronization with the following display (or whatever) pass (that needs to read this image data somehow) that fails. For this purpose there exists the glMemoryBarrier function. Depending on how you read that image data in the display pass (or more precisely the pass that reads the image after the compute shader pass), you need to give a different flag to this function. If you read it using a texture, use GL_TEXTURE_FETCH_BARRIER_BIT, if you use an image load again, use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT, if using glBlitFramebuffer for display, use GL_FRAMEBUFFER_BARRIER_BIT...
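As a sketch, if the pass after the compute shader samples the image as a texture, the sequence would be (dispatch arguments illustrative):

glDispatchCompute(width / 64, height, 1);
// the next pass reads through a sampler, so use the
// texture-fetch bit rather than the image-access bit
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);
// ... draw call that samples the texture ...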
That said, I don't have much experience with image load/store and manual memory synchronization, and this is only what I came up with theoretically. So if anyone knows better, or you already use a proper glMemoryBarrier, feel free to correct me. Likewise, this need not be your only error (if there is one). But the last two points from the linked Wiki article actually address your use case and, IMHO, make it clear that you need some kind of glMemoryBarrier:
Data written to image variables in one rendering pass and read by the shader in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the SHADER_IMAGE_ACCESS_BARRIER_BIT set in barriers between passes is necessary.
Data written by the shader in one rendering pass and read by another mechanism (e.g., vertex or index buffer pulling) in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the appropriate bits set in barriers between passes is necessary.
EDIT: Actually, the Wiki article on compute shaders says:

Shared variable access uses the rules for incoherent memory access. This means that the user must perform certain synchronization in order to ensure that shared variables are visible.

Shared variables are all implicitly declared coherent, so you don't need to (and can't use) that qualifier. However, you still need to provide an appropriate memory barrier.

The usual set of memory barriers is available to compute shaders, but they also have access to memoryBarrierShared(); this barrier is specifically for shared variable ordering. groupMemoryBarrier() acts like memoryBarrier(), ordering memory writes for all kinds of variables, but it only orders read/writes for the current work group.

While all invocations within a work group are said to execute "in parallel", that doesn't mean that you can assume that all of them are executing in lock-step. If you need to ensure that an invocation has written to some variable so that you can read it, you need to synchronize execution with the invocations, not just issue a memory barrier (you still need the memory barrier though).

To synchronize reads and writes between invocations within a work group, you must employ the barrier() function. This forces an explicit synchronization between all invocations in the work group. Execution within the work group will not proceed until all other invocations have reached this barrier. Once past the barrier(), all shared variables previously written across all invocations in the group will be visible.
So this actually sounds like you do need the barrier() there, and memoryBarrierShared() alone is not enough (though you don't need both, as the last sentence of the quote says). The memory barrier will just synchronize the memory, but it doesn't stop the threads' execution from crossing it. Thus the threads won't read any old cached data from the shared memory if the first thread has already written something, but they can very well reach the point of reading before the first thread has tried to write anything at all.
This actually fits perfectly with the fact that it works for block sizes of 32 and below, and that the first 32 pixels work. At least on NVIDIA hardware, 32 is the warp size, and thus the number of threads that operate in perfect lock-step. So the first 32 threads (well, every block of 32 threads) always work exactly in parallel (conceptually, that is), and thus cannot introduce any race conditions. This is also why you don't actually need any synchronization if you know your work happens inside a single warp, a common optimization.