I previously ran into a problem where I wanted to blend color values in an image unit by doing something like:
vec4 texelCol = imageLoad(myImage, myTexel);
imageStore(myImage, myTexel, texelCol+newCol);
In a scenario where multiple fragments can have the same value for 'myTexel', this apparently isn't possible, because one can't create atomicity between the imageLoad and imageStore commands, and other shader invocations could change the texel color in between.
Now someone told me that people are working around this problem by creating semaphores using the atomic commands on uint textures, such that the shader would wait in a while loop before accessing the texel; as soon as it is free, it would atomically write into the integer texture to block other fragment shader invocations, process the color texel, and when finished atomically free the integer texel again.
But I can't get my head around how this could really work and what such code would look like.
Is it really possible to do this? Can a GLSL fragment shader be made to wait in a while loop? If it's possible, can someone give an example?
Basically, you're just implementing a spinlock. Only instead of one lock variable, you have an entire texture's worth of locks.
Logically, what you're doing makes sense. But as far as OpenGL is concerned, this won't actually work.
See, the OpenGL shader execution model states that invocations execute in an order which is largely undefined relative to one another. But spinlocks only work if there is a guarantee of forward progress among the various threads. Basically, spinlocks require that the spinning thread cannot starve the thread it is waiting on of execution time.
OpenGL provides no such guarantee. Which means that it is entirely possible for one thread to lock a pixel, then stop executing (for whatever reason), while another thread comes along and blocks on that pixel. The blocked thread never stops executing, and the thread that owns the lock never restarts execution.
How might this happen in a real system? Well, let's say you have a fragment shader invocation group executing on some fragments from a triangle. They all lock their pixels. But then they diverge in execution due to a conditional branch within the locking region. Divergence of execution can mean that some of those invocations get transferred to a different execution unit. If there are none available at the moment, then they effectively pause until one becomes available.
Now, let's say that some other fragment shader invocation group comes along and gets assigned an execution unit before the divergent group. If that group tries to spinlock on pixels from the divergent group, it is essentially starving the divergent group of execution time, waiting on an event that will never happen.
Now obviously, in real GPUs there is more than one execution unit, but you can imagine that with lots of invocation groups out there, it is entirely possible for such a scenario to occasionally jam up the works.
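For reference, here is roughly what such a per-texel spinlock would look like in GLSL. This is a sketch only: the r32ui lock image 'myLock' and the rgba32f format of 'myImage' are assumptions, and as explained above OpenGL gives no forward-progress guarantee, so this pattern can hang on real hardware.

layout(rgba32f) coherent uniform image2D  myImage;  // color image being blended into (assumed format)
layout(r32ui)   coherent uniform uimage2D myLock;   // 0 = texel free, 1 = texel locked

void blendTexel(ivec2 myTexel, vec4 newCol)
{
    bool done = false;
    while (!done)
    {
        // Try to acquire the lock for this texel.
        if (imageAtomicCompSwap(myLock, myTexel, 0u, 1u) == 0u)
        {
            vec4 texelCol = imageLoad(myImage, myTexel);
            imageStore(myImage, myTexel, texelCol + newCol);
            memoryBarrier();                          // make the store visible
            imageAtomicExchange(myLock, myTexel, 0u); // release the lock
            done = true;
        }
    }
}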
Related
I'm hoping someone can fill a gap in my knowledge of multithreaded programming here.
I have used mutexes on many occasions to successfully share access to data-structures between threads. For example, I'll often feed a queue on the main thread that is consumed by a worker thread. A mutex wraps access to the queue as tightly as possible when the main thread enqueues something, or when the worker thread dequeues something. It works great!
Okay, now consider a different situation I found myself in recently. Multiple threads are rendering triangles to the same framebuffer/z-buffer pair. (This may be a bad idea on its face, but please overlook that for the moment.) In other words, the work-load of rendering a bunch of triangles is evenly distributed across all these threads that are all writing pixels to the same framebuffer and all checking Z-values against, and updating Z-values to, the same Z-buffer.
Now, I knew this would be problematic from the get-go, but I wanted to see what would happen. Sure enough, when drawing two quads, (one behind the other), some of the pixels from the background quad would occasionally bleed through the foreground quad, unless, of course, I only had one worker thread. So, to fix this problem, I decided to use a mutex. I knew this would be extremely slow, but I wanted to do it anyway just to demonstrate that I had a handle on what the problem really was. The fix was simple: just wrap access to the Z-buffer with a mutex. But to my great surprise, this didn't fix the problem at all! So the question is: why?!
I have a hypothesis, but it is a disturbing one. My guess is that at least one of two things is happening. First, a thread may write to the Z-buffer, but that write operation isn't necessarily flushed from CPU-memory back to Z-buffer memory when another thread goes to read it. Second, a thread may read from the Z-buffer, but do so in prefetched amounts that assume no other thread is writing to it. In either case, even if the mutex is doing its job correctly, there are still going to be cases where we're either reading the wrong Z-value or failing to write a Z-value.
What may support this hypothesis is that when I unnecessarily widened the mutex lock time, it not only made my rendering slower, it also appeared to fix the Z-buffer issue described above. Why? My guess is that the extra lock time made it more likely that the Z-buffer writes were flushed.
Anyhow, this is disturbing to me, because I don't know why this isn't a problem I've run into before with, for example, the simple queues I've been using to communicate between threads for years. Why wasn't the CPU lazy about flushing its cache with my linked-list pointers?
So I looked around for maybe ways to add a memory fence or flush a write operation or to make sure that a read operation always pulled from memory (e.g., by using the "volatile" keyword), but none of it was trivial or seemed to help.
What am I not understanding here? Do I really just have no idea what's going on? Thanks for any light you can shed on this.
The fix was simple: just wrap access to the Z-buffer with a mutex.
This is not enough -- the mutex needs to cover both the access to the Z-buffer and the update of the framebuffer, making the whole operation (check & update Z-buffer, conditionally update framebuffer) atomic. Otherwise there is a danger that the Z-buffer and framebuffer updates will "cross" and happen in the reverse order. Something like:
thread 1: check/update z buffer (hit -- pixel is closer than previous)
thread 2: check/update z buffer (hit -- pixel is closer than previous)
thread 2: update framebuffer
thread 1: update framebuffer
and you end up with thread 1's color in the framebuffer even though thread 2's pixel is closer in Z.
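In other words, the locking has to look roughly like this. This is a sketch; WIDTH, HEIGHT, zbuffer, framebuffer and zbufferMutex are placeholder names.

#include <cstdint>
#include <mutex>

constexpr int WIDTH = 640, HEIGHT = 480;         // placeholder dimensions
static float      zbuffer[HEIGHT][WIDTH];
static uint32_t   framebuffer[HEIGHT][WIDTH];
static std::mutex zbufferMutex;                  // shared by all rasterizer threads

void plotPixel(int x, int y, float z, uint32_t color)
{
    // The depth test AND both writes happen under one lock, so the whole
    // "check Z, update Z, update color" step is atomic.
    std::lock_guard<std::mutex> lock(zbufferMutex);
    if (z < zbuffer[y][x])
    {
        zbuffer[y][x]     = z;
        framebuffer[y][x] = color;
    }
}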
When running two Dispatch() calls consecutively, like:
m_computeCommandList->Dispatch(111, 1, 1);
m_computeCommandList->Dispatch(555, 1, 1);
Is it guaranteed that the second Dispatch() will run after the first Dispatch() on the GPU? Or, could they run concurrently on the GPU?
Just to clarify, there is no more C++ code in between those two Dispatch() calls.
As in other graphics APIs, when you issue command calls on the CPU side, the commands are recorded into a command queue. The queue guarantees that commands are processed in order, First-In-First-Out.
However, on the GPU everything becomes massively parallel and concurrent. We can't know on which processing unit the actual execution will be scheduled, or which threads from which Dispatch will finish earlier. Typically this isn't a problem if no resources (buffers, textures) are shared between invocations and we only need to synchronize at the end of the frame.
If resources are shared, there is a possibility of memory conflicts ("write-read", "write-write" or "read-write" hazards). Here we need to use resource barriers, which let us order access to these resources. Using different barrier options you can enforce consecutive execution of different Dispatch calls.
For example, a transition from D3D12_RESOURCE_STATE_UNORDERED_ACCESS to D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE|D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE will wait for ALL preceding Graphics and Compute shader execution to complete, and block ALL subsequent Graphics and Compute shader execution.
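Another common option, when both Dispatch calls access the same UAV, is a UAV barrier between them. This is a sketch; m_buffer is a placeholder ComPtr for the shared resource.

m_computeCommandList->Dispatch(111, 1, 1);

// UAV barrier: all UAV reads/writes issued by the first Dispatch must
// complete before any UAV access of the second Dispatch may begin.
D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type          = D3D12_RESOURCE_BARRIER_TYPE_UAV;
barrier.UAV.pResource = m_buffer.Get();   // placeholder for the shared UAV resource
m_computeCommandList->ResourceBarrier(1, &barrier);

m_computeCommandList->Dispatch(555, 1, 1);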
Enhanced barriers in DirectX 12 give you fine-grained control over resource and execution synchronization.
Suppose you have a compute shader where different work groups in the same dispatch are put in a continuous loop, and you want to signal all of them to exit that loop by any one of them setting a flag. Is this actually possible?
I've tried using a flag in an SSBO marked both coherent and volatile to trigger their exit, which sometimes doesn't seem to work on AMD. When one of the work groups wants to trigger all of them to exit, I simply set the flag from zero to one directly (no atomics, since it doesn't matter who sets it as long as someone does) and call memoryBarrier() afterwards.
From the documentation of memoryBarrier(), it seems to me that it guarantees eventual visibility of the write to other work groups within the same dispatch?
From the documentation of memoryBarrier(), it seems to me that it guarantees eventual visibility of the write to other work groups within the same dispatch?
Visibility, yes. But what about execution?
Do you have any guarantees that the work group that will trigger this shutdown will be executed alongside the looping work groups? No.
There are no ordering guarantees at all between invocations from different work groups. So it is entirely possible that the GPU will fill all of its execution time with the looping work groups. Which means that they will wait for a signal that will never come.
Compute shaders offer no guarantees that all threads will progress. And therefore, you cannot write code that assumes they will. The only tool to control the execution of compute shaders is barrier, and that only controls the execution of invocations within a work group.
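For reference, the pattern being described looks roughly like this. It is a sketch; the SSBO layout and the name exitFlag are illustrative. The point above is precisely that nothing guarantees the work group that sets the flag is ever scheduled while the other work groups spin on it.

#version 430
layout(local_size_x = 64) in;

// Sketch only: illustrative SSBO holding the shared exit flag.
layout(std430, binding = 0) volatile coherent buffer ControlBlock
{
    uint exitFlag;   // 0 = keep looping, 1 = everyone should exit
};

void main()
{
    while (exitFlag == 0u)
    {
        // ... do one chunk of work here ...

        // Suppose this invocation decides everyone should stop:
        if (gl_GlobalInvocationID.x == 0u)
        {
            exitFlag = 1u;     // plain store, no atomics, as described above
            memoryBarrier();   // request visibility to the other work groups
        }
    }
}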
So when you call OpenGL functions, like glDraw* or glBufferData, does it cause the thread of the program to stop and wait for GL to finish the calls?
If not, then how does GL handle calling important functions like glDraw, and then immediately afterwards having a setting changed that affects the draw calls?
No, they (mostly) do not. The majority of GL functions are buffered when called and actually executed later. This means that you cannot think of the CPU and the GPU as two processors working together at the same time. Usually, the CPU executes a bunch of GL functions that get buffered and, as soon as they are delivered to the GPU, the GPU executes them. This also means that you cannot reliably measure how long a specific GL function took to execute by just comparing the time before and after its execution.
If you want to do that, you need to first run a glFinish() so it will actually wait for all previously buffered GL calls to execute, and then you can start counting, execute the calls that you want to benchmark, call glFinish again to make sure these calls executed as well, and then finish the benchmark.
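A sketch of that kind of CPU-side timing (drawTheCallsToMeasure is a placeholder for whatever GL calls you want to benchmark, and a current GL context is assumed):

#include <chrono>
#include <GL/gl.h>   // or whatever loader header you use

// Sketch: measure how long a batch of GL calls takes to actually execute.
double benchmarkGLCallsMs(void (*drawTheCallsToMeasure)())
{
    glFinish();                                    // drain everything queued so far
    auto start = std::chrono::steady_clock::now();

    drawTheCallsToMeasure();                       // the calls being benchmarked
    glFinish();                                    // wait until they have really executed

    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}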
On the other hand, I said "mostly". This is because reading functions will actually NEED to synchronize with the GPU to show real results and so, in this case, they DO wait and freeze the main thread.
edit: I think the explanation itself answers the question you asked second, but just in case: the fact that all calls are buffered makes it possible for a draw to complete first, and then have a setting changed afterwards for successive calls.
It strictly depends on the OpenGL call in question and the OpenGL state. When you make OpenGL calls, the implementation first queues them up internally and then executes them asynchronously to the calling program's execution. One important concept in OpenGL is synchronization points. Those are operations in the work queue that require the OpenGL call to block until certain conditions are met.
OpenGL objects (textures, buffer objects, etc.) are purely abstract, and by specification the handle of an object in the client program always refers to the data that the object holds at the time the OpenGL functions referring to it are called. So take for example this sequence:
glBindTexture(GL_TEXTURE_2D, texID);
glTexImage2D(..., image_1);
draw_textured_quad();
glTexImage2D(..., image_2);
draw_textured_quad();
The first draw_textured_quad may return long before anything has been drawn. However, by making these calls, OpenGL creates an internal reference to the data currently held by the texture. So when glTexImage2D is called a second time, which may happen before the first quad has been drawn, OpenGL must internally create a secondary texture object that is to become texture texID and to be used by the second call of draw_textured_quad. If glTexSubImage2D were called instead, it would even have to make a modified copy of it.
OpenGL calls will only block if the result of the call modifies client-side memory and depends on data generated by previous OpenGL calls. In other words, when you make OpenGL calls, the implementation internally builds a dependency tree to keep track of what depends on what. And when a synchronization point must block, it will block at least until all its dependencies are met.
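A typical example of such a blocking call is a read back into client memory (a sketch; assumes a 640x480 framebuffer):

unsigned char pixels[640 * 480 * 4];   // client-side destination buffer

draw_textured_quad();                  // queued, returns immediately
glReadPixels(0, 0, 640, 480, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
// glReadPixels blocks here until the draw above has finished, because its
// result (written to client memory) depends on data from previous GL calls.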