OpenGL commands - sequential or parallel - c++

I'm reading this document
and I have a question about this sentence:
While OpenGL explicitly requires that commands are completed in order,
that does not mean that two (or more) commands cannot be concurrently
executing. As such, it is possible for shader invocations from one
command to be executing in tandem with shader invocations from other
commands.
Does this mean that, for example, when I issue two consecutive glDrawArrays calls, the second call may start being processed before the first one has finished?
My first idea was that the OpenGL calls merely map to internal commands of the GPU and that an OpenGL call returns immediately, without waiting for those commands to complete, thus enabling the second OpenGL call to issue its own internal commands. The internal commands created by the OpenGL calls can then be parallelized.

What it says is that the exact order in which the commands are executed, and any concurrency between them, is left to the judgement of the implementation. The only constraint is that the final result must look exactly as if all the commands had been executed one after another, in the very order they were called by the client program.
EDIT: Certain OpenGL calls cause an implicit or explicit synchronization, for example reading back pixels or waiting for a synchronization event.
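To make the "returns immediately, completes later, in order" idea concrete, here is a minimal sketch in plain C++. It is a toy model and not OpenGL: ToyContext, draw, read_back and completed are names made up for illustration. It runs the deferred commands strictly one after another, whereas a real implementation is free to overlap them as long as the result looks the same.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Toy model, not a real OpenGL API: "draw" calls only record work,
// and a read-back acts as the synchronization point.
class ToyContext {
public:
    // Like a draw call: records the command and returns immediately.
    void draw(int id) {
        pending_.push([this, id] { framebuffer_.push_back(id); });
    }

    // Like reading back pixels: all earlier commands must complete first.
    std::vector<int> read_back() {
        while (!pending_.empty()) {
            pending_.front()();  // FIFO: effects appear in submission order
            pending_.pop();
        }
        // Invariant of a synchronization point: nothing submitted earlier
        // is still pending once it returns.
        assert(pending_.empty());
        return framebuffer_;
    }

    // Number of commands whose effects are already visible.
    std::size_t completed() const { return framebuffer_.size(); }

private:
    std::queue<std::function<void()>> pending_;
    std::vector<int> framebuffer_;
};
```

In this model, two draw calls in a row both return before anything has been "rendered"; only read_back forces the queued work to finish, and the results still come out in the order the calls were made.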

Related

Do two consecutive DirectX 12 Dispatch() calls run sequentially or concurrently on the GPU?

When running two Dispatch() calls consecutively, like:
m_computeCommandList->Dispatch(111, 1, 1);
m_computeCommandList->Dispatch(555, 1, 1);
Is it guaranteed that the second Dispatch() will run after the first Dispatch() on the GPU? Or, could they run concurrently on the GPU?
Just to clarify, there is no more C++ code in between those two Dispatch() calls.
As in other graphics APIs, the command calls you make on the CPU side put commands into a command queue. The queue guarantees that commands are processed in queue order, first-in first-out.
However, on the GPU everything becomes massively parallel and concurrent. We can't know on which processing unit the actual execution will be scheduled, or which threads from which Dispatch will finish earlier. Typically this is not a problem if no resources (buffers, textures) are shared between invocations and we only need to synchronize at the end of the frame.
If resources are shared, memory conflicts are possible ("write-read", "write-write" or "read-write"). Here we need resource barriers, which let us organize access to these resources. With suitable barrier options you can enforce sequential execution of different Dispatch calls.
For example, a transition from D3D12_RESOURCE_STATE_UNORDERED_ACCESS to D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE|D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE will wait for ALL preceding Graphics and Compute shader execution to complete, and block ALL subsequent Graphics and Compute shader execution.
Enhanced barriers in DirectX 12 give you fine-grained control over resource and execution synchronization.
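As a rough illustration of what a barrier changes, here is a toy C++ model of a recorded compute command list. It is not the D3D12 API: Cmd and may_overlap are hypothetical names. In real D3D12 code the Barrier entry would correspond to a ResourceBarrier call recorded between the two Dispatch calls. The model also ignores which resource the barrier names, so it is coarser than real barriers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a recorded compute command list (not the D3D12 API).
enum class Cmd { Dispatch, Barrier };

// Returns true if the commands at indices a < b are both dispatches with no
// barrier recorded between them, i.e. the GPU would be free to overlap them.
bool may_overlap(const std::vector<Cmd>& list, std::size_t a, std::size_t b) {
    assert(a < b && b < list.size());
    if (list[a] != Cmd::Dispatch || list[b] != Cmd::Dispatch) return false;
    for (std::size_t i = a + 1; i < b; ++i) {
        if (list[i] == Cmd::Barrier) return false;  // barrier orders a before b
    }
    return true;
}
```

With the list {Dispatch, Dispatch} from the question, the model reports that the two dispatches may overlap; with {Dispatch, Barrier, Dispatch} it reports that they may not.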

Compute shader image and atomic coherency [duplicate]

I previously ran into the problem that I wanted to blend color values in an image unit by doing something like:
vec4 texelCol = imageLoad(myImage, myTexel);
imageStore(myImage, myTexel, texelCol+newCol);
In a scenario where multiple fragments can have the same value for 'myTexel', this apparently isn't possible, because there is no way to make the imageLoad/imageStore pair atomic, and other shader invocations could change the texel color in between.
Now someone told me that people work around this problem by creating semaphores using the atomic commands on uint textures: the shader would somehow wait in a while loop before accessing the texel and, as soon as it is free, atomically write into the integer texture to block other fragment shader invocations, process the color texel, and when finished atomically free the integer texel again.
But I can't get my head around how this could really work or what such code would look like.
Is it really possible to do this? Can a GLSL fragment shader be made to wait in a while loop? If it's possible, can someone give an example?
Basically, you're just implementing a spinlock. Only instead of one lock variable, you have an entire texture's worth of locks.
Logically, what you're doing makes sense. But as far as OpenGL is concerned, this won't actually work.
See, the OpenGL shader execution model states that invocations execute in an order which is largely undefined relative to one another. But spinlocks only work if there is a guarantee of forward progress among the various threads. Basically, spinlocks require that the thread which is spinning not be able to starve the execution system from executing the thread that it is waiting on.
OpenGL provides no such guarantee. Which means that it is entirely possible for one thread to lock a pixel, then stop executing (for whatever reason), while another thread comes along and blocks on that pixel. The blocked thread never stops executing, and the thread that owns the lock never restarts execution.
How might this happen in a real system? Well, let's say you have a fragment shader invocation group executing on some fragments from a triangle. They all lock their pixels. But then they diverge in execution due to a conditional branch within the locking region. Divergence of execution can mean that some of those invocations get transferred to a different execution unit. If there are none available at the moment, then they effectively pause until one becomes available.
Now, let's say that some other fragment shader invocation group comes along and gets assigned an execution unit before the divergent group. If that group tries to spinlock on pixels from the divergent group, it is essentially starving the divergent group of execution time, waiting on an event that will never happen.
Now obviously, in real GPUs there is more than one execution unit, but you can imagine that with lots of invocation groups out there, it is entirely possible for such a scenario to occasionally jam up the works.
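For comparison, a read-modify-write that never holds a lock avoids the starvation problem described above, because no invocation ever waits on another one to release anything. The following is a CPU-side analogy in C++ and not shader code. It assumes the color is packed as RGBA8 into one 32-bit word, and add_saturated and blend_add are made-up names. The closest GLSL counterpart would be a retry loop around imageAtomicCompSwap on an r32ui image, which is a different setup from the vec4 image in the question.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Adds two RGBA8 colors packed into 32 bits, saturating each channel at 255.
std::uint32_t add_saturated(std::uint32_t a, std::uint32_t b) {
    std::uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        std::uint32_t sum = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu);
        if (sum > 0xFFu) sum = 0xFFu;
        out |= sum << shift;
    }
    return out;
}

// Lock-free read-modify-write: retry until no other thread changed the texel
// between our read and our write. Nobody ever owns the texel.
void blend_add(std::atomic<std::uint32_t>& texel, std::uint32_t color) {
    // The sketch relies on a genuinely lock-free 32-bit atomic; otherwise a
    // hidden lock would bring the original problem back.
    assert(texel.is_lock_free());
    std::uint32_t seen = texel.load();
    // On failure, compare_exchange_weak reloads `seen` with the current value,
    // so the next attempt blends against fresh data.
    while (!texel.compare_exchange_weak(seen, add_saturated(seen, color))) {
    }
}
```

Several threads can call blend_add on the same texel concurrently; a thread that loses the race simply retries with the new value instead of spinning on a lock that another thread might never get to release.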

In DX12 what Ordering Guarantees do multiple ExecuteCommandLists calls provide?

Assume a single-threaded application. If you call ExecuteCommandLists twice (A and B), is A guaranteed to execute all of its commands on the GPU before any of the commands from B start? The closest thing I can find in the documentation is this, but it doesn't really seem to guarantee that A finishes before B starts:
Applications can submit command lists to any command queue from multiple threads. The runtime will perform the work of serializing these requests in the order of submission.
As a point of comparison, I know that this is explicitly not guaranteed in Vulkan:
vkQueueSubmit is a queue submission command, with each batch defined by an element of pSubmits as an instance of the VkSubmitInfo structure. Batches begin execution in the order they appear in pSubmits, but may complete out of order.
However, I'm not sure if DX12 works the same way.
Frank Luna's book says:
The command lists are executed in order starting with the first array element
However in that context he's talking about calling ExecuteCommandLists once with two command lists (C and D). Do these operate the same as two individual calls? My colleague argues that this still only guarantees that they are started in order, not that C finishes before D starts.
Is there clearer documentation somewhere that I'm missing?
I asked the same question in the DirectX forums. Here's an answer from Microsoft engineer Jesse Natalie:
Calling ExecuteCommandLists twice guarantees that the first workload
(A) finishes before the second workload (B). Calling
ExecuteCommandLists with two command lists allows the driver to merge
the two command lists such that the second command list (D) may begin
executing work before all work from the first (C) has finished.
Specifically, the application is allowed to insert a fence signal or
wait between A and B, and the driver has no visibility into this, so
the driver must ensure that everything in A is complete before the
fence operation. There is no such opportunity in a single call to the
API, so the driver can optimize that scenario.
Source:
http://forums.directxtech.com/index.php?topic=5975.0
Finally the ID3D12CommandQueue is a first-in first-out queue, that stores the correct order of the command lists for submission to the GPU. Only when one command list has completed execution on the GPU, will the next command list from the queue be submitted by the driver.
https://learn.microsoft.com/en-us/windows/win32/direct3d12/porting-from-direct3d-11-to-direct3d-12
This isn't correct. I believe DirectX 12 is the same as Vulkan:
The specification states that commands start execution in-order, but complete out-of-order. Don’t get confused by this. The fact that commands start in-order is simply convenient language to make the spec language easier to write. Unless you add synchronization yourself, all commands in a queue execute out of order.
I've just run into this again. Command list A is not guaranteed to complete before command list B starts, and this creates race conditions:
A writes A reads
↓ ↓
────────────────────
↑ ↑
B writes B reads
Edit: It turns out I was doing something stupid (calling CopyTextureRegion on two buffers), and this was causing a stall (which I could see in PIX), so the work for my next frame was started during this stall, which sometimes resulted in a race condition. Usually the commands for one frame complete before the next start, and if they don't, you will see a gap in PIX where no work is happening in the currently viewed frame's timings.

Do opengl functions cause main thread to freeze?

So when you call OpenGL functions, like glDraw or glBufferData, does it cause the thread of the program to stop and wait for GL to finish the calls?
If not, then how does GL handle calling important functions like glDraw, and then immediately afterwards having a setting changed that affects the draw calls?
No, they (mostly) do not. The majority of GL functions are buffered when called and actually executed later. This means that you cannot think of the CPU and the GPU as two processors working in lockstep. Usually, the CPU issues a bunch of GL functions that get buffered, and the GPU executes them as soon as they are delivered to it. This means that you cannot reliably measure how much time a specific GL function took to execute by just comparing the time before and after its call.
If you want to do that, you need to first run a glFinish() so it will actually wait for all previously buffered GL calls to execute, and then you can start counting, execute the calls that you want to benchmark, call glFinish again to make sure these calls executed as well, and then finish the benchmark.
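The finish, measure, finish pattern just described can be sketched as a small helper. This is only a sketch: time_gpu_work_ms is a made-up name, and the two callables are assumptions. In a real program `finish` would call glFinish and `work` would issue the GL calls to be measured; here they are plain std::function objects so that the helper itself has no dependency on an OpenGL context.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Times `work` in milliseconds, bracketed by two blocking `finish` calls.
// `finish` is expected to block until all previously queued commands are done
// (glFinish in a real OpenGL program).
double time_gpu_work_ms(const std::function<void()>& finish,
                        const std::function<void()>& work) {
    assert(finish && work);  // both callables must be set
    finish();                // drain everything queued before the measurement
    const auto start = std::chrono::steady_clock::now();
    work();                  // issue the calls to benchmark (they only get queued)
    finish();                // wait until those calls have actually executed
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Without the first finish, the measured time would include whatever was still queued from earlier; without the second, it would only cover the cost of queuing the calls.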
On the other hand, I said "mostly". This is because reading functions will actually NEED to synchronize with the GPU to show real results and so, in this case, they DO wait and freeze the main thread.
Edit: I think the explanation itself answers your second question, but just in case: the fact that all calls are buffered in order makes it possible for a draw to complete first and for a setting changed afterwards to apply only to successive calls.
It strictly depends on the OpenGL call in question and the OpenGL state. When you make OpenGL calls, the implementation first queues them up internally and then executes them asynchronously to the calling program's execution. One important concept in OpenGL is the synchronization point: an operation in the work queue that requires the OpenGL call to block until certain conditions are met.
OpenGL objects (textures, buffer objects, etc.) are purely abstract, and by specification the handle of an object in the client program always refers to the data the object has at the time the OpenGL functions that refer to it are called. So take for example this sequence:
glBindTexture(GL_TEXTURE_2D, texID);
glTexImage2D(..., image_1);
draw_textured_quad();
glTexImage2D(..., image_2);
draw_textured_quad();
The first draw_textured_quad may return long before anything has been drawn. However, by making the calls, OpenGL creates an internal reference to the data currently held by the texture. So when glTexImage2D is called a second time, which may happen before the first quad has been drawn, OpenGL must internally create a secondary texture object that is to become texture texID and to be used by the second call of draw_textured_quad. If glTexSubImage2D was called instead, it would even have to make a modified copy of the data.
OpenGL calls will only block if the result of the call modifies client-side memory and depends on data generated by previous OpenGL calls. In other words, when you make OpenGL calls, the OpenGL implementation internally generates a dependency tree to keep track of what depends on what. And when a synchronization point must block, it will block at least until all dependencies are met.
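The texture sequence above can be mimicked with reference-counted storage. This is a toy C++ model, not how any particular driver works: ToyTexture, ToyQueue and their members are made-up names, and a single int stands in for the pixel data. It only shows why a queued draw can keep using image_1 after the texture has been re-specified with image_2.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Toy model (not OpenGL): a texture name whose storage is reference-counted.
struct ToyTexture {
    std::shared_ptr<const int> storage;  // stands in for the pixel data

    // Like glTexImage2D: re-specifying allocates fresh storage instead of
    // overwriting data that queued draws may still reference.
    void respecify(int image) { storage = std::make_shared<int>(image); }
};

struct ToyQueue {
    std::vector<std::function<int()>> pending;

    // Like draw_textured_quad(): returns at once, but keeps a reference to
    // the storage the texture has at call time.
    void draw(const ToyTexture& tex) {
        assert(tex.storage && "texture must be specified before drawing");
        std::shared_ptr<const int> snapshot = tex.storage;
        pending.push_back([snapshot] { return *snapshot; });
    }

    // Executes the queued draws later; each one samples its own snapshot.
    std::vector<int> execute() {
        std::vector<int> sampled;
        for (const auto& cmd : pending) sampled.push_back(cmd());
        pending.clear();
        return sampled;
    }
};
```

Calling respecify(1), draw, respecify(2), draw and only then execute yields the first draw sampling image 1 and the second sampling image 2, even though neither draw had run when the texture was re-specified.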
