In DX12 what Ordering Guarantees do multiple ExecuteCommandLists calls provide? - directx-12

Assume a single-threaded application. If you call ExecuteCommandLists twice (A and B), is A guaranteed to execute all of its commands on the GPU before starting any of the commands from B? The closest thing I can find in the documentation is this, but it doesn't really seem to guarantee that A finishes before B starts:
Applications can submit command lists to any command queue from multiple threads. The runtime will perform the work of serializing these requests in the order of submission.
As a point of comparison, I know that this is explicitly not guaranteed in Vulkan:
vkQueueSubmit is a queue submission command, with each batch defined by an element of pSubmits as an instance of the VkSubmitInfo structure. Batches begin execution in the order they appear in pSubmits, but may complete out of order.
However, I'm not sure if DX12 works the same way.
Frank Luna's book says:
The command lists are executed in order starting with the first array element
However, in that context he's talking about calling ExecuteCommandLists once with two command lists (C and D). Do these operate the same as two individual calls? My colleague argues that this still only guarantees that they are started in order, not that C finishes before D starts.
Is there clearer documentation somewhere that I'm missing?

I asked the same question on the DirectX forums; here's an answer from Microsoft engineer Jesse Natalie:
Calling ExecuteCommandLists twice guarantees that the first workload
(A) finishes before the second workload (B). Calling
ExecuteCommandLists with two command lists allows the driver to merge
the two command lists such that the second command list (D) may begin
executing work before all work from the first (C) has finished.
Specifically, the application is allowed to insert a fence signal or
wait between A and B, and the driver has no visibility into this, so
the driver must ensure that everything in A is complete before the
fence operation. There is no such opportunity in a single call to the
API, so the driver can optimize that scenario.
Source:
http://forums.directxtech.com/index.php?topic=5975.0
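To make the distinction concrete, here is a rough sketch of the two submission patterns (assuming `queue`, `fence`, `fenceValue`, and the recorded command lists already exist; error handling omitted). The fence signal/wait shows the kind of operation the app could insert between the two calls, which is why the driver must fully finish A first:

```cpp
// Two separate submissions: the driver must assume the app may insert
// a fence operation between them, so all of A completes before B starts.
ID3D12CommandList* listsA[] = { cmdListA };
queue->ExecuteCommandLists(1, listsA);

// The app is free to do this, and the driver cannot know in advance:
queue->Signal(fence, ++fenceValue);  // GPU-side signal once A is done
queue->Wait(fence, fenceValue);      // queue waits before running B

ID3D12CommandList* listsB[] = { cmdListB };
queue->ExecuteCommandLists(1, listsB);

// One submission with both lists: no fence can be inserted between C
// and D, so the driver may merge them and let D overlap the tail of C.
ID3D12CommandList* listsCD[] = { cmdListC, cmdListD };
queue->ExecuteCommandLists(2, listsCD);
```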

Finally the ID3D12CommandQueue is a first-in first-out queue, that stores the correct order of the command lists for submission to the GPU. Only when one command list has completed execution on the GPU, will the next command list from the queue be submitted by the driver.
https://learn.microsoft.com/en-us/windows/win32/direct3d12/porting-from-direct3d-11-to-direct3d-12
This isn't correct. I believe DirectX 12 behaves the same as Vulkan here:
The specification states that commands start execution in-order, but complete out-of-order. Don’t get confused by this. The fact that commands start in-order is simply convenient language to make the spec language easier to write. Unless you add synchronization yourself, all commands in a queue execute out of order
I've just run into this again. Command list A is not guaranteed to complete before command list B starts, and this creates race conditions:
A writes   A reads
   ↓          ↓
────────────────────
   ↑          ↑
B writes   B reads
Edit: It turns out I was doing something stupid (calling CopyTextureRegion on two buffers) and this was causing a stall (which I could see in PIX), so the work for my next frame started during this stall, sometimes resulting in a race condition. Usually the commands for one frame complete before the next frame's start; if they don't, you will see a gap in PIX where no work is happening in the currently viewed frame's timings.

Related

How to check what progress guarantee a concurrent program follows?

I was working on some concurrent programs for the past few weeks and was wondering if there is any tool that can automatically detect what type of progress condition a program's operations guarantee, that is, whether it is wait-free, lock-free, or obstruction-free.
I searched online and didn't find any such tool.
Can anyone tell me how to deduce the progress condition of a program?
Assume that I have a program called a wait-freedom decider that can read a concurrent program describing a data structure and detect whether it is wait-free, i.e. "one that guarantees that any process can complete any operation in a finite number of steps", à la Herlihy's "Wait-Free Synchronization". Then, given a single-threaded program P, create a program that we will feed into the wait-freedom decider:
class DataStructure:
    def operation(self):
        P  # the single-threaded program P is inlined here
        pass
Now DataStructure.operation completes in a finite number of steps if and only if P halts.
This would solve the halting problem. That's impossible, so, by contradiction, we must not be able to create a wait-freedom decider.

OpenGL commands - sequential or parallel

I'm reading this document
and I have a question about this sentence:
While OpenGL explicitly requires that commands are completed in order,
that does not mean that two (or more) commands cannot be concurrently
executing. As such, it is possible for shader invocations from one
command to be executing in tandem with shader invocations from other
commands.
Does this mean that, for example, when I issue two consecutive glDrawArrays calls it is possible that the second call is processed immediately before the first one has finished?
My first idea was that the OpenGL calls merely map to internal commands of the GPU and that the OpenGL call returns immediately, without those commands having completed, thus enabling the second OpenGL call to issue its own internal commands. The internal commands created by the OpenGL calls can then be parallelized.
What it says is that the exact order in which the commands are executed, and any concurrency, is left to the judgement of the implementation, with the only constraint being that the final result must look exactly as if all the commands had been executed one after another, in the very order they were called by the client program.
EDIT: Certain OpenGL calls cause an implicit or explicit synchronization, for example reading back pixels or waiting for a synchronization event.
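For the explicit case, a sync object is the standard tool. A sketch, assuming a GL 3.2+ context and an existing `vertexCount` (normally you would avoid a client-side stall like this unless you really need it):

```cpp
// Draw, then drop a fence into the command stream.
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// Block on the CPU (up to 1 s) until everything before the fence is done.
const GLuint64 timeoutNs = 1000000000;
glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeoutNs);
glDeleteSync(fence);

// The next draw now cannot overlap the previous one.
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
```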

OpenCV -- safe to enqueue the same function with different data on same Stream?

I'm trying to optimise my OpenCV code to run on the GPU. The problem is that there seem to be conflicting opinions on what is and isn't safe to run on the GPU.
In the thread here (how to use gpu::Stream in OpenCV?), the answer states:
Currently, you may face problems if same operation is enqueued twice with different data to different streams.
I would be happy to solve this by enqueuing these operations onto the same stream. However, in the document here http://on-demand.gputechconf.com/gtc/2013/webinar/gtc-express-itseez-opencv-webinar.pdf the author writes (slide 28):
Current limitation:
– Unsafe to enqueue the same GPU operation multiple times
And he shows an example in which he states that it's unsafe to enqueue the same operation on the same stream, too.
I am confused -- would it be safe for me to enqueue the same operation on the same Stream, or not? Does anyone know?
Intuitively, I would have thought it to be OK, since the same stream would, I imagine, run in serial, so the two functions would never try to concurrently access the same data. But I'd really like confirmation before I implement something.
Thank you for your help!

C++ synced stages multi thread pipeline

Sorry if this has been asked, I did my best to search for a dup before asking...
I am implementing a video processing application that is supposed to run real time. The processing it does can be easily divided into 4 stages, each one operating on the intermediate values generated from the previous stage. The processing has become heavier than what can be processed in 1/30th of a second, but if I can split this application in 4 threads and turn it into a pipeline, each stage takes less than that and the whole thing would run realtime (with a 4 frame lag, which is completely acceptable).
I'm fairly new to multithreading programming, and the problem I'm having is, I can't find a mechanism to start/stop each thread at the beginning of each frame, so they all march together, delivering one finished frame every "cycle" at the end. All the frameworks/libraries I found seem to worry about load balancing using queues and worker threads, but this is not what I need here. Four threads will do, assuming I can keep them synced.
Can anybody point me to a starting point, using C++?
Thanks.
Assuming a 4 frame lag is acceptable, you could use a pool of list nodes, each with a pointer to a frame buffer and a pointer to the intermediate values (a NULL pointer could be used to indicate end of a stream). Each thread would have its own list as part of a multithread messaging system. The first thread would get a frame node from a free pool, do its processing, and send the node to the next thread's list, and so on, with the last thread returning nodes back to the free pool.
Here is a link to an example file copy program that spawns a thread to do the writes. It uses Windows threading, mutexes, and semaphores in the messaging functions, but the messaging functions are simple and could be changed internally to use generic equivalents without changing their interface. The main() function could be changed to use generic threading and setup of the mutexes and semaphores or something equivalent.
mtcopy.zip
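The node-passing scheme above can be sketched in portable C++ with a small blocking queue in place of the Windows messaging functions (the four stage functions here are placeholder integer transforms standing in for the real frame processing):

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// A minimal thread-safe blocking queue; an empty optional marks end of stream.
template <typename T>
class BlockingQueue {
public:
    void push(std::optional<T> v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        std::optional<T> v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::optional<T>> q_;
};

// Each stage pops from its input queue, transforms, and pushes downstream.
void stage(BlockingQueue<int>& in, BlockingQueue<int>& out, int (*f)(int)) {
    while (auto frame = in.pop()) out.push(f(*frame));
    out.push(std::nullopt);  // propagate end-of-stream
}

std::vector<int> runPipeline(const std::vector<int>& frames) {
    BlockingQueue<int> q0, q1, q2, q3, q4;
    std::thread t1(stage, std::ref(q0), std::ref(q1), +[](int x) { return x + 1; });
    std::thread t2(stage, std::ref(q1), std::ref(q2), +[](int x) { return x * 2; });
    std::thread t3(stage, std::ref(q2), std::ref(q3), +[](int x) { return x - 3; });
    std::thread t4(stage, std::ref(q3), std::ref(q4), +[](int x) { return x * x; });
    for (int f : frames) q0.push(f);
    q0.push(std::nullopt);  // end of stream
    std::vector<int> out;
    while (auto r = q4.pop()) out.push_back(*r);
    t1.join(); t2.join(); t3.join(); t4.join();
    return out;
}
```

Each stage runs in its own thread and blocks on its input queue, so the stages naturally march in lockstep once the pipeline fills; a bounded queue (or the free-node pool described above) would cap the lag at a fixed number of frames.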

Simultaneous Execution of Functions

I am creating an application which must execute a function that takes too long (let's call it slowfunc()), which is a problem since my application works with a live video feed. Running this function every frame severely affects the frame rate.
Is there a way to run slowfunc() in the background without using threading? I don't necessarily need it to run every frame, but every time it finishes, I'd like to examine the output. The only thing I can think of right now is to split up slowfunc() into several "mini-functions" which would each take approximately an equal amount of time, then run one minifunction per frame. However, slowfunc() is a relatively complex function, and I feel that there should be (hopefully is) a way to do this simply.
EDIT: I can't use threading because this program will eventually be used on a tiny robot processor which probably will not support threading. I guess I can use "cooperative multitasking". Thanks for your help!
Run it in a thread, and after the calculation is finished, make the thread sleep until another calculation is ready to be run. That way you are not hit with the cost of initializing the thread every time.
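That pattern might look like this in standard C++ (a sketch; `BackgroundWorker` and its members are made-up names). The thread is created once and sleeps on a condition variable between jobs:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// A single persistent worker: submit() hands it a job, and the thread
// sleeps on a condition variable between jobs instead of being recreated.
class BackgroundWorker {
public:
    BackgroundWorker() : thread_([this] { run(); }) {}
    ~BackgroundWorker() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_one();
        thread_.join();
    }
    // Returns false if a job is still running (caller can retry next frame).
    bool submit(std::function<void()> job) {
        std::lock_guard<std::mutex> lk(m_);
        if (busy_) return false;
        job_ = std::move(job);
        busy_ = true;
        cv_.notify_one();
        return true;
    }
    bool busy() {
        std::lock_guard<std::mutex> lk(m_);
        return busy_;
    }
private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [&] { return stop_ || busy_; });
            if (stop_) return;
            std::function<void()> job = std::move(job_);
            job_ = nullptr;
            lk.unlock();
            job();           // run the slow work outside the lock
            lk.lock();
            busy_ = false;
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::function<void()> job_;
    bool busy_ = false;
    bool stop_ = false;
    std::thread thread_;  // declared last so it starts after the other members
};
```

Each frame, the main loop can call busy() to check whether the previous slowfunc() run has finished, examine the output if so, and submit() the next run.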
You're asking for simultaneous execution. There are two ways to do this:
a) Multithreading: create another thread to run in the background.
b) Multiprocessing: create another process, take all inputs required for the function via a shared-memory model, set up a synchronisation mechanism with the original (parent) process, and execute the function there.
The first is normally preferred, since execution is faster.
The second guarantees that if the function crashes, your parent process still runs, although that is somewhat irrelevant, since why would you want your child (the function) to crash? It also needs more memory.
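Option (b) can be sketched with POSIX primitives (a pipe instead of shared memory, for brevity; slow_work() and run_in_child() are made-up names). If the child crashes, the parent simply sees a failed read or a bad exit status:

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Stand-in for the actual slow function.
int slow_work(int input) { return input * input; }

// Run slow_work() in a child process; the parent reads the result from a pipe.
bool run_in_child(int input, int* result) {
    int fds[2];
    if (pipe(fds) != 0) return false;
    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return false; }
    if (pid == 0) {                       // child process
        close(fds[0]);
        int r = slow_work(input);
        if (write(fds[1], &r, sizeof r) != (ssize_t)sizeof r) _exit(1);
        _exit(0);                         // never return into the parent's code
    }
    close(fds[1]);                        // parent process
    bool ok = read(fds[0], result, sizeof *result) == (ssize_t)sizeof *result;
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return ok && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```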