Do two consecutive DirectX 12 Dispatch() calls run sequentially or concurrently on the GPU? - c++

When running two Dispatch() calls consecutively, like:
m_computeCommandList->Dispatch(111, 1, 1);
m_computeCommandList->Dispatch(555, 1, 1);
Is it guaranteed that the second Dispatch() will run after the first Dispatch() on the GPU? Or, could they run concurrently on the GPU?
Just to clarify, there is no more C++ code in between those two Dispatch() calls.

As in other graphics APIs, executing command calls on the CPU side records those commands into a command queue, which guarantees they are processed in queue order, first-in-first-out.
On the GPU, however, everything becomes massively parallel and concurrent. We can't know which processing unit the actual execution will be scheduled on, or which threads from which Dispatch will finish earlier. Typically this is not a problem if no resources (buffers, textures) are shared between invocations and we only need to synchronize at the end of the frame.
If resources are shared, memory hazards ("write-read", "write-write" or "read-write") become possible. Here we need resource barriers, which let us order access to those resources. With the right barrier options you can force consecutive execution of different Dispatch calls.
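For the common case of two dispatches accessing the same UAV, a UAV barrier between them is the usual tool. A minimal sketch, assuming a hypothetical m_buffer that both dispatches access through a UAV:
D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
barrier.UAV.pResource = m_buffer.Get();  // hypothetical resource both dispatches touch

m_computeCommandList->Dispatch(111, 1, 1);
m_computeCommandList->ResourceBarrier(1, &barrier);  // makes the 1st dispatch's UAV writes visible
m_computeCommandList->Dispatch(555, 1, 1);           // now starts only after the 1st completes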
As another example, a transition from D3D12_RESOURCE_STATE_UNORDERED_ACCESS to D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE|D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE waits for ALL preceding graphics and compute shader execution to complete, and blocks ALL subsequent graphics and compute shader execution.
Enhanced barriers in DirectX 12 give you finer-grained control over resource and execution synchronization.
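Where enhanced barriers are available (ID3D12GraphicsCommandList7 and a supporting device/Agility SDK), the ordering can be expressed so that only compute work touching the buffer is serialized, rather than all preceding shader execution. A sketch under those assumptions, with m_buffer again hypothetical:
D3D12_BUFFER_BARRIER bufferBarrier = {};
bufferBarrier.SyncBefore   = D3D12_BARRIER_SYNC_COMPUTE_SHADING;
bufferBarrier.SyncAfter    = D3D12_BARRIER_SYNC_COMPUTE_SHADING;
bufferBarrier.AccessBefore = D3D12_BARRIER_ACCESS_UNORDERED_ACCESS;
bufferBarrier.AccessAfter  = D3D12_BARRIER_ACCESS_UNORDERED_ACCESS;
bufferBarrier.pResource    = m_buffer.Get();
bufferBarrier.Offset       = 0;
bufferBarrier.Size         = UINT64_MAX;  // the whole buffer

D3D12_BARRIER_GROUP group = {};
group.Type            = D3D12_BARRIER_TYPE_BUFFER;
group.NumBarriers     = 1;
group.pBufferBarriers = &bufferBarrier;

m_computeCommandList->Barrier(1, &group);  // recorded between the two Dispatch calls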

Related

Compute shader image and atomic coherency [duplicate]

I previously ran into the problem of wanting to blend color values in an image unit by doing something like:
vec4 texelCol = imageLoad(myImage, myTexel);
imageStore(myImage, myTexel, texelCol+newCol);
In a scenario where multiple fragments can have the same value for 'myTexel', this apparently isn't possible, because the imageLoad and imageStore commands can't be made atomic as a pair, and other shader invocations could change the texel color in between.
Now someone told me that people work around this problem by creating semaphores using the atomic commands on uint textures: the shader waits in a while loop before accessing the texel, and as soon as it is free, atomically writes into the integer texture to block other fragment shader invocations, processes the color texel, and when finished atomically frees the integer texel again.
But I can't get my head around how this could really work, or what such code would look like.
Is it really possible to do this? Can a GLSL fragment shader be made to wait in a while loop? If it's possible, can someone give an example?
Basically, you're just implementing a spinlock. Only instead of one lock variable, you have an entire texture's worth of locks.
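What that pattern looks like in plain C++ is shown below; a minimal sketch, where the GLSL variant would replace the std::atomic operations with imageAtomicCompSwap/imageAtomicExchange on a r32ui lock image, one lock per texel:
#include <atomic>

std::atomic<unsigned> lockTexel{0};  // 0 = free, 1 = taken; one such lock per texel in GLSL

void blendTexel() {
    unsigned expected = 0;
    // Spin until we own the lock.
    while (!lockTexel.compare_exchange_weak(expected, 1)) {
        expected = 0;  // on failure, compare_exchange wrote the observed value back
    }
    // ... critical section: load the texel, blend, store the texel ...
    lockTexel.store(0);  // release the lock
}
On a CPU this works because the scheduler guarantees every runnable thread eventually runs; as explained below, a GPU makes no such promise.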
Logically, what you're doing makes sense. But as far as OpenGL is concerned, this won't actually work.
See, the OpenGL shader execution model states that invocations execute in an order which is largely undefined relative to one another. But spinlocks only work if there is a guarantee of forward progress among the various threads. Basically, spinlocks require that the spinning thread cannot starve the execution system of the chance to run the thread it is waiting on.
OpenGL provides no such guarantee. Which means that it is entirely possible for one thread to lock a pixel, then stop executing (for whatever reason), while another thread comes along and blocks on that pixel. The blocked thread never stops executing, and the thread that owns the lock never restarts execution.
How might this happen in a real system? Well, let's say you have a fragment shader invocation group executing on some fragments from a triangle. They all lock their pixels. But then they diverge in execution due to a conditional branch within the locking region. Divergence of execution can mean that some of those invocations get transferred to a different execution unit. If there are none available at the moment, then they effectively pause until one becomes available.
Now, let's say that some other fragment shader invocation group comes along and gets assigned an execution unit before the divergent group. If that group tries to spinlock on pixels from the divergent group, it is essentially starving the divergent group of execution time, waiting on an event that will never happen.
Now obviously, in real GPUs there is more than one execution unit, but you can imagine that with lots of invocation groups out there, it is entirely possible for such a scenario to occasionally jam up the works.

Correct place to use cudaSetDeviceFlags?

Win10 x64, CUDA 8.0, VS2015, 6-core CPU (12 logical cores), 2 GTX580 GPUs.
In general, I'm working on a multithreaded application that launches 2 threads associated with the 2 GPUs available; these threads are stored in a thread pool.
Each thread does the following initialization procedure upon its launch (i.e. this is done only once during the runtime of each thread):
::cudaSetDevice(nDevice);  // nDevice is 0 or 1, as we have only two GPUs
::cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
::cudaSetDeviceFlags(cudaDeviceMapHost | cudaDeviceScheduleBlockingSync);
Then, from other worker threads (12 more threads that do not touch the GPUs at all), I begin feeding these 2 GPU-associated worker threads with data. It works perfectly as long as the number of GPU threads launched equals the number of physical GPUs available.
Now I want to launch 4 GPU threads (i.e. 2 threads per GPU) and make each one work via a separate CUDA stream. I know the requirements essential for proper CUDA stream usage, and I meet all of them. What I'm failing on is the initialization procedure mentioned above.
As soon as this procedure is executed a second time from a different GPU thread for the same GPU, ::cudaSetDeviceFlags(...) starts failing with the "cannot set while device is active in this process" error message.
I have looked into the manual and I think I see why this happens; what I can't understand is how to use ::cudaSetDeviceFlags(...) properly for my setup.
I can comment out this ::cudaSetDeviceFlags(...) line and the program will work fine even with 8 threads per GPU, but I need the cudaDeviceMapHost flag to be set in order to use streams; pinned memory won't be available otherwise.
EDIT Extra info to consider:
1. If ::cudaSetDeviceFlags is called before ::cudaSetDevice, then no error occurs.
2. Each GPU thread allocates a chunk of pinned memory via the ::VirtualAlloc -> ::cudaHostRegister approach upon thread launch (this works just fine no matter how many GPU threads are launched) and deallocates it upon thread termination (via ::cudaHostUnregister -> ::VirtualFree). ::cudaHostUnregister fails with "pointer does not correspond to a registered memory region" for half the threads if the number of threads per GPU is greater than 1.
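For reference, the registration pattern described in item 2 looks roughly like this (a sketch; the size and flags are illustrative and error handling is trimmed):
const size_t nBytes = size_t(64) << 20;  // 64 MiB, illustrative only
void* p = ::VirtualAlloc(nullptr, nBytes, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
::cudaHostRegister(p, nBytes, cudaHostRegisterMapped);  // pin + map for the device (needs cudaDeviceMapHost)

// ... the thread's working lifetime ...

::cudaHostUnregister(p);           // exactly once per registered range
::VirtualFree(p, 0, MEM_RELEASE);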
Well, the highly sophisticated try-this, try-that, see-what-happens, try-again method finally did the trick, as always.
Here is the excerpt from the documentation on ::cudaSetDeviceFlags():
Records flags as the flags to use when initializing the current device. If no device has been made current to the calling thread, then flags will be applied to the initialization of any device initialized by the calling host thread, unless that device has had its initialization flags set explicitly by this or any host thread.
Consequently, in the GPU worker thread it is necessary to call ::cudaSetDeviceFlags() before ::cudaSetDevice().
I have implemented something like this in the GPU thread initialization code, in order to make sure that the device flags set before the device is made current are actually applied properly:
bse__throw_CUDAHOST_FAILED(::cudaSetDeviceFlags(nFlagsOfDesire)); // record the flags first (bse__throw_* are the project's own error-check macros)
bse__throw_CUDAHOST_FAILED(::cudaSetDevice(nDevice));             // only then make the device current
unsigned int nDeviceFlagsActual = 0;
bse__throw_CUDAHOST_FAILED(::cudaGetDeviceFlags(&nDeviceFlagsActual));
bse__throw_IF(nFlagsOfDesire != nDeviceFlagsActual);              // verify the flags actually took effect
Also, the comment of talonmies showed the way to resolve the ::cudaHostUnregister errors.

OpenGL commands - sequential or parallel

I'm reading this document
and I have a question about this sentence:
While OpenGL explicitly requires that commands are completed in order, that does not mean that two (or more) commands cannot be concurrently executing. As such, it is possible for shader invocations from one command to be executing in tandem with shader invocations from other commands.
Does this mean that, for example, when I issue two consecutive glDrawArrays calls it is possible that the second call is processed immediately before the first one has finished?
My first idea was that OpenGL calls merely map to internal commands of the GPU, and that the OpenGL call returns immediately before those commands have completed, thus enabling the second OpenGL call to issue its own internal commands. The internal commands created by the OpenGL calls can then be parallelized.
What it says is that the exact order in which the commands are executed, and any concurrency, is left to the judgement of the implementation, with the only constraint being that the final result must look exactly as if all the commands had been executed one after another in the very order they were issued by the client program.
EDIT: Certain OpenGL calls cause an implicit or explicit synchronization; reading back pixels, for example, or waiting for a synchronization event.
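If you ever need to forbid that overlap yourself, sync objects are the explicit tool. A minimal sketch using standard OpenGL 3.2+ fences (context creation is omitted, and firstVertexCount/secondVertexCount are assumed names):
glDrawArrays(GL_TRIANGLES, 0, firstVertexCount);

// Insert a fence and make the CPU wait until all prior commands have finished.
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                 GLuint64(1000000000));  // timeout in nanoseconds (1 s)
glDeleteSync(fence);

glDrawArrays(GL_TRIANGLES, 0, secondVertexCount);  // cannot overlap the first draw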

Parallel Thread Execution to achieve performance

I am a little bit confused about multithreading. Normally we create multiple threads to break the main process into subtasks, to achieve responsiveness and to remove waiting time.
But here I have a situation where I have to execute the same task using multiple threads in parallel.
My processor can execute 4 threads in parallel, so will it improve performance if I create more than 4 threads (10 or more)? When I put this question to my colleague, he said nothing bad will happen; we already run many threads in many other applications (browser threads, kernel threads, etc.), so he advised creating multiple threads for the same task.
But won't creating more than 4 threads that execute in parallel cause more context switches and decrease performance?
Or, even though we create multiple threads to execute in parallel, will they execute one after the other, so that performance stays the same?
So what should I do in the above situation, and are these assumptions correct?
edit
1 thread: time to process is 120 seconds.
2 threads: time to process is about 60 seconds.
3 threads: time to process is still about 60 seconds (no change from 2 threads).
Is it because my hardware can only run 2 threads (being dual-core)?
Software thread = a piece of code.
Hardware thread = a core (processor) that runs a software thread.
So my CPU supports only 2 concurrent threads; if I purchase an AMD CPU with 8 or 12 cores, can I achieve higher performance?
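(Aside: the C++ standard library can report how many concurrent threads the hardware supports, which answers the "how many can my hardware run" part directly; a minimal sketch:)
#include <iostream>
#include <thread>

int main() {
    // Number of concurrent threads the hardware supports, e.g. 2 on a
    // dual-core CPU without hyper-threading; may return 0 if unknown.
    std::cout << std::thread::hardware_concurrency() << '\n';
}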
Multi-Tasking is pretty complex and performance gains usually depend a lot on the problem itself:
Only part of the application can be run in parallel (there is always a first part that splits the work into multiple tasks). So the first question is: how much of the work can be done in parallel, and how much of it needs to be synchronized? (In some cases you can stop here, because so little can be done in parallel that the whole effort isn't worth it.)
Multiple tasks may depend on each other (one task may need the result of another task). These tasks cannot be executed in parallel.
Multiple tasks may work on the same data/resources (read/write situation). Here we need to synchronize access to that data/those resources. If all tasks need write access to the same object during the WHOLE process, then we cannot work in parallel.
Basically this means that without the exact definition of the problem (dependencies between tasks, dependencies on data, amount of parallel tasks, ...) it's very hard to tell how much performance you'll gain by using multiple threads (and if it's really worth it).
http://en.wikipedia.org/wiki/Amdahl%27s_law
Amdahl's law states, in a nutshell, that the performance boost you receive from parallel execution is limited by the part of your code that must run sequentially.
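To make that concrete, here is a tiny, self-contained C++ sketch of the formula (the parallel fraction p = 0.95 is purely illustrative, not derived from the numbers in the question):
#include <cstdio>
#include <initializer_list>

// Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n),
// where p is the fraction of the work that can run in parallel.
double amdahl(double p, int n) { return 1.0 / ((1.0 - p) + p / n); }

int main() {
    for (int n : {1, 2, 4, 8, 16}) {
        std::printf("n=%2d  speedup=%.2fx\n", n, amdahl(0.95, n));
    }
}
Even with 95% of the work parallelizable, the speedup flattens out quickly, which is why adding threads beyond the hardware's concurrency rarely helps.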
Without knowing your problem space here are some general things you should look at:
Refactor to eliminate mutex/locks. By definition they force code to run sequentially.
Reduce context-switch overhead by pinning threads to physical cores (see the sketch after this list). This becomes more complicated when threads must wait for work (i.e. blocking on IO), but in general you want to keep your cores as busy as possible running your program, not switching out threads.
Unless you absolutely need to use threads and sync primitives, try to use a task scheduler or parallel algorithms library to parallelize your work. Examples would be Intel TBB, Thrust or Apple's libDispatch.
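As a concrete illustration of the pinning suggestion above, here is a minimal Windows C++ sketch (the function name is hypothetical, and core numbering / hyper-threading pairs are machine-specific):
#include <windows.h>

// Pins the calling thread to a single logical core so the scheduler
// stops migrating it between cores.
void PinCurrentThreadToCore(DWORD core) {
    DWORD_PTR mask = DWORD_PTR(1) << core;
    ::SetThreadAffinityMask(::GetCurrentThread(), mask);
}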

Windows C++ Process vs Thread

In Windows C++, CreateThread() causes some of the threads to slow down if one thread is doing a very CPU-intensive operation. Will CreateProcess() alleviate this? If so, does CreateProcess() imply the code must reside in a second executable, or can this all take place inside the same executable?
The major difference between a process and a thread is that each process has its own memory space, while threads share the memory space of the process that they are running within.
If a thread is truly CPU bound, it will only slow another thread if they are both executing on the same processor core. CreateProcess will not alleviate this, since a process would still have the same issue.
Also, what kind of machine are you running this on? Does it have more than one core?
Not likely - a process is much "heavier" than a thread, so it is likely to be slower still. I'm not sure what you're asking about the second executable, but you can use CreateProcess on the same .exe.
http://msdn.microsoft.com/en-us/library/ms682425(v=vs.85).aspx
It sounds like you're chasing down some performance issues, so perhaps trying out a threading-oriented profiler would be helpful: http://software.intel.com/en-us/articles/using-intel-thread-profiler-for-win32-threads-philosophy-and-theory/
Each process provides the resources needed to execute a program. A process has a virtual address space, executable code, open handles to system objects, a security context, a unique process identifier, environment variables, a priority class, minimum and maximum working set sizes, and at least one thread of execution. Each process is started with a single thread, often called the primary thread, but can create additional threads from any of its threads.
A thread is the entity within a process that can be scheduled for execution. All threads of a process share its virtual address space and system resources. In addition, each thread maintains exception handlers, a scheduling priority, thread local storage, a unique thread identifier, and a set of structures the system will use to save the thread context until it is scheduled. The thread context includes the thread's set of machine registers, the kernel stack, a thread environment block, and a user stack in the address space of the thread's process. Threads can also have their own security context, which can be used for impersonating clients.
CreateProcess and CreateThread both add more execution to what is a resource-limited environment. This means that no matter how you do parallel processing, at some point in time your other lines of execution will impede the current one. It is for this reason that, for very large problems suited to parallelization, distributed systems are used. There are pluses and minuses to both threads and processes.
Threads
Threads allow separate execution inside one address space, meaning you can share data variables and object instances very easily; however, it also means you run into many more synchronization issues. These are painful and, as you can see from the sheer number of API functions involved, not a light subject. Threads are lighter weight on Windows than processes, and as such spin up and down faster and use fewer resources to maintain. Threads also suffer in that one thread can cause the entire process to fail.
Processes
Processes each have their own address space and as such protect themselves from being brought down by another process, but they lack the ability to communicate easily. Any communication will necessarily involve some type of IPC (pipes, TCP, ...).
The code does not have to be in a second executable; just two instances need to run.
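To illustrate that last point, a minimal hedged sketch of relaunching the currently running executable with CreateProcess (function name hypothetical, error handling trimmed):
#include <windows.h>

// Starts a second instance of this same .exe as a separate process.
void LaunchSecondInstance() {
    wchar_t path[MAX_PATH];
    ::GetModuleFileNameW(nullptr, path, MAX_PATH);  // path of the running executable

    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi = {};
    if (::CreateProcessW(path, nullptr, nullptr, nullptr, FALSE,
                         0, nullptr, nullptr, &si, &pi)) {
        ::CloseHandle(pi.hThread);   // we don't need the handles here
        ::CloseHandle(pi.hProcess);
    }
}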
That would make things worse. When switching threads, the CPU needs to swap out only a few registers. Since all threads of a process share the same memory, there's no need to flush the cache. But when switching between processes, you also switch mapped memory. Therefore, the CPU has to flush the L1 cache. That's painful.
(L2 cache is physically mapped, i.e. uses hardware addresses. Those don't change, of course.)