I thought I had a grasp of this, but apparently I do not :) I need to perform parallel H.264 stream encoding with NVENC from frames that are not in any of the formats accepted by the encoder, so I have the following code pipeline:
A callback informing that a new frame has arrived is called
I copy the frame to CUDA memory and perform the needed color space conversions (only the first cuMemcpy is synchronous, so I can return from the callback; all pending operations are pushed into a dedicated stream)
I push an event onto the stream and have another thread waiting for it; as soon as it is signaled, I take the CUDA memory pointer holding the frame in the correct color space and feed it to the encoder (sketched below)
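For illustration, here is a minimal sketch of that per-frame path using the driver API. Error checks, the actual conversion kernel launch, and buffer management are omitted, and all names (Transcoder, onFrameArrived, encoderThread) are placeholders rather than my real code:

#include <cuda.h>
#include <cstddef>

struct Transcoder {
    CUcontext   ctx;     // single context shared by all transcoder threads
    CUstream    stream;  // one stream per transcoder
    CUevent     done;    // signals "converted frame is ready"
    CUdeviceptr raw;     // frame as delivered by the callback
    CUdeviceptr nv12;    // frame converted to an NVENC-friendly format
};

// Capture callback: copy, queue the conversion, return.
void onFrameArrived(Transcoder& t, const void* hostFrame, size_t bytes) {
    cuCtxSetCurrent(t.ctx);
    cuMemcpyHtoD(t.raw, hostFrame, bytes);   // synchronous, so the host buffer can be reused
    // ... launch the color-space conversion kernel(s) on t.stream (asynchronous) ...
    cuEventRecord(t.done, t.stream);         // marks the end of the queued work
}

// Dedicated encoder thread.
void encoderThread(Transcoder& t) {
    cuCtxSetCurrent(t.ctx);
    cuEventSynchronize(t.done);              // block until copy + conversion have finished
    // t.nv12 now holds the converted frame; hand it to NVENC here
}

The stream and event are created once per transcoder at startup, e.g. with cuStreamCreate(&t.stream, CU_STREAM_NON_BLOCKING) and cuEventCreate(&t.done, CU_EVENT_DISABLE_TIMING).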
For some reason I had assumed that I need a dedicated context for each thread if I run this pipeline in parallel threads. The code was slow, and after some reading I understood that context switching is actually expensive, and then I came to the conclusion that it makes no sense anyway, since a context owns the whole GPU, so I lock out any parallel processing from other transcoder threads.
Question 1: In this scenario am I good with using a single context and an explicit stream created on this context for each thread that performs the mentioned pipeline?
Question 2: Can someone enlighten me on what is the sole purpose of the CUDA device context? I assume it makes sense in a multiple GPU scenario, but are there any cases where I would want to create multiple contexts for one GPU?
Question 1: In this scenario am I good with using a single context and an explicit stream created on this context for each thread that performs the mentioned pipeline?
You should be fine with a single context.
Question 2: Can someone enlighten me on what is the sole purpose of the CUDA device context? I assume it makes sense in a multiple GPU scenario, but are there any cases where I would want to create multiple contexts for one GPU?
The CUDA device context is discussed in the programming guide. It represents all of the state (memory map, allocations, kernel definitions, and other state-related information) associated with a particular process (i.e. associated with that particular process' use of a GPU). Separate processes will normally have separate contexts (as will separate devices), as these processes have independent GPU usage and independent memory maps.
If you have multi-process usage of a GPU, you will normally create multiple contexts on that GPU. As you've discovered, it's possible to create multiple contexts from a single process, but not usually necessary.
And yes, when you have multiple contexts, kernels launched in those contexts will require context switching to go from one kernel in one context to another kernel in another context. Those kernels cannot run concurrently.
CUDA runtime API usage manages contexts for you. You normally don't explicitly interact with a CUDA context when using the runtime API. However, in driver API usage, the context is explicitly created and managed.
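To make the difference concrete, here is a minimal sketch of explicit context management with the driver API; with the runtime API the same state is handled implicitly through the device's primary context (the first runtime call such as cudaMalloc initializes it, and cudaSetDevice selects which device's primary context is used):

#include <cuda.h>

int main() {
    cuInit(0);                    // must precede any other driver API call

    CUdevice dev;
    cuDeviceGet(&dev, 0);         // first GPU

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);    // explicit context, made current for this thread

    // ... allocate memory, load modules, launch kernels, create streams ...

    cuCtxDestroy(ctx);            // explicit teardown
    return 0;
}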
Obviously a few years have passed, but NVENC/NVDEC now appear to have CUstream support as of version 9.1 (circa September 2019) of the video codec SDK: https://developer.nvidia.com/nvidia-video-codec-sdk/download
NEW to 9.1- Encode: CUStream support in NVENC for enhanced parallelism between CUDA pre-processing and NVENC encoding
I'm super new to CUDA, but my basic understanding is that CUcontexts allow multiple processes to use the GPU (by doing context swaps that interrupt each other's work), while CUstreams allow for a coordinated sharing of the GPU's resources from within a single process.
Recently I have been reading https://github.com/ARM-software/vulkan_best_practice_for_mobile_developers/blob/master/samples/vulkan_basics.md, and it says:
OpenGL ES uses a synchronous rendering model, which means that an API call must behave as if all earlier API calls have already been processed. In reality no modern GPU works this way, rendering workloads are processed asynchronously and the synchronous model is an elaborate illusion maintained by the device driver. To maintain this illusion the driver must track which resources are read or written by each rendering operation in the queue, ensure that workloads run in a legal order to avoid rendering corruption, and ensure that API calls which need a data resource block and wait until that resource is safely available.
Vulkan uses an asynchronous rendering model, reflecting how the modern GPUs work. Applications queue rendering commands into a queue, use explicit scheduling dependencies to control workload execution order, and use explicit synchronization primitives to align dependent CPU and GPU processing.
The impact of these changes is to significantly reduce the CPU overhead of the graphics drivers, at the expense of requiring the application to handle dependency management and synchronization.
Could someone help explain why an asynchronous rendering model can reduce CPU overhead? After all, in Vulkan you still have to track state yourself.
Could someone help explain why an asynchronous rendering model can reduce CPU overhead?
First of all, let's get back to the original statement you are referring to, emphasis mine:
The impact of these changes is to significantly reduce the CPU overhead of the graphics drivers, [...]
So the claim here is that the driver itself will need to consume less CPU time, which is easy to see: it can forward your requests more or less "as-is".
However, one overall goal of a low-level rendering API like Vulkan is also a potentially reduced CPU overhead in general, not only in the driver.
Consider the following example: You have a draw call which renders to a texture. And then you have another draw call which samples from this texture.
To get the implicit synchronization right, the driver has to track the usage of this texture, both as render target and as source for texture sampling operations.
It doesn't know in advance whether the next draw call will need any resources that are still to be written to by previous draw calls. It has to track every possible such conflict, no matter whether it can occur in your application or not, and it also must be extremely conservative in its decisions. You might have a texture bound to a framebuffer for a draw call, and you may know that, with the actual uniform values you set for this shader, the texture is not modified. But the GPU driver can't know that. If it can't rule out, with absolute certainty, that a resource is modified, it has to assume it is.
However, your application will more likely know such details. If you have several render passes, and the second pass depends on the texture rendered to in the first, you can (and must) add the proper synchronization primitives - but the GPU driver doesn't need to care why any synchronization is necessary at all, and it doesn't need to track any resource usage to find out; it can just do as it is told. And your application also doesn't need to track its own resource usage in many cases: the need for synchronization at some point is simply inherent in the usage as you coded it. There might still be cases where you need to track your own resource usage to find out, though, especially if you write some intermediate layer like a more high-level graphics library where you know less and less about the structure of the rendering - then you get into a position similar to what a GL driver has to do (unless you want to push all the burden of synchronization onto the users of your library, as Vulkan does).
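For concreteness, here is a hedged sketch of the kind of dependency the application states explicitly in the render-then-sample example above. The command buffer and image are assumed to already exist, and in a real renderer this is often expressed as a subpass dependency instead of a standalone barrier:

#include <vulkan/vulkan.h>

// Recorded between the pass that renders to the texture and the pass that samples it.
void recordRenderThenSampleDependency(VkCommandBuffer cmd, VkImage renderTarget) {
    VkImageMemoryBarrier barrier{};
    barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;    // what pass 1 did
    barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;               // what pass 2 will do
    barrier.oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = renderTarget;
    barrier.subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,   // producer stage
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,           // consumer stage
        0,
        0, nullptr,      // no global memory barriers
        0, nullptr,      // no buffer barriers
        1, &barrier);
}

The driver no longer has to deduce any of this; the application already knows that pass 2 reads what pass 1 wrote and says so directly.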
Update: This has been solved, you can find further details here: https://stackoverflow.com/a/64405505/1889253
A similar question was asked previously, but that question was initially focused on using multiple command buffers and triggering the submit across different threads to achieve parallel execution of shaders. Most of the answers suggest that the solution is to use multiple queues instead. The use of multiple queues also seems to be the consensus across various blog posts and Khronos forum answers. I have attempted those suggestions, running shader executions across multiple queues, but have not been able to see parallel execution, so I wanted to ask what I may be doing wrong. As suggested, this question includes the runnable code of multiple compute shaders being submitted to multiple queues, which hopefully can be useful for other people looking to do the same (once this is resolved).
The current implementation is in this pull request / branch, however I will cover the main Vulkan specific points, to ensure only Vulkan knowledge is required to answer this question. It's also worth mentioning that the current use-case is specifically for compute queues and compute shaders, not graphics or transfer queues (although insights/experience achieving parallelism across those would still be very useful, and would most probably also lead to the answer).
More specifically, I have the following:
Multiple queues are first "fetched" - my device is an NVIDIA GTX 1650, which supports 16 graphics+compute queues in queue family index 0, and 8 compute queues in queue family index 2
evalAsync performs the submission (which contains the recorded shader commands). Note that a fence is created here which we'll be able to use later. Also, the submit doesn't have any waitStageMasks (PipelineStageFlags).
evalAwait allows us to wait for the fence: when calling evalAwait, we wait on the created fence until the submission has finished (a stripped-down sketch of this fence pattern follows the list of points below)
A couple of points that are not visible in the examples above but are important:
All evalAsync run on the same application, instance and device
Each evalAsync executes with its own separate commandBuffer and buffers, and in a separate queue
If you are wondering whether memory barriers could have something to do with it: we tried removing all memoryBarriers completely (for example, the one that runs before shader execution), but this made no difference in performance
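For reference, here is a stripped-down sketch of the fence pattern behind evalAsync/evalAwait mentioned above (not the actual code from the branch; device, queue and command buffer are assumed to be created elsewhere):

#include <vulkan/vulkan.h>
#include <cstdint>

// evalAsync, reduced to its Vulkan essentials: submit and return a fence.
VkFence submitAsync(VkDevice device, VkQueue queue, VkCommandBuffer cmd) {
    VkFenceCreateInfo fenceInfo{};
    fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;

    VkFence fence;
    vkCreateFence(device, &fenceInfo, nullptr, &fence);

    VkSubmitInfo submit{};
    submit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.commandBufferCount = 1;
    submit.pCommandBuffers    = &cmd;   // no wait semaphores, no waitStageMasks

    vkQueueSubmit(queue, 1, &submit, fence);   // returns without waiting for the GPU
    return fence;
}

// evalAwait, reduced to its Vulkan essentials: block until the fence is signaled.
void awaitSubmission(VkDevice device, VkFence fence) {
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
    vkDestroyFence(device, fence, nullptr);
}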
The test that is used in the benchmark can be found here; however, the only key things to understand are:
This is the shader that we use for testing; as you can see, we just add a bunch of atomicAdd steps to increase the amount of processing time
Currently the test uses a small buffer size and a high number of shader loop iterations, but we have also tested with a large buffer size (i.e. 100,000 instead of 10) and a smaller iteration count (1,000 instead of 100,000,000).
When running the test, we first run a set of "synchronous" shader executions on the same queue (the number is variable, but we've tested with 6-16, the latter being the max number of queues). Then we run these in an asynchronous manner, where we submit all of them and then evalAwait until they are finished. When comparing the resulting times from both approaches, they take the same amount of time even though they run across different compute queues.
My questions are:
Am I currently missing something when fetching the queues?
Are there further parameters in the Vulkan setup that need to be configured to ensure asynchronous execution?
Are there any restrictions I may not be aware of, for example operating system processes only being able to submit GPU workloads to the GPU synchronously?
Would multithreading be required in order for parallel execution to work properly when dealing with multiple queue submissions?
Furthermore, I have found several useful resources online across various Reddit posts and Khronos Group forums that provide very in-depth conceptual and theoretical overviews of the topic, but I haven't come across end-to-end code examples that show parallel execution of shaders. If there are any practical examples out there that you can share, which have functioning parallel execution of shaders, that would be very helpful.
If there are further details or questions that can help provide further context please let me know, happy to answer them and/or provide more detail.
For completeness, my tests were using:
Vulkan SDK 1.2
Windows 10
NVIDIA 1650
Other relevant links that have been shared in similar posts:
Similar discussion with suggested link to example but which seems to have disappeared...
Post on Leveraging asynchronous queues for concurrent execution (unfortunately no example code)
(Relatively old - 5 years) Post that suggests NVIDIA cards can't do parallel execution of shaders, but doesn't seem to have a conclusive answer
Nvidia presentation on Vulkan Multithreading with multiple queue execution (hence my question above on threads)
You are getting "asynchronous execution". You just don't expect it to behave the way it behaves.
On a CPU, if you have one thread active, then you're using one CPU core (or hyper-thread). All of that core's execution and computation capabilities are given to your thread alone (ignoring pre-emption). But at the same time, if there are other cores, your one thread cannot use any of the computational resources of those cores. Not unless you create another thread.
GPUs don't work that way. A queue is not like a CPU thread. It does not specifically relate to a particular quantity of computational resources. A queue is merely the interface through which commands get executed; the underlying hardware decides how to farm out commands to the various compute resources provided by the GPU as a whole.
What generally happens when you execute a command is that the hardware attempts to fully saturate the available shader execution units using your command. If there are more shader units available than the number of invocations your operation requires, then some resources are available immediately for the next command. But if not, then the entire GPU's compute resources will be dedicated to executing the first operation; the second one must wait for resources to become available before it can start.
It doesn't matter how many compute queues you shove work into; they're all going to try to use as many compute resources as possible. So they will largely execute in some particular order.
Queue priority systems exist, but these mainly help determine the order of execution for commands. That is, if a high-priority queue has some commands that need to be executed, then they will take priority the next time compute resources become available for a new command.
So submitting 3 dispatch batches on 3 separate queues is not going to complete faster than submitting 1 batch on one queue containing 3 dispatch operations.
The main reason multiple queues (of the same family) exist is to be able to submit work from multiple threads without having them do inter-thread synchronization (and to provide some possible prioritization of submissions).
I have been able to solve this using the suggestion above. To provide further context, I was trying to submit commands to multiple queues within the same family; however, as pointed out in the linked suggestion, NVIDIA (and other GPU vendors) have a varying range of capabilities when it comes to parallel processing of command submissions.
In my particular case, the NVIDIA GTX 1650 card I was testing with only supports concurrent processing when workloads are submitted to different queue families - more specifically, it can only run one graphics-family submission and one compute-family submission concurrently.
I re-implemented the code to allow specific commands to be allocated to particular queue families, and I was able to achieve parallel processing (with a 2x speed improvement by submitting across two queue families).
Here is further detail on the implementation https://kompute.cc/overview/async-parallel.html
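For anyone reading along, here is a hedged sketch of the queue-family split this boils down to; physicalDevice is assumed to be chosen already, and the family indices in the comments are just what my GTX 1650 reports:

#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

struct QueueFamilies {
    uint32_t graphicsCompute = UINT32_MAX;
    uint32_t computeOnly     = UINT32_MAX;
};

QueueFamilies pickFamilies(VkPhysicalDevice physicalDevice) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, nullptr);
    std::vector<VkQueueFamilyProperties> props(count);
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, props.data());

    QueueFamilies out;
    for (uint32_t i = 0; i < count; ++i) {
        const VkQueueFlags flags = props[i].queueFlags;
        if (out.graphicsCompute == UINT32_MAX &&
            (flags & VK_QUEUE_GRAPHICS_BIT) && (flags & VK_QUEUE_COMPUTE_BIT))
            out.graphicsCompute = i;    // family 0 on my GTX 1650
        if (out.computeOnly == UINT32_MAX &&
            (flags & VK_QUEUE_COMPUTE_BIT) && !(flags & VK_QUEUE_GRAPHICS_BIT))
            out.computeOnly = i;        // family 2 on my GTX 1650
    }
    return out;
}

Each of the two families then gets its own VkDeviceQueueCreateInfo at device creation time, and the two workloads are recorded and submitted on queues from the two different families.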
In Vulkan, you can use vkCreateGraphicsPipeline or vkCreateComputePipeline to create pipeline derivates, with the basePipelineHandle or basePipelineIndex members of VkGraphicsPipelineCreateInfo/VkComputePipelineCreateInfo. The documentation states that this feature is available for performance reasons:
The goal of derivative pipelines is that they be cheaper to create using the parent as a starting point, and that it be more efficient (on either host or device) to switch/bind between children of the same parent.
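For concreteness, here is a hedged sketch of how that parent/child relationship is expressed for compute pipelines; the device, layout and shader stages are assumed to exist already:

#include <vulkan/vulkan.h>

void createParentAndChild(VkDevice device, VkPipelineLayout layout,
                          VkPipelineShaderStageCreateInfo parentStage,
                          VkPipelineShaderStageCreateInfo childStage,
                          VkPipeline* parent, VkPipeline* child) {
    VkComputePipelineCreateInfo parentInfo{};
    parentInfo.sType             = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
    parentInfo.flags             = VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT;  // opt in as a parent
    parentInfo.stage             = parentStage;
    parentInfo.layout            = layout;
    parentInfo.basePipelineIndex = -1;
    vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &parentInfo, nullptr, parent);

    VkComputePipelineCreateInfo childInfo = parentInfo;
    childInfo.flags              = VK_PIPELINE_CREATE_DERIVATIVE_BIT;  // mark as a child
    childInfo.stage              = childStage;                         // the part that differs
    childInfo.basePipelineHandle = *parent;
    childInfo.basePipelineIndex  = -1;
    vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &childInfo, nullptr, child);
}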
This raises quite a few questions for me:
Is there a way to indicate which state is shared between parent and child pipelines, or does the implementation decide?
Is there any way to know whether the implementation is actually getting any benefit from using derived pipelines (other than profiling)?
The parent pipeline needs to be created with VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT. Is there a downside to always using this flag (eg. in case you may create a derived pipeline from this one in the future)?
I came to this question investigating whether pipeline derivatives provide a benefit. Here are some resources I found from vendors:
Tips and Tricks: Vulkan Dos and Don’ts, Nvidia, June 6, 2019
Don’t expect speedup from Pipeline Derivatives.
Vulkan Usage Recommendations, Samsung
Pipeline derivatives let applications express "child" pipelines as incremental state changes from a similar "parent"; on some architectures, this can reduce the cost of switching between similar states. Many mobile GPUs gain performance primarily through pipeline caches, so pipeline derivatives often provide no benefit to portable mobile applications.
Recommendations
Create pipelines early in application execution. Avoid pipeline creation at draw time.
Use a single pipeline cache for all pipeline creation.
Write the pipeline cache to a file between application runs.
Avoid pipeline derivatives.
Vulkan Best Practice for Mobile Developers - Pipeline Management, Arm Software, Jul 11, 2019
Don't
Create pipelines at draw time without a pipeline cache (introduces performance stutters).
Use pipeline derivatives as they are not supported.
Vulkan Samples, LunarG, API-Samples/pipeline_derivative/pipeline_derivative.cpp
/*
VULKAN_SAMPLE_SHORT_DESCRIPTION
This sample creates pipeline derivative and draws with it.
Pipeline derivatives should allow for faster creation of pipelines.
In this sample, we'll create the default pipeline, but then modify
it slightly and create a derivative. The derivatve will be used to
render a simple cube.
We may later find that the pipeline is too simple to show any speedup,
or that replacing the fragment shader is too expensive, so this sample
can be updated then.
*/
It doesn't look like any vendor is actually recommending the use of pipeline derivatives, except maybe to speed up pipeline creation.
To me, that seems like a good idea in theory, for some hypothetical implementation, that doesn't amount to much in practice.
Also, if the driver is supposed to benefit from a common parent of multiple pipelines, it should be completely able to automate that ancestor detection. "Common ancestors" could be synthesized based on whichever specific common pipeline states provide the best speed-up. Why specify it explicitly through the API?
Is there a way to indicate which state is shared between parent and child pipelines
No; the pipeline creation API provides no way to tell it what state will change. The idea being that, since the implementation can see the parent's state, and it can see what you ask of the child's state, it can tell what's different.
Also, if there were such a way, it would only represent a way for you to accidentally misinform the implementation as to what changed. Better to just let the implementation figure out the changes.
Is there any way to know whether the implementation is actually getting any benefit from using derived pipelines (other than profiling)?
No.
The parent pipeline needs to be created with VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT. Is there a downside to always using this flag (eg. in case you may create a derived pipeline from this one in the future)?
Probably. Due to #1, the implementation needs to store at least some form of the parent pipeline's state, so that it can compare it to the child pipeline's state. And it must store this state in an easily readable form, which will probably not be the same form as the GPU memory and tokens to be copied into the command stream. As such, there's a good chance that parent pipelines will allocate additional memory for such data. Though the likelihood of them being slower at binding/command execution time is low.
You can test this easily enough by passing an allocator to the pipeline creation functions. If it allocates the same amount of memory as without the flag, then it probably isn't storing anything.
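If you want to try that experiment, here is a hedged sketch of a counting allocator. It is not production code, it only sees the host allocations the driver routes through the callbacks (drivers may still allocate internally), and the size-0 reallocation and failure paths are handled only minimally:

#include <vulkan/vulkan.h>
#include <atomic>
#include <cstdint>
#include <cstdlib>
#include <cstring>

static std::atomic<size_t> gBytesRequested{0};

namespace {

struct Header { void* raw; size_t size; };   // stored just in front of each user block

void* alignedAlloc(size_t size, size_t alignment) {
    if (alignment < alignof(Header)) alignment = alignof(Header);   // keep the header aligned too
    void* raw = std::malloc(size + alignment + sizeof(Header));
    if (!raw) return nullptr;
    uintptr_t user = (reinterpret_cast<uintptr_t>(raw) + sizeof(Header) + alignment - 1)
                     & ~(static_cast<uintptr_t>(alignment) - 1);
    Header* h = reinterpret_cast<Header*>(user) - 1;
    h->raw  = raw;
    h->size = size;
    return reinterpret_cast<void*>(user);
}

VKAPI_ATTR void* VKAPI_CALL cbAlloc(void*, size_t size, size_t alignment, VkSystemAllocationScope) {
    gBytesRequested += size;
    return alignedAlloc(size, alignment);
}

VKAPI_ATTR void VKAPI_CALL cbFree(void*, void* mem) {
    if (mem) std::free((reinterpret_cast<Header*>(mem) - 1)->raw);
}

VKAPI_ATTR void* VKAPI_CALL cbRealloc(void*, void* orig, size_t size, size_t alignment, VkSystemAllocationScope) {
    if (size == 0) { cbFree(nullptr, orig); return nullptr; }
    void* mem = alignedAlloc(size, alignment);
    if (mem && orig) {
        const Header* h = reinterpret_cast<Header*>(orig) - 1;
        std::memcpy(mem, orig, h->size < size ? h->size : size);
        cbFree(nullptr, orig);
    }
    gBytesRequested += size;
    return mem;
}

} // namespace

// Pass &countingAllocator to vkCreateGraphicsPipelines / vkCreateComputePipelines,
// once for a pipeline created with VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT and once
// without, then compare the two gBytesRequested totals.
static const VkAllocationCallbacks countingAllocator = {
    nullptr, cbAlloc, cbRealloc, cbFree, nullptr, nullptr
};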
I'm no expert in computer graphics, but my understanding (partly based on intuition) is the following:
Is there a way to indicate which state is shared between parent and child pipelines, or does the implementation decide?
There are certain aspects of the pipeline that are not specified at render time (and so are fixed), for example which shaders to use. My speculation is that the derived from and the derived pipelines likely share these "read-only" information (or in C terms, they point to the same object). That's why creation of derived pipelines is faster.
Switching between these pipelines would also be faster because there is less need to change resources on changing pipelines, because some of the resources are shared and the same.
The parent pipeline needs to be created with VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT. Is there a downside to always using this flag (eg. in case you may create a derived pipeline from this one in the future)?
This is very likely implementation-dependent. My speculation is that, when you allow derivatives, you enable resource (e.g. shader) sharing, which means the implementation is likely going to do reference counting for these resources. That would be an unnecessary cost if the resources are not going to be shared. Also, when changing pipelines, the driver wouldn't need to check whether each resource is shared and can stay on the GPU, or is not and needs changing. If there is no sharing, all resources would be changed, and there is no overhead of checking. None of these are that much of an overhead, so either Vulkan is staying on the safe side, or there is another reason I don't know about.
The OpenGL wiki page on Performance says:
"OpenGL implementations are almost always pipelined - that is to say,
things are not necessarily drawn when you tell OpenGL to draw them -
and the fact that an OpenGL call returned doesn't mean it finished
rendering."
Since it says "almost", there must be some implementations that are not pipelined.
Here I find one:
OpenGL Pixel Buffer Object (PBO)
"Conventional glReadPixels() blocks the pipeline and waits until all
pixel data are transferred. Then, it returns control to the
application. On the contrary, glReadPixels() with PBO can schedule
asynchronous DMA transfer and returns immediately without stall.
Therefore, the application (CPU) can execute other process right away,
while transferring data with DMA by OpenGL (GPU)."
So this means conventional glReadPixels() (not with PBO) blocks the pipeline.
But I cannot actually tell this from the OpenGL reference page for glReadPixels.
Then I am wondering:
which OpenGL implementations are not pipelined?
How about glDrawArrays?
The OpenGL specification itself does not define the term "pipeline"; it speaks of a "command stream". The runtime behavior of command stream execution is deliberately left open, to give implementors maximal flexibility.
The important term is "OpenGL synchronization point": https://www.opengl.org/wiki/Synchronization
Here I find one: (Link to songho article)
Note that this is not an official OpenGL specification resource. The wording "blocks the OpenGL pipeline" is a bit unfortunate, because it turns the actual blocking and bottleneck "upside down". Essentially it means that glReadPixels can only return once all the commands leading up to the image it will fetch have been executed.
So this means conventional glReadPixels() (not with PBO) blocks the pipeline. But actually in OpenGL reference of glReadPixels I cannot tell the fact.
Actually it's not the OpenGL pipeline that gets blocked, but the execution of the program on the CPU. It means that the GPU sees no further commands coming from the CPU. So the pipeline doesn't get "blocked" but in fact drained. When a pipeline drains, or needs to be restarted, one says the pipeline has been stalled (i.e. the flow in the pipeline came to a halt).
From the GPU's point of view everything happens at maximum throughput: render everything up to the point where glReadPixels got called, then do a DMA transfer; unfortunately, no further commands are available after initiating the transfer.
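By contrast, the PBO variant quoted in the question returns control immediately. A hedged sketch of that pattern (assuming a GL 2.1+ context with a loader such as GLEW already set up, an RGBA8 framebuffer, and no error handling):

#include <GL/glew.h>

GLuint readbackPbo = 0;

void initReadbackPbo(int width, int height) {
    glGenBuffers(1, &readbackPbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, readbackPbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

void startAsyncReadback(int width, int height) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, readbackPbo);
    // With a pack PBO bound, the last argument is an offset into the buffer, so
    // glReadPixels only schedules a DMA transfer and returns immediately.
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    // ... the CPU is free to do other work here while the transfer runs ...
}

void consumeReadback() {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, readbackPbo);
    // Mapping is the synchronization point: it blocks only if the transfer is not done yet.
    const void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    // ... use pixels ...
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}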
How about glDrawArrays?
glDrawArrays returns immediately after the data has been queued and the necessary preparations have been made.
Actually it means that this specific operation can't be pipelined, because all data needs to be transferred before the function returns; it doesn't mean other things can't be.
Operations like that are said to stall the pipeline. One function that will always stall the pipeline is glFinish.
Usually, when a function returns a value, such as the contents of a buffer, it will induce a stall.
Depending on the driver implementation, creating programs, buffers and the like can be done without stalling.
Then I am wondering: which OpenGL implementations are not pipelined?
I could imagine that a pure software implementation might not be pipelined. Not much reason to queue up work if you end up executing it on the same CPU. Unless you wanted to take advantage of multi-threading.
But it's probably safe to say that any OpenGL implementation that uses dedicated hardware (commonly called GPU) will be pipelined. This allows the CPU and GPU to work in parallel, which is critical to get good system performance. Also, submitting work to the GPU incurs a certain amount of overhead, so it's beneficial to queue up work, and then submit it in larger batches.
But actually in OpenGL reference of glReadPixels I cannot tell the fact.
True. The man pages don't directly specify which calls cause a synchronization. In general, anything that returns values/data produced by the GPU causes synchronization. Examples that come to mind:
glFinish(). Explicitly requires a full synchronization, which is actually its only purpose.
glReadPixels(), in the non PBO case. The GPU has to finish rendering before you can read back the result.
glGetQueryObjectiv(id, GL_QUERY_RESULT, ...). Blocks until the GPU reaches the point where the query was submitted.
glClientWaitSync(). Waits until the GPU reaches the point where the corresponding glFenceSync() was submitted (a minimal sketch of this pattern follows the list).
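To make that last item concrete, here is a minimal sketch of the sync-object pattern (OpenGL 3.2+, loader assumed to be set up):

#include <GL/glew.h>

void drawAndWaitForCompletion() {
    // ... issue draw calls ...

    // Insert a fence right after the commands we care about.
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

    // Block the CPU until the GPU has passed the fence. GL_SYNC_FLUSH_COMMANDS_BIT
    // makes sure the fence actually reaches the GPU before we start waiting.
    GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                                     1000000000ull);   // 1 second timeout, in nanoseconds
    // status is GL_ALREADY_SIGNALED or GL_CONDITION_SATISFIED on success,
    // GL_TIMEOUT_EXPIRED or GL_WAIT_FAILED otherwise.
    (void)status;

    glDeleteSync(fence);
}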
Note that there can be different types of synchronizations that are not directly tied to specific OpenGL calls. For example, in the case where the whole workload is GPU limited, the CPU would queue up an infinite amount of work unless there is some throttling. So the driver will block the CPU at more or less arbitrary points to let the GPU catch up to a certain point. This could happen at frame boundaries, but it does not have to be. Similar synchronization can be necessary if memory runs low, or if internal driver resources are exhausted.
If I plan to use multithreading in OpenGL, should I have separate buffers (from glGenBuffers) for each context?
I do not know much about OpenGL multithreading yet (for now I work in a "single" thread). I need to know whether I can share buffers already pushed to video memory (with glBufferData/glBufferSubData), or whether I have to keep a copy of the buffer for the other thread.
You do not want to use several contexts with several threads. You really don't.
While this sounds like a good idea, in practice multi-context multi-threading is complicated, troublesome, and badly supported on the driver side, and it only marginally improves (possibly even reduces!) performance.
What you really want is to have only one thread talk to OpenGL (with one context, obviously), map a buffer, and pass the memory pointer to another thread, preferably using 3 buffers (3 sub-buffers of a 3x-sized buffer) with immutable storage and persistent mapping, if this is available.
That, and doing indirect render calls, where a second thread feeds the buffers the indirect call reads from.
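A hedged sketch of the persistent-mapping part (GL 4.4 / ARB_buffer_storage; the per-section fencing you need in real code, so the GPU is not still reading the section being overwritten, is omitted for brevity):

#include <GL/glew.h>
#include <cstddef>

struct StreamingBuffer {
    GLuint vbo         = 0;
    void*  ptr         = nullptr;   // stays valid for the lifetime of the buffer
    size_t sectionSize = 0;         // size of one of the 3 sub-buffers
};

StreamingBuffer createStreamingBuffer(size_t sectionSize) {
    StreamingBuffer b;
    b.sectionSize = sectionSize;

    const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;

    glGenBuffers(1, &b.vbo);
    glBindBuffer(GL_ARRAY_BUFFER, b.vbo);
    glBufferStorage(GL_ARRAY_BUFFER, 3 * sectionSize, nullptr, flags);     // immutable storage
    b.ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, 3 * sectionSize, flags);  // mapped once, kept mapped
    return b;
}

// Worker thread: writes frame N's data into section (N % 3) through b.ptr, with no GL calls.
// GL thread: once the worker signals completion, draws from that section, e.g. with a vertex
// offset of (N % 3) * sectionSize; the indirect-draw parameter buffer can be fed the same way.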
Further info on the persistent mapping topic: See in particular slides 22-25 of this GDC2014 presentation, which is basically a remake of Cass Everitt's 2013 SIGGRAPH talk.
See also Everitt's original talk: Beyond porting.
VAOs aren't shared, so you'll need to generate a new VAO for each object per context, or else the behavior will become unpredictable and incorrect upon deletion/creation of a new one. This can be a major source of error. VBOs can be shared, so you just need one VBO per object.
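A hedged sketch of that rule, assuming two contexts created as shared (e.g. via wglShareLists or the share argument of glXCreateContext):

#include <GL/glew.h>
#include <cstddef>

GLuint sharedVbo = 0;    // buffer objects are shared between the contexts

void createSharedVbo(const float* vertices, size_t bytes) {
    glGenBuffers(1, &sharedVbo);
    glBindBuffer(GL_ARRAY_BUFFER, sharedVbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, vertices, GL_STATIC_DRAW);
}

// Call once per context that will draw the object, with that context current,
// because VAOs are container objects and are NOT shared.
GLuint createVaoForCurrentContext() {
    GLuint vao = 0;
    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);
    glBindBuffer(GL_ARRAY_BUFFER, sharedVbo);   // the shared VBO is visible from both contexts
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), nullptr);
    glBindVertexArray(0);
    return vao;
}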