Parallel compute shader execution in Vulkan? - c++

I have several compute shaders (let's call them compute1, compute2 and so on) that have several input bindings (defined in shader code as layout (...) readonly buffer) and several output bindings (defined as layout (...) writeonly buffer). I'm binding buffers with data to their descriptor sets and then trying to execute these shaders in parallel.
What I've tried:
vkQueueSubmit() with VkSubmitInfo.pCommandBuffers holding several primary command buffers (one per compute shader);
vkQueueSubmit() with VkSubmitInfo.pCommandBuffers holding one primary command buffer that was recorded using vkCmdExecuteCommands() with pCommandBuffers holding several secondary command buffers (one per compute shader);
Separate vkQueueSubmit()+vkQueueWaitIdle() calls from different std::thread objects (one per compute shader) - each command buffer is allocated in a separate VkCommandPool and submitted to its own VkQueue with its own VkFence; the main thread waits using threads[0].join(); threads[1].join(); and so on;
Separate vkQueueSubmit() calls from different detached std::thread objects (one per compute shader) - each command buffer is allocated in a separate VkCommandPool and submitted to its own VkQueue with its own VkFence; the main thread waits using vkWaitForFences() with pFences holding the fences that were used in vkQueueSubmit() and with waitAll set to true.
What I've got:
In all cases the resulting time is almost the same (the difference is less than 1%), as if I had called vkQueueSubmit()+vkQueueWaitIdle() for compute1, then for compute2, and so on.
I want to bind the same buffers as inputs for several shaders, but judging by the timings, the result is the same even when each shader gets its own VkBuffer+VkDeviceMemory objects.
So my question is:
Is it possible to somehow execute several compute shaders simultaneously, or does command buffer parallelism work for graphics shaders only?
Update: Test application was compiled using LunarG Vulkan SDK 1.1.73.0 and running on Windows 10 with NVIDIA GeForce GTX 960.

This depends on the hardware you are executing your application on. Hardware exposes queues which process submitted commands. Each queue, as the name suggests, executes commands in order, one after another. So if you submit multiple command buffers to a single queue, they will be executed in order of submission. Internally, the GPU can try to parallelize execution of some parts of the submitted commands (for example, separate stages of the graphics pipeline can be processed at the same time). But in general, a single queue processes commands sequentially, and it doesn't matter whether you are submitting graphics or compute commands.
In order to execute multiple command buffers in parallel, you need to submit them to separate queues. But the hardware must support multiple queues - it must have separate, physical queues - in order to be able to process them concurrently.
What's more important - I've read that some graphics hardware vendors simulate multiple queues through their drivers. In other words, they expose multiple queues in Vulkan, but internally those are processed by a single physical queue. I think that's the case with your issue here; the results of your experiments would seem to confirm this (though I can't be sure, of course).
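For reference, here is a minimal sketch of the separate-queue approach (essentially what you already tried - cmd1, cmd2, computeFamilyIndex, and the fences are placeholders, and the device must have been created with two queues in that family). Even with this setup, the two submissions only overlap if the driver backs the queues with distinct hardware queues:

```cpp
// Sketch only: assumes a VkDevice created with two queues from a compute-
// capable family, and two fully recorded command buffers cmd1/cmd2.
VkQueue queue0 = VK_NULL_HANDLE, queue1 = VK_NULL_HANDLE;
vkGetDeviceQueue(device, computeFamilyIndex, 0, &queue0);
vkGetDeviceQueue(device, computeFamilyIndex, 1, &queue1);

VkSubmitInfo submit{};
submit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.commandBufferCount = 1;

submit.pCommandBuffers = &cmd1;
vkQueueSubmit(queue0, 1, &submit, fence1);  // no vkQueueWaitIdle() here,
submit.pCommandBuffers = &cmd2;             // so both are in flight at once
vkQueueSubmit(queue1, 1, &submit, fence2);

VkFence fences[] = {fence1, fence2};
vkWaitForFences(device, 2, fences, VK_TRUE, UINT64_MAX);
```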

Related

Use OpenGL commands from a different thread

I have two threads: one main thread (OpenGL) for 3D rendering and one thread for logic. How should I connect the threads if I want to create a box mesh in the rendering thread when the request comes from the logic thread?
In this case the logic thread would use OpenGL commands, which is not possible because every OpenGL command should only be executed in the main thread. I know that I cannot share an OpenGL context across different threads (which seems to be a bad idea anyway), so how should I solve this problem? Does there exist some general-purpose design pattern or something else? Thanks.
You could implement a draw-command queue. Each draw command contains whatever is needed to make the required OpenGL calls. Each frame, the rendering thread empties the current queue and processes the commands. Any other thread prepares its own commands and enqueues them at any time for the next frame.
Very primitive draw commands can be implemented as a class hierarchy with a virtual Draw method. Of course this is not a small change at all, but modern engines adopt this approach, albeit in a much more advanced form. It can be efficient if the subsystems that submitted their command objects re-use them in the next frame, including their buffers. That way each submodule constantly prepares and updates its draw command but submits it only when it should be rendered, based on some logic.
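A minimal sketch of such a queue (class and method names are illustrative, not from any particular engine):

```cpp
#include <memory>
#include <mutex>
#include <vector>

struct DrawCommand {
    virtual ~DrawCommand() = default;
    virtual void Draw() = 0;  // issues the actual OpenGL calls
};

class DrawCommandQueue {
public:
    void Enqueue(std::unique_ptr<DrawCommand> cmd) {  // called from any thread
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(cmd));
    }
    void DrainAndDraw() {  // called once per frame from the rendering thread
        std::vector<std::unique_ptr<DrawCommand>> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(pending_);  // keep the lock short; draw outside it
        }
        for (auto& cmd : batch) cmd->Draw();
    }
private:
    std::mutex mutex_;
    std::vector<std::unique_ptr<DrawCommand>> pending_;
};
```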
There are various ways to approach this. One is to implement a command queue with the logic thread being the command producer and the rendering thread the consumer.
Another approach is to make use of an auxiliary OpenGL context, which is set up to share data with the primary OpenGL context. You can have both contexts current at the same time in different threads, and from OpenGL 3.x core onward you can make a context current without a drawable. You can then use the auxiliary context to load new data, map buffers, and so on.
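As a concrete illustration, with GLFW (an assumption here - the same idea works with raw WGL/GLX/EGL) the shared-context setup could look roughly like this:

```cpp
#include <GLFW/glfw3.h>
#include <thread>

// Sketch: assumes glfwInit() has already succeeded.
void setupSharedContexts() {
    GLFWwindow* mainWin = glfwCreateWindow(1280, 720, "app", nullptr, nullptr);

    glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);  // aux context needs no visible window
    GLFWwindow* auxWin = glfwCreateWindow(1, 1, "loader", nullptr, mainWin);  // share with mainWin

    std::thread loader([auxWin] {
        glfwMakeContextCurrent(auxWin);  // current on the worker thread only
        // ... create buffers/textures, upload or map data here; flush so the
        // objects become visible to the main context ...
        glfwMakeContextCurrent(nullptr);
    });
    loader.detach();

    glfwMakeContextCurrent(mainWin);  // the main thread keeps rendering
}
```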

Device-to-device copy in Vulkan

I want to copy an image/buffer between two GPUs/physical devices in my Vulkan application (one VkInstance, two VkDevices). Is this possible without staging the image on the CPU, or is there a feature like CUDA p2p? How would this look?
If staging on the host is required, what would be the optimal method for this?
is there a feature like CUDA p2p?
Vulkan 1.1 supports the concept of device groups to cover this situation.
It allows you to treat a set of physical devices as a single logical device, and also lets you query how memory can be manipulated within the device group, as well as do things like allocate memory on a subset of devices. Check the specifications for the full set of functionality.
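A sketch of how the query side could look on a Vulkan 1.1 instance (vkEnumeratePhysicalDeviceGroups is core in 1.1):

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

void listDeviceGroups(VkInstance instance) {
    uint32_t count = 0;
    vkEnumeratePhysicalDeviceGroups(instance, &count, nullptr);
    std::vector<VkPhysicalDeviceGroupProperties> groups(
        count, {VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_GROUP_PROPERTIES});
    vkEnumeratePhysicalDeviceGroups(instance, &count, groups.data());

    for (const auto& g : groups) {
        // A group with physicalDeviceCount > 1 can be turned into a single
        // logical device by chaining VkDeviceGroupDeviceCreateInfo into
        // vkCreateDevice.
        std::printf("device group with %u physical device(s)\n",
                    g.physicalDeviceCount);
    }
}
```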
Is this possible without staging the image on the CPU
If your devices don't support the extension VK_KHR_device_group, then no. You must transfer the content through the CPU and system memory.
Since buffers are per-device, you would need two host-visible staging buffers: one for the read operation and another for the write operation. You'll also need two queues, two command buffers, and so on.
You'll have to execute three operations with manual synchronization between them (sketched below):
1. On the source GPU, copy from the device-local buffer to that device's host-visible buffer.
2. On the CPU, copy from the source GPU's host-visible buffer to the target GPU's host-visible buffer.
3. On the target GPU, copy from the host-visible buffer to the device-local buffer.
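A heavily abbreviated sketch of those three steps (all handles and size are placeholders; command buffer begin/end, submits, fences, and memory barriers are omitted, and the staging memory is assumed to be HOST_COHERENT):

```cpp
// 1) Source GPU: device-local -> host-visible staging (recorded on srcCmd,
//    then submitted to the source device's queue and waited on with a fence).
VkBufferCopy region{0, 0, size};  // srcOffset, dstOffset, size
vkCmdCopyBuffer(srcCmd, srcDeviceLocalBuf, srcStagingBuf, 1, &region);

// 2) CPU: copy between the two mapped staging allocations.
void* srcPtr = nullptr;
void* dstPtr = nullptr;
vkMapMemory(srcDevice, srcStagingMem, 0, size, 0, &srcPtr);
vkMapMemory(dstDevice, dstStagingMem, 0, size, 0, &dstPtr);
memcpy(dstPtr, srcPtr, size);
vkUnmapMemory(srcDevice, srcStagingMem);
vkUnmapMemory(dstDevice, dstStagingMem);

// 3) Target GPU: host-visible staging -> device-local (recorded on dstCmd,
//    submitted to the target device's queue, waited on with a fence).
vkCmdCopyBuffer(dstCmd, dstStagingBuf, dstDeviceLocalBuf, 1, &region);
```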
Make sure to inspect your device queue family properties and if possible use a queue from a queue family that is marked as transfer capable but not graphics or compute capable. The fewer flags a Vulkan queue family has, the better suited it is to the operations that it does have flags for. Most modern discrete GPUs have dedicated transfer queues, but again, queues are specific to devices, so you'll need to be interacting with one queue for each device to execute the transfer.
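A small sketch of that queue-family selection (the fallback policy at the end is just one reasonable choice):

```cpp
#include <vulkan/vulkan.h>
#include <vector>

// Prefer a family that is transfer-capable but neither graphics- nor
// compute-capable; such "dedicated transfer" families usually map to the
// GPU's copy engines.
uint32_t pickTransferQueueFamily(VkPhysicalDevice phys) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, nullptr);
    std::vector<VkQueueFamilyProperties> props(count);
    vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, props.data());

    for (uint32_t i = 0; i < count; ++i) {
        const VkQueueFlags f = props[i].queueFlags;
        if ((f & VK_QUEUE_TRANSFER_BIT) &&
            !(f & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)))
            return i;
    }
    return 0;  // fall back: graphics/compute families support transfer too
}
```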
If staging on the host is required, what would be the optimal method for this?
Exactly how to execute this depends on your use case. If you want to execute the whole thing synchronously in a single thread, then you'll just be doing a bunch of submits and then waiting on fences. If you want to do it asynchronously in the background while you continue to render frames, then you'll still be doing the submits, but you'll have to do non-blocking checks on the fences (e.g. with vkGetFenceStatus) to see when operations complete before you move on to the next part.
If you're transferring buffers, there's probably nothing to worry about in terms of optimal transfer, but if you're dealing with images then you have to get into the whole linear vs. optimal image tiling mess. To avoid that, I'd suggest using host-visible buffers for staging regardless of whether you're transferring images or buffers, and using vkCmdCopyImageToBuffer and vkCmdCopyBufferToImage to do the transfers between device-local and host-visible memory.
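A sketch of what those two copy commands look like (the command buffers, images, staging buffers, and width/height are placeholders, and the images are assumed to already be in the right transfer layouts):

```cpp
VkBufferImageCopy region{};
region.imageSubresource = {VK_IMAGE_ASPECT_COLOR_BIT, /*mip*/ 0,
                           /*baseLayer*/ 0, /*layerCount*/ 1};
region.imageExtent      = {width, height, 1};
// bufferRowLength/bufferImageHeight of 0 mean "tightly packed".

// Source device: image -> host-visible staging buffer.
vkCmdCopyImageToBuffer(srcCmd, srcImage, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                       srcStagingBuf, 1, &region);

// Target device: host-visible staging buffer -> image.
vkCmdCopyBufferToImage(dstCmd, dstStagingBuf, dstImage,
                       VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);
```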

OpenGL: multithreaded glFlushMappedBufferRange?

I know that multi threaded OpenGL is a delicate topic and I am not trying here to render from multiple threads. I also do not try to create multiple contexts and share objects with share lists. I have a single context and I issue draw commands and gl state changes only from the main thread.
However, I am dynamically updating parts of a VBO in every frame. I only write to the VBO, I do not need to read it on the CPU side. I use glMapBufferRange so I can compute the changed data on the fly and don't need an additional copy (which would be created by the blocking glBufferSubData).
It works, and now I would like to multithread the data update (since it needs to update a lot of vertices at a steady 90 fps) and use a persistently mapped buffer (using GL_MAP_PERSISTENT_BIT). This requires issuing glFlushMappedBufferRange whenever a worker thread finishes updating part of the mapped buffer.
Is it fine to call glFlushMappedBufferRange from a separate thread? The ranges the different threads operate on do not overlap. Is there any overhead or implicit synchronisation involved in doing so?
No, you need to call glFlushMappedBufferRange in the thread that does the OpenGL work.
To overcome this you have two options:
get the OpenGL context and make it current in the worker thread, which means the OpenGL thread has to relinquish the context first for this to work;
push the relevant range into a thread-safe queue and let the OpenGL thread pop each range from it and call glFlushMappedBufferRange (see the sketch below).
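A minimal sketch of the second option (assumes a persistently mapped buffer created with GL_MAP_FLUSH_EXPLICIT_BIT and a GL loader header already included; names are illustrative):

```cpp
#include <mutex>
#include <vector>

struct DirtyRange { GLintptr offset; GLsizeiptr length; };

std::mutex              g_rangeMutex;
std::vector<DirtyRange> g_dirtyRanges;

// Worker thread: call after finishing writes to [offset, offset + length).
void markRangeDirty(GLintptr offset, GLsizeiptr length) {
    std::lock_guard<std::mutex> lock(g_rangeMutex);
    g_dirtyRanges.push_back({offset, length});
}

// GL thread: call once per frame, with the buffer still bound to `target`,
// before issuing the draws that consume the updated data.
void flushDirtyRanges(GLenum target) {
    std::vector<DirtyRange> batch;
    {
        std::lock_guard<std::mutex> lock(g_rangeMutex);
        batch.swap(g_dirtyRanges);
    }
    for (const DirtyRange& r : batch)
        glFlushMappedBufferRange(target, r.offset, r.length);
}
```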

Running CUDA and OpenGL in parallel without using interoperability

I am building a real-time signal processing and display system using an NVIDIA Tesla C2050 GPU. The design is such that the signal processing part runs as a separate program and does all the computation using CUDA. In parallel, if needed, I can start a separate display program which displays the processed signal using OpenGL. Since the design was to run the processes as independent processes, I do not have any CUDA-OpenGL interoperability. The two programs exchange data with each other over a UNIX stream socket.
The signal processing program spends most of its time using the GPU for the CUDA work. I am refreshing my frame in OpenGL every 50 ms, while the CUDA program runs for roughly 700 ms per run, and two sequential runs are usually separated by 30-40 ms. When I run the programs one at a time (i.e. only the CUDA or the OpenGL part is running), everything works perfectly. But when I start the programs together, the display is not what it is supposed to be, while the CUDA part produces the correct output. I have checked the socket implementation and I am fairly confident that the sockets are working correctly.
My question is: since I have a single GPU, no CUDA-OpenGL interoperability, and both processes use the GPU regularly, is it possible that the context switching between the CUDA kernel and the OpenGL rendering is causing them to interfere with each other? Should I change the design to have a single program run both parts with CUDA-OpenGL interoperability?
Devices of compute capability 5.0 and below cannot run graphics and compute concurrently. The Tesla C2050 does not support any form of pre-emption, so while a CUDA kernel is executing, the GPU cannot be used to render the OpenGL commands. CUDA-OpenGL interop does not solve this issue.
If you have a single GPU, then the best option is to break the CUDA kernels into shorter launches so that the GPU can switch between compute and graphics. In the aforementioned case, each CUDA kernel should not execute for more than 50 ms - GLRenderTime.
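For illustration, a sketch of chunked launches (the processSignal kernel and the sizes are hypothetical; the point is just that each individual launch stays well under the frame budget):

```cpp
// Instead of one ~700 ms launch over all elements...
const int totalElems = 1 << 24;
const int chunkElems = totalElems / 64;  // tune so one launch takes a few ms

for (int offset = 0; offset < totalElems; offset += chunkElems) {
    // ...many short launches; between them the driver can schedule the
    // OpenGL work of the other process.
    processSignal<<<chunkElems / 256, 256>>>(d_data, offset, chunkElems);
}
cudaDeviceSynchronize();  // wait once at the end of the batch
```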
Using a second GPU to do the graphics rendering would be the better option.

Compute Shader basics in dx11

I am about to add compute shader support to my codebase and having problems finding answers to some pretty basic questions:
All documentation out there says that the compute shader pipeline runs independently of the rendering pipeline; however, all DX11 sample code uses the device context interface to set the shader itself and the resource views and to call the Dispatch() method. So do these get queued up in the command buffer with the rest of the rendering commands, or do they get executed independently?
Following up on question 1, can I invoke compute shaders from multiple threads or do I need to buffer all compute shader commands and issue them on the thread that the immediate device context was created on?
Synchronization. Most articles use the CopyResource command, which will automatically synchronize on compute shader completion and give the CPU access to the results, but it seems like that would block the GPU as well. Is there a more efficient way to synchronize?
I know I could find answers to this by experimenting, but any help that saves me time would be appreciated.
The compute shader pipeline runs independently from the rendering pipeline, i.e. vertex shaders, pixel shaders, blend states, etc. have no effect on what happens when you call Dispatch(). However, both go into the same queue, so ordering between calls to Draw and Dispatch is preserved.
All calls to the immediate context must be done from a single thread.
One common approach is to use two buffers. While one is being operated on by the compute shader, the other is being copied back and read by the CPU. Most GPUs will be able to parallelize this.
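A rough sketch of that ping-pong scheme in D3D11 (ctx, the buffers, and the UAVs are placeholders; bufferB/stagingB hold last frame's results, and stagingB is a D3D11_USAGE_STAGING copy with CPU read access):

```cpp
// GPU: compute into buffer A this frame.
ctx->CSSetShader(computeShader, nullptr, 0);
ctx->CSSetUnorderedAccessViews(0, 1, &uavA, nullptr);
ctx->Dispatch(groupCountX, 1, 1);

// CPU: read back buffer B, which the compute shader filled last frame.
ctx->CopyResource(stagingB, bufferB);
D3D11_MAPPED_SUBRESOURCE mapped = {};
if (SUCCEEDED(ctx->Map(stagingB, 0, D3D11_MAP_READ, 0, &mapped))) {
    // ... consume mapped.pData; Map may still wait on the copy itself,
    // but the GPU can keep working on buffer A in the meantime ...
    ctx->Unmap(stagingB, 0);
}

std::swap(bufferA, bufferB);  // swap roles for the next frame
std::swap(uavA, uavB);        // (and the matching views)
```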