Device to device copy in Vulkan - C++

I want to copy an image/buffer between two GPUs/physical devices in my Vulkan application (one VkInstance, two VkDevices). Is this possible without staging the image on the CPU, or is there a feature like CUDA p2p? How would this look?
If staging on the host is required, what would be the optimal method for this?

is there a feature like CUDA p2p?
Vulkan 1.1 supports the concept of device groups to cover this situation.
It allows you to treat a set of physical devices as a single logical device, and also lets you query how memory can be manipulated within the device group, as well as do things like allocate memory on a subset of devices. Check the specifications for the full set of functionality.
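As an illustrative sketch (not production code), querying the available device groups on a Vulkan 1.1 instance looks roughly like this; error handling is omitted and "instance" is assumed to have been created with apiVersion >= VK_API_VERSION_1_1:

    #include <vulkan/vulkan.h>
    #include <vector>

    std::vector<VkPhysicalDeviceGroupProperties> enumerateDeviceGroups(VkInstance instance)
    {
        uint32_t groupCount = 0;
        vkEnumeratePhysicalDeviceGroups(instance, &groupCount, nullptr);

        std::vector<VkPhysicalDeviceGroupProperties> groups(groupCount);
        for (auto &g : groups) {
            g.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_GROUP_PROPERTIES;
            g.pNext = nullptr;
        }
        vkEnumeratePhysicalDeviceGroups(instance, &groupCount, groups.data());

        // A group with physicalDeviceCount > 1 is an SLI/NVLink-style group.
        // vkGetDeviceGroupPeerMemoryFeatures, called on a logical device created
        // from such a group, reports which peer-to-peer copies/accesses work.
        return groups;
    }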
Is this possible without staging the image on the CPU
If your devices don't support the extension VK_KHR_device_group, then no. You must transfer the content through the CPU and system memory.
Since buffers are per-device, you would need two host-visible staging buffers, one for the read operation, and another for the write operation. You'll also need two queues, two command buffers, etc, etc...
You'll have to execute three operations with manual synchronization (a sketch of step 2, the host-side copy, follows the list):
1. On the source GPU, execute a copy from the device-local buffer to the host-visible buffer for the same device.
2. On the CPU, copy from the source GPU's host-visible buffer to the target GPU's host-visible buffer.
3. On the target GPU, copy from the host-visible buffer to the device-local buffer.
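A minimal sketch of step 2, the host-side hop, assuming both staging allocations are host-visible, the sizes match, and the source GPU's copy (step 1) has already been waited on; the names are illustrative:

    #include <vulkan/vulkan.h>
    #include <cstring>

    void hostCopyBetweenStagingBuffers(VkDevice srcDevice, VkDeviceMemory srcStagingMemory,
                                       VkDevice dstDevice, VkDeviceMemory dstStagingMemory,
                                       VkDeviceSize size)
    {
        void *srcPtr = nullptr;
        void *dstPtr = nullptr;
        vkMapMemory(srcDevice, srcStagingMemory, 0, size, 0, &srcPtr);
        vkMapMemory(dstDevice, dstStagingMemory, 0, size, 0, &dstPtr);

        // If the memory types are not HOST_COHERENT, wrap this memcpy with
        // vkInvalidateMappedMemoryRanges (source) and vkFlushMappedMemoryRanges
        // (destination).
        std::memcpy(dstPtr, srcPtr, static_cast<size_t>(size));

        vkUnmapMemory(srcDevice, srcStagingMemory);
        vkUnmapMemory(dstDevice, dstStagingMemory);
    }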
Make sure to inspect your device queue family properties and if possible use a queue from a queue family that is marked as transfer capable but not graphics or compute capable. The fewer flags a Vulkan queue family has, the better suited it is to the operations that it does have flags for. Most modern discrete GPUs have dedicated transfer queues, but again, queues are specific to devices, so you'll need to be interacting with one queue for each device to execute the transfer.
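A sketch of that queue-family selection, assuming the physical device has already been chosen (illustrative only):

    #include <vulkan/vulkan.h>
    #include <cstdint>
    #include <vector>

    uint32_t findTransferQueueFamily(VkPhysicalDevice physicalDevice)
    {
        uint32_t count = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, nullptr);
        std::vector<VkQueueFamilyProperties> families(count);
        vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, families.data());

        uint32_t fallback = UINT32_MAX;
        for (uint32_t i = 0; i < count; ++i) {
            const VkQueueFlags flags = families[i].queueFlags;
            const bool graphicsOrCompute = flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT);
            if (!(flags & VK_QUEUE_TRANSFER_BIT) && !graphicsOrCompute)
                continue;                    // cannot transfer at all
            if (!graphicsOrCompute)
                return i;                    // dedicated transfer family, ideal
            if (fallback == UINT32_MAX)
                fallback = i;                // graphics/compute families can also transfer
        }
        return fallback;
    }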
If staging on the host is required, what would be the optimal method for this?
Exactly how to execute this depends on your use case. If you want to execute the whole thing synchronously in a single thread, then you'll just be doing a bunch of submits and then waiting on fences. If you want to do it asynchronously in the background while you continue to render frames, then you'll still be doing the submits, but you'll have to check the fences in a non-blocking way (e.g. with vkGetFenceStatus) to see when operations complete before you move on to the next part.
If you're transferring buffers, there's probably nothing to worry about in terms of optimal transfer, but if you're dealing with images then you get into the whole linear vs. optimal image tiling mess. To avoid that, I'd suggest using host-visible buffers for staging regardless of whether you're transferring images or buffers, and using vkCmdCopyImageToBuffer and vkCmdCopyBufferToImage to do the transfers between device-local and host-visible memory.
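For example, recording the source-side image-to-staging copy could look roughly like this; the image is assumed to already be in VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL, the command buffer is recording, and the names are placeholders:

    #include <vulkan/vulkan.h>

    void recordImageReadback(VkCommandBuffer cmd, VkImage srcImage,
                             VkBuffer stagingBuffer, uint32_t width, uint32_t height)
    {
        VkBufferImageCopy region{};
        region.bufferOffset = 0;
        region.bufferRowLength = 0;      // 0 = tightly packed rows
        region.bufferImageHeight = 0;
        region.imageSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 };
        region.imageOffset = { 0, 0, 0 };
        region.imageExtent = { width, height, 1 };

        vkCmdCopyImageToBuffer(cmd, srcImage, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                               stagingBuffer, 1, &region);
    }

The same VkBufferImageCopy can be reused for the vkCmdCopyBufferToImage call on the target device.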

Related

Nvidia NVLink: does Vulkan see 2 GPUs as a single big one?

Would using multiple GPUs in Vulkan be something like making many command queues and then dividing command buffers between them?
There are 2 problems:
In OpenGL, we use GLEW to get functions. With more than one GPU, each GPU has its own driver. How would we use Vulkan?
Would part of the frame be generated by one GPU and the rest by other GPUs, such as using the Intel GPU to render the UI and the AMD or Nvidia GPU to render the game screen on laptops, for example? Or would one frame be generated on one GPU and the next frame on another?
Updated with more recent information, now that Vulkan exists.
There are two kinds of multi-GPU setups: where multiple GPUs are part of some SLI-style setup, and the kind where they are not. Vulkan supports both, and supports them both in the same computer. That is, you can have two NVIDIA GPUs that are SLI-ed together, and the Intel embedded GPU, and Vulkan can interact with them all.
Non-SLI setups
In Vulkan, there is something called the Vulkan instance. This represents the base Vulkan system itself; individual devices register themselves to the instance. The Vulkan instance system is, essentially, implemented by the Vulkan SDK.
Physical devices represent a specific piece of hardware that implements the interface to a GPU. Each piece of hardware that exposes a Vulkan implementation does so by registering its physical device with the instance system. You can query which physical devices are available, as well as some basic properties about them (their names, how much memory they offer, etc).
You then create logical devices for the physical devices you use. Logical devices are how you actually do stuff in Vulkan. They have queues, command buffers, etc. And each logical device is separate... mostly.
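As a rough, hedged sketch (queue-family selection is reduced to the assumption that family 0 will do, error handling omitted), enumerating the physical devices and creating one logical device per GPU looks something like:

    #include <vulkan/vulkan.h>
    #include <vector>

    std::vector<VkDevice> createLogicalDevices(VkInstance instance)
    {
        uint32_t count = 0;
        vkEnumeratePhysicalDevices(instance, &count, nullptr);
        std::vector<VkPhysicalDevice> physicalDevices(count);
        vkEnumeratePhysicalDevices(instance, &count, physicalDevices.data());

        std::vector<VkDevice> devices;
        const float priority = 1.0f;
        for (VkPhysicalDevice pd : physicalDevices) {
            VkDeviceQueueCreateInfo queueInfo{VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO};
            queueInfo.queueFamilyIndex = 0;   // simplification: pick a real family in practice
            queueInfo.queueCount = 1;
            queueInfo.pQueuePriorities = &priority;

            VkDeviceCreateInfo deviceInfo{VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO};
            deviceInfo.queueCreateInfoCount = 1;
            deviceInfo.pQueueCreateInfos = &queueInfo;

            VkDevice device = VK_NULL_HANDLE;
            vkCreateDevice(pd, &deviceInfo, nullptr, &device);
            devices.push_back(device);
        }
        return devices;
    }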
Now, you can bypass the whole "instance" thing and load devices manually. But you really shouldn't. At least, not unless you're at the end of development. Vulkan layers are far too critical for day-to-day debugging to just opt out of that.
There are mechanisms, core in Vulkan 1.1, that allow individual devices to be able to communicate some information to other devices. In 1.1, only certain kinds of information can be shared across physical devices (namely, fences and semaphores, and even then, only on Linux through sync files). While these APIs could provide a mechanism for sharing data between two physical devices, at present, the restriction on most forms of data sharing is that both physical devices must have matching UUIDs (and therefore are the same physical device).
SLI setups
Dealing with SLI is covered by two Vulkan 1.0 extensions: KHR_device_group and KHR_device_group_creation. The former is for dealing with "device groups" in Vulkan, while the latter is an instance extension for creating device-grouped devices. Both of these are core in Vulkan 1.1.
The idea with this is that the SLI aggregation is exposed as a single VkDevice, which is created from a number of VkPhysicalDevices. Each internal physical device is a "sub-device". You can query sub-devices and some properties about them. Memory allocations are specific to a particular sub-device. Resource objects (buffers and images) are not specific to a sub-device, but they can be associated with different memory allocations on the different sub-devices.
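A hedged sketch of that device-group creation path; the group is assumed to come from vkEnumeratePhysicalDeviceGroups, and baseCreateInfo is assumed to already describe queues, extensions and features:

    #include <vulkan/vulkan.h>

    VkDevice createDeviceFromGroup(const VkPhysicalDeviceGroupProperties &group,
                                   VkDeviceCreateInfo baseCreateInfo)
    {
        VkDeviceGroupDeviceCreateInfo groupInfo{VK_STRUCTURE_TYPE_DEVICE_GROUP_DEVICE_CREATE_INFO};
        groupInfo.physicalDeviceCount = group.physicalDeviceCount;
        groupInfo.pPhysicalDevices = group.physicalDevices;

        baseCreateInfo.pNext = &groupInfo;   // chain the group info onto the create info

        VkDevice device = VK_NULL_HANDLE;
        // Any of the group's physical devices can serve as the "parent" here.
        vkCreateDevice(group.physicalDevices[0], &baseCreateInfo, nullptr, &device);
        return device;
    }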
Command buffers and queues are not specific to sub-devices; when you execute a CB on a queue, the driver figures out which sub-device(s) it will run on, and fills in the descriptors that use the images/buffers with the proper GPU pointers for the memory that those images/buffers have been bound to on those particular sub-devices.
Alternate-frame rendering is simply presenting images generated from one sub-device on one frame, then presenting images from a different sub-device on another frame. Split-frame rendering is handled by a more complex mechanism, where you define the memory for the destination image of a rendering command to be split among devices. You can even do this with presentable images.
In Vulkan you need to enumerate the devices and select the one you want to work with. Nothing stops you from working with two different ones separately. Each Vulkan call takes at least one parameter (instance, device, queue, command buffer, ...) as context; the loader layer uses it to forward the call to the correct driver. Alternatively, you can load the functions for each device separately to avoid the loader's trampoline.
A generated frame will need to be forwarded to the card that is connected to the screen for display, so it's more likely that one GPU is responsible for graphics and the others are used for physics.
Only a single device can be connected to a specific surface at a time, so that device needs to receive the rendered frame and copy it into the renderable image that gets pushed to the screen.
Device groups are the way to go. Look at the Vulkan specification for documentation. Vulkan handles all the dispatch to the other GPUs (when they are connected by SLI/CrossFire). All you need to do is tell Vulkan how the dispatch is done (for example, dispatch one frame on one GPU and the next on another). If you need to do compute work you will need to address each GPU individually. For reference: https://www.ea.com/seed/news/khronos-munich-2018-halcyon-vulkan

In D3D12, can the render target view be any buffer?

In the samples I have looked at so far some of the commands are something like:
1. D3D12_DESCRIPTOR_HEAP_DESC with D3D12_DESCRIPTOR_HEAP_TYPE::D3D12_DESCRIPTOR_HEAP_TYPE_RTV
2. ID3D12Device::CreateDescriptorHeap
3. D3D12_CPU_DESCRIPTOR_HANDLE with ID3D12DescriptorHeap::GetCPUDescriptorHandleForHeapStart (and maybe ID3D12Device::GetDescriptorHandleIncrementSize)
4. (Optional?) ID3D12Device::CreateRenderTargetView
4.1. (Optional?) IDXGISwapChain3::GetBuffer
5. Schedule rendering stuff to a command list, of note OMSetRenderTargets and DrawInstanced, and then close the command list.
6. ID3D12CommandQueue::ExecuteCommandLists
6.1. (Optional?) IDXGISwapChain3::Present
7. Schedule a signal with ID3D12CommandQueue::Signal on a fence
8. Wait for the GPU to finish with ID3D12Fence::SetEventOnCompletion and WaitForSingleObjectEx
If possible, how can step 4.1 be replaced with a buffer of choice? I.e. how can one create an ID3D12Resource*, render to it, and then read from it into, say, a std::vector? (I assume that if this is possible, step 6.1 can be ignored, since there is no render target view for the swap chain to present. Perhaps step 4 is unnecessary as well in this case? Maybe only OMSetRenderTargets matters?)
It depends on the video memory architecture as to where exactly the render target can be located. On some systems, it's in dedicated video memory that only the video card can access. In some systems it's in video memory shared across the bus that both CPU and GPU can access. In unified memory architectures, everything is in system memory.
Therefore, you have restrictions on where exactly the render target can be located. This is why you have to use D3D12_HEAP_TYPE_DEFAULT and specify D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET when creating the ID3D12Resource you plan to bind as a render target (this is implicitly done by DXGI when you create the swap chain render target as well).
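As an illustrative sketch, not the only way to do it, creating such an offscreen render target with CreateCommittedResource might look like this; "device" and the RTV descriptor handle are assumed to exist already:

    #include <d3d12.h>
    #include <wrl/client.h>

    Microsoft::WRL::ComPtr<ID3D12Resource> CreateOffscreenTarget(
        ID3D12Device* device, UINT width, UINT height,
        D3D12_CPU_DESCRIPTOR_HANDLE rtvHandle)
    {
        D3D12_HEAP_PROPERTIES heapProps{};
        heapProps.Type = D3D12_HEAP_TYPE_DEFAULT;             // GPU-local memory

        D3D12_RESOURCE_DESC desc{};
        desc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
        desc.Width = width;
        desc.Height = height;
        desc.DepthOrArraySize = 1;
        desc.MipLevels = 1;
        desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
        desc.SampleDesc.Count = 1;
        desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET; // required to use it as an RTV

        D3D12_CLEAR_VALUE clearValue{};
        clearValue.Format = desc.Format;

        Microsoft::WRL::ComPtr<ID3D12Resource> target;
        device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &desc,
                                        D3D12_RESOURCE_STATE_RENDER_TARGET,
                                        &clearValue, IID_PPV_ARGS(&target));

        // An RTV is still needed so OMSetRenderTargets can bind this texture.
        device->CreateRenderTargetView(target.Get(), nullptr, rtvHandle);
        return target;
    }

OMSetRenderTargets then takes rtvHandle just as it would for a swap chain back buffer.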
Generally speaking you can't and shouldn't use the low-level DXGI surface creation APIs to create Direct3D resources. They mostly exist for system use rather than by applications.
Unless you happen to be on a UMA system, you should minimize CPU access to the render target, as it will otherwise require expensive copies. Even on UMA systems, there's also a required de-tiling step to get the results into a linear form.
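A sketch of what that copy tends to look like in practice: record a CopyTextureRegion from the render target (transitioned to D3D12_RESOURCE_STATE_COPY_SOURCE, not shown) into a buffer allocated in a D3D12_HEAP_TYPE_READBACK heap, then map that buffer after the fence wait. Names are placeholders and error handling is omitted:

    #include <d3d12.h>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    void RecordReadbackCopy(ID3D12Device* device, ID3D12GraphicsCommandList* cmdList,
                            ID3D12Resource* renderTarget, ID3D12Resource* readbackBuffer,
                            D3D12_PLACED_SUBRESOURCE_FOOTPRINT* footprintOut)
    {
        // Ask the device how subresource 0 is laid out when copied into a buffer.
        D3D12_RESOURCE_DESC desc = renderTarget->GetDesc();
        device->GetCopyableFootprints(&desc, 0, 1, 0, footprintOut, nullptr, nullptr, nullptr);

        D3D12_TEXTURE_COPY_LOCATION src{};
        src.pResource = renderTarget;
        src.Type = D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX;
        src.SubresourceIndex = 0;

        D3D12_TEXTURE_COPY_LOCATION dst{};
        dst.pResource = readbackBuffer;                       // buffer in a READBACK heap
        dst.Type = D3D12_TEXTURE_COPY_TYPE_PLACED_FOOTPRINT;
        dst.PlacedFootprint = *footprintOut;

        cmdList->CopyTextureRegion(&dst, 0, 0, 0, &src, nullptr);
    }

    std::vector<uint8_t> ReadPixelsAfterFenceWait(ID3D12Resource* readbackBuffer,
                                                  const D3D12_PLACED_SUBRESOURCE_FOOTPRINT& fp)
    {
        const size_t size = static_cast<size_t>(fp.Footprint.RowPitch) * fp.Footprint.Height;
        std::vector<uint8_t> pixels(size);                    // rows are RowPitch-aligned

        void* mapped = nullptr;
        D3D12_RANGE readRange{0, size};
        readbackBuffer->Map(0, &readRange, &mapped);
        std::memcpy(pixels.data(), mapped, size);
        D3D12_RANGE writtenRange{0, 0};                       // the CPU wrote nothing back
        readbackBuffer->Unmap(0, &writtenRange);
        return pixels;
    }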
Direct3D 12 also offers the "Placed Resource" methods, which can provide more control over where exactly the memory is allocated (or, more specifically, where the virtual memory addresses are allocated), but you still have to abide by the underlying architecture limitations. Depending on the memory architecture, you can "alias" multiple different ID3D12Resource instances all using the same memory (such as a render target being aliased as an unordered access resource), but you are responsible for inserting the required resource barriers into the command list (and testing it) to make sure that it works reliably on all DX12 hardware. See MSDN.
You do not have to Present your render target if you don't need the user to see the result.
Memory Management Strategies
UMA Optimizations: CPU Accessible Textures and Standard Swizzle
Getting Started with Direct3D 12
If you are new to DirectX 12, you should see the DirectX Tool Kit for DirectX 12 tutorials. If you aren't already familiar with DirectX 11, you should wait on DirectX 12 and start with DirectX Tool Kit for DirectX 11.

Are there any OpenGL operations that cost significant GPU resources?

The situation I've met is that I run several OpenGL programs concurrently on a single server with only one GPU card. They work fine at 60 FPS.
But when I restart one of them, the FPS of the others drops a lot, maybe to 3X or even 1X.
If I restart two or more at the same time, it can be worse; the FPS can drop to single digits.
I'm wondering whether there is any OpenGL initialization operation (context creation, texture setup) that may cost a lot of GPU resources.
The environment: Linux (Ubuntu 14.04), NVIDIA GTX 770, X11 window system.
There are indeed some operations that most OpenGL implementations carry out on the CPU. Most notably when sending images to the GPU any necessary format conversions (i.e. if the image data is not yet in a format that can be processed by the GPU) are done on the CPU. Also every OpenGL object that carries bulk data (textures, render buffers, array buffer objects) usually requires a shadow copy in system memory (the GPU memory mostly acts like a cache) and initializing that is a costly operation as well.
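As a hedged illustration of the format-conversion point (exact behaviour is driver- and hardware-dependent, so treat this as a rule of thumb rather than a guarantee):

    #include <GL/gl.h>

    void uploadTexture(GLuint tex, GLsizei w, GLsizei h, const void* pixels)
    {
        glBindTexture(GL_TEXTURE_2D, tex);

        // Often converted on the CPU by the driver (3-byte RGB repacked to RGBA):
        // glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
        //              GL_RGB, GL_UNSIGNED_BYTE, pixels);

        // More likely to be a straight DMA upload on common desktop GPUs:
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
                     GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
    }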
Sending bulk data to the GPU consumes bus bandwidth. While this does not really hit the CPU or GPU directly, since it is carried out using DMA transfers, it does consume memory bandwidth on both ends, thus impacting memory-intensive operations there.

How to draw a pixel by changing video memory map directly in a C program (without library functions)

Is it possible to display a black dot by changing values in the screen (video, i.e. monitor) memory map in RAM using a C program?
I don't want to use any library functions, as my primary aim is to learn how to develop a simple OS.
I tried accessing the start of the screen memory map, i.e. 0xA0000 (in C).
I tried to run the program but got a segmentation fault, since no direct access is provided. Run as superuser, the program executes without any visible change.
Currently I am testing in VirtualBox.
A "real" operating system will not use the framebuffer at address 0xA0000, so you can't draw on the screen by writing to it directly. Instead your OS probably has proper video drivers that will talk to the hardware in various very involved ways.
In short there's no easy way to do what you want to do on a modern OS.
On the other hand, if you want to learn how to write your own OS, then it would be very good practice to try to write a minimal kernel that can output to the VGA text framebuffer at 0xB8000 and maybe then the VGA graphic framebuffer at 0xA0000.
You can start using those framebuffers and drawing on the screen almost immediately after the BIOS jumps to your kernel, with a minimal amount of setting up. You could do that directly from real mode in maybe a hundred lines of assembler tops, or perhaps in C with a couple lines of assembler glue first.
Even simpler would be to have GRUB set up the hardware and boot your minimal kernel; then you can write to the framebuffer directly in a couple of lines.
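A minimal freestanding sketch of that idea, assuming something like GRUB has loaded the kernel with VGA text mode active and the buffer identity-mapped; on a hosted OS this will simply fault, as discussed above:

    #include <stdint.h>

    static volatile uint16_t* const vga_text = (uint16_t*)0xB8000;

    void put_char_at(char c, uint8_t colour, int row, int col)
    {
        // Each cell is 16 bits: low byte = ASCII code, high byte = attribute
        // (foreground/background colour). The text screen is 80 columns wide.
        vga_text[row * 80 + col] = (uint16_t)((uint16_t)colour << 8 | (uint8_t)c);
    }

    void kernel_main(void)
    {
        put_char_at('A', 0x0F, 0, 0);   // white-on-black 'A' in the top-left corner
        for (;;) { /* hang */ }
    }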
The short answer is no, because the frame buffer on modern operating systems is set up as determined by the VBIOS and kernel driver(s). It depends on the amount of VRAM present on the board, the size of the GART, the physical RAM present and a whole bunch of other stuff (VRAM reservation, whether it should be visible to the CPU or not, etc.). On top of this, modern OSes use multiple back buffers and flip the hardware display between these buffers, so even if you could poke the frame buffer directly, the address would change from frame to frame.
If you are interested in doing this for learning purposes, I would recommend creating a simple OGL or D3D (for example) 'function' which takes a 'fake' system-allocated frame buffer and presents it to the screen using regular HW operations.
You could even set the refresh up on a timer to fake the display update.
Then your fake OS would just write pixels to the fake system memory buffer and this rendering function would take care of displaying it as if it were real.
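A rough sketch of that approach, assuming window/context creation (GLFW, SDL, ...) is handled elsewhere; the legacy glDrawPixels call keeps the example tiny, though a textured quad would be the more modern route:

    #include <GL/gl.h>
    #include <cstdint>
    #include <vector>

    struct FakeFramebuffer {
        int width, height;
        std::vector<uint32_t> pixels;   // 0xAARRGGBB, written by the toy "OS" code
    };

    void presentFakeFramebuffer(const FakeFramebuffer& fb)
    {
        glClear(GL_COLOR_BUFFER_BIT);
        glRasterPos2i(-1, -1);          // start at the bottom-left of the viewport
        // On little-endian machines 0xAARRGGBB in memory matches GL_BGRA byte order.
        glDrawPixels(fb.width, fb.height, GL_BGRA, GL_UNSIGNED_BYTE, fb.pixels.data());
        // ...followed by the usual buffer swap of whatever windowing library is used.
    }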

How to read a pixel depth value without stalling the pipeline?

Using glReadPixels on a single pixel stalls the pipeline even if I have swapped the buffers just before.
I don't need synchronization; I can do something like this:
    pixel = DEFAULT_VALUE;
    while (1) {
        draw(pixel);
        swapBuffers();
        pixel = glRead???;
    }
How can I do this in an optimized (non-stalling) way?
You can do asynchronous pixel transfers via Pixel Buffer Objects (PBOs). When you issue a read call without a PBO, the pipeline is flushed and the CPU has to wait for the GPU to finish rendering and transferring the data. With PBOs, you provide a buffer in advance, and the data will be copied into that buffer when the GPU is ready, so it will not stall. It will of course stall when you try to access that buffer before it is ready (e.g. via glGetBufferSubData() or by mapping the buffer for reading). So ideally, before reading back the data, you queue up some other render commands and also do some other CPU work before accessing the buffer. The ARB_pixel_buffer_object extension spec has an example section, which is quite interesting.
This stuff can also be combined with sync objects. In that case, you can add a fence sync after the read call which will copy the data into the PBO. Then, on the CPU you can actually check if the operation is already completed. If not, you can do some other work and check back.
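Putting both pieces together, a sketch of the PBO-plus-fence pattern for a single RGBA pixel might look like this (OpenGL 3.2 or the corresponding extensions assumed, a loader such as GLEW assumed, error handling omitted):

    #include <GL/glew.h>

    GLuint pbo = 0;
    GLsync fence = 0;

    void startAsyncRead(int x, int y)
    {
        if (!pbo) {
            glGenBuffers(1, &pbo);
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
            glBufferData(GL_PIXEL_PACK_BUFFER, 4, nullptr, GL_STREAM_READ);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
        // With a pack PBO bound, the last argument is an offset into the buffer
        // and the call returns without waiting for the GPU.
        glReadPixels(x, y, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
        fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }

    bool tryGetPixel(unsigned char rgba[4])
    {
        GLint status = GL_UNSIGNALED;
        glGetSynciv(fence, GL_SYNC_STATUS, sizeof(status), nullptr, &status);
        if (status != GL_SIGNALED)
            return false;               // not ready yet: render or do CPU work and retry later

        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
        const unsigned char* ptr = static_cast<const unsigned char*>(
            glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, 4, GL_MAP_READ_BIT));
        for (int i = 0; i < 4; ++i) rgba[i] = ptr[i];
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
        glDeleteSync(fence);
        fence = 0;
        return true;
    }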
The main problem with all these asynchronous transfers is that you gain throughput at the cost of latency. So if you need that pixel value immediately and don't have any other work for the GPU and CPU that can be done in between, there is not much to gain, and you cannot really avoid the stall.