I want to copy an image/buffer between two GPUs/physical devices in my Vulkan application (one vkInstance, two vkDevices). Is this possible without staging the image on the CPU or is there a feature like CUDA p2p? How would this look?
If staging on the host is required, what would be the optimal method for this?
is there a feature like CUDA p2p?
Vulkan 1.1 supports the concept of device groups to cover this situation.
It allows you to treat a set of physical devices as a single logical device, and also lets you query how memory can be manipulated within the device group, as well as do things like allocate memory on a subset of devices. Check the specifications for the full set of functionality.
Is this possible without staging the image on the CPU
If your devices don't support the extenson VK_KHR_device_group, then no. You must transfer the content through the CPU and system memory.
Since buffers are per-device, you would need two host-visible staging buffers, one for the read operation, and another for the write operation. You'll also need two queues, two command buffers, etc, etc...
You'll have to execute 3 operations with manual synchronization.
On the source GPU execute a copy from the device-local buffer to the host visible buffer for the same device.
On the CPU copy from the source GPU host visible buffer to the target GPU host-visible buffer
On the target GPU copy from the host-visible buffer to the device-local buffer
Make sure to inspect your device queue family properties and if possible use a queue from a queue family that is marked as transfer capable but not graphics or compute capable. The fewer flags a Vulkan queue family has, the better suited it is to the operations that it does have flags for. Most modern discrete GPUs have dedicated transfer queues, but again, queues are specific to devices, so you'll need to be interacting with one queue for each device to execute the transfer.
If staging on the host is required, what would be the optimal method for this?
Exactly how to execute this depends on your use case. If you want to execute the whole thing synchronously in a single thread, then you'll just be doing a bunch of submits and then waiting on fences. If you want to do it asynchronously in the background while you continue to render frames, then you'll still be doing the submits, but you'll have to non-blocking checking on the fences to see when operations complete before you move to the next part.
If you're transferring buffers there's probably nothing to be worried about in terms of optimal transfer, but if you're dealing with images then you have to get into the whole linear vs optimal image tiling mess. In order to avoid that I'd suggest using host visible buffers for staging, regardless of whether you're transferring images or buffers, and as such use vkCmdCopyImageToBuffer and vkCmdCopyBufferToImage to do the transfers between device-local and host-visible memory
In the samples I have looked at so far some of the commands are something like:
D3D12_DESCRIPTOR_HEAP_DESC with D3D12_DESCRIPTOR_HEAP_TYPE::D3D12_DESCRIPTOR_HEAP_TYPE_RTV
ID3D12Device::CreateDescriptorHeap
D3D12_CPU_DESCRIPTOR_HANDLE with ID3D12DescriptorHeap::GetCPUDescriptorHandleForHeapStart (and maybe ID3D12Device::GetDescriptorHandleIncrementSize)
(Optional?) ID3D12Device::CreateRenderTargetView
(Optional?) IDXGISwapChain3::GetBuffer
Schedule rendering stuff to a command list, of note OMSetRenderTargets and DrawInstanced, and then close the command list.
ID3D12CommandQueue::ExecuteCommandLists
(Optional?) IDXGISwapChain3::Present
Schedule a signal with ID3D12CommandQueue::Signal on a fence
Wait for the GPU to finish with ID3D12Fence::SetEventOnCompletion and WaitForSingleObjectEx
If possible, how can step 4.1 be replaced with a buffer of choice? I.e. how can one create a ID3D12Resource* and render to it, and then read from it into say a std::vector? (I assume if this is possible that step 6.1 can be ignored, since there is no render target view for the swap chain to present to. Perhaps step 4 is unnecessary as well in this case ? Maybe only OMSetRenderTargets matters ?)
It depends on the video memory architecture as to where exactly the render target can be located. On some systems, it's in dedicated video memory that only the video card can access. In some systems it's in video memory shared across the bus that both CPU and GPU can access. In unified memory architectures, everything is in system memory.
Therefore, you have restrictions on where exactly the render target can be located. This is why you have to use D3D12_HEAP_TYPE_DEFAULT and specify D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET when creating the ID3D12Resource you plan to bind as a render target (this is implicitly done by DXGI when you create the swap chain render target as well).
Generally speaking you can't and shouldn't use the low-level DXGI surface creation APIs to create Direct3D resources. They mostly exist for system use rather than by applications.
Unless you happen to be on a UMA system, you should minimize the CPU access to the render target as it will require expensive copies otherwise. Even on UMA systems, there's also required de-tiling as well to get the results into a linear form.
Direct3D 12 also offers the "Placed Resource" methods as well which can provide more control over where exactly the memory is allocated (or more specifically where the virtual memory addresses are allocated), but you still have to abide by the underlying architecture limitations. Depending on the memory architecture, you can "alias" multiple different ID3D12Resource instances all using the same memory (such as a render target being be aliased as a unordered access resource), but you are responsible for inserting the required resource barriers into the command list (and test it) to make sure that it works reliably on all DX12 hardware. See MSDN.
You do not have to Present your render target if you don't need the user to see the result.
Memory Management Strategies
UMA Optimizations: CPU Accessible Textures and Standard Swizzle
Getting Started with Direct3D 12
If you are new to DirectX 12, you should see the DirectX Tool Kit for DirectX 12 tutorials. If you aren't already familiar with DirectX 11, you should wait on DirectX 12 and start with DirectX Tool Kit for DirectX 11.
I was searching for the difference between these two in terms of GPU reading speed and occasionally CPU writing(less then once per frame or even only once). I don't want to use D3D11_USAGE_DYNAMIC cause data will not be updated >= once per frame.
Is there a significant performance increase with Default + Staging combo over the Default buffer?
The best performance advice for Direct3D 11 here is actually the same as Direct3D 10.0. Start by reviewing this talk Windows to Reality: Getting the Most out of Direct3D 10 Graphics in Your Games from Gamefest 2007.
To your question, any updates to a resource (texture or buffer) will have some potential performance impact due, but for 'occasional' updates, the best option is to use a STAGING resource and then CopyResource to a DEFAULT resource for actual rendering.
DYNAMIC for textures should be reserved for very frequent updates (like say video texture playback), and of course for vertex buffers when doing dynamic draw submission. Constant buffers are really intended to be DYNAMIC or use UpdateSubresource which really depends on your update pattern (a topic covered in the talk above).
Whenever possible, creating resources as IMMUTABLE and with pInitialData is the best option with DirectX 11 as it potentially allows the driver some opportunity for multi-threaded resource creation which is more efficient.
The main thing to be aware of with this pattern is that STAGING resources can result in virtual-memory fragmentation which can be a problem for 32-bit (X86) apps, so you should try to use them rather than creating a lot of them or destroying them and recreating them. See the talk "Why Your Windows Game Won't Run In 2,147,352,576 Bytes" which is attached to this blog post.
I'd suggest your initial version would be to try to use DEAFAULT+UpdateSubresource and then compare it with a DEAULT+STAGING+CopyResource solution because it really heavily depends on your content and code.
How to deal with OpenGL cube map textures in CUDA?
When one want to use OpenGL textures in CUDA kernel one of the things to do is to retrieve a CUDA array from registered image and mapped resource, in this case a texture. In driver API it is done by cuGraphicsSubResourceGetMappedArray call, which in a case of 2D texture is not a problem. But when talking about aforementioned cube map, third parameter of this function requires a face enum (like CU_CUBEMAP_FACE_POSITIVE_X). Thus some questions arise - when one passes such an enum, then the returned texture array will contain only data of that particular face, right? Then how to use cube texture as a whole, to perform cube mapping, likewise:
color = texCube(cubeMap, x, y, z);
Or is it impossible to do so in CUDA kernel and one need to use 2D textures with proper calculations and sampling in user code?
OK - I managed to solve the problem myself, though the solution isn't as simple as using another CUDA function.
To bind a CUDA texture reference with any texture, be it one obtained from OpenGL or D3D, one has to provide a CUDA array that is mapped to a resource, using cuGraphicsSubResourceGetMappedArray to retrieve it. As I mentioned in the question, it is simple in case of a one or two dimensional texture. But with other available types it is more complicated.
At any time we need a CUDA array that the reference is bound to. Same goes with the cube map texture. But in such a case the array has to be a 3D one. The problem is that CUDA driver API provides only the aforementioned function to retrieve a single layer from such a texture resource, and map it to a single, two dimensional array. To get what we want we have to make ourselves the 3D array containing all the layers (or faces in case of a cube map).
First of all we have to get arrays for each layer/face using the above function. Next step is to create the 3D array by call to cuArray3DCreate, fed with proper set of parameters (size/number of layers, level of detail, data format, number of channels per texel and some flags). Then we have to copy the layers' arrays to the 3D one with a series of calls to cuMemcpy3D, one for each layer/face array.
Finally, we set our target CUDA texture reference with cuTexRefSetArray, fed with the 3D array we created and copied to. Inside of the device code we create a reference with proper texture type and mode (float4 and cube map) and sample it with texCubemap.
Below I put a fragment of the function which does all that, available in full length in CIRT Repository (cirt_server.c file, function cirtTexImage3D).
//...
if (result)
{
// Create a 3D array...
CUDA_ARRAY3D_DESCRIPTOR layeredTextureDescr;
layeredTextureDescr.Width = w;
layeredTextureDescr.Height = h;
layeredTextureDescr.Depth = d;
layeredTextureDescr.Format = map_type_to_format(type);
layeredTextureDescr.NumChannels = format == CIRT_RGB ? CIRT_RGBA : format;
layeredTextureDescr.Flags = map_target_to_flags(target);
if (result) result = LogCUDADriverCall(cuArray3DCreate(&hTexRefArray, &layeredTextureDescr),
FUN_NAME(": cuArray3DCreate_tex3D"), __FILE_LINE__);
// Copy the acquired layer/face arrays into the collective 3D one...
CUDA_MEMCPY3D layerCopyDescr;
layerCopyDescr.srcMemoryType = CU_MEMORYTYPE_ARRAY;
layerCopyDescr.srcXInBytes = 0;
layerCopyDescr.srcZ = 0;
layerCopyDescr.srcY = 0;
layerCopyDescr.srcLOD = 0;
layerCopyDescr.dstMemoryType = CU_MEMORYTYPE_ARRAY;
layerCopyDescr.dstLOD = 0;
layerCopyDescr.WidthInBytes = layeredTextureDescr.NumChannels * w;
layerCopyDescr.Height = h;
layerCopyDescr.Depth = target == CIRT_TEXTURE_CUBE_MAP ? 1 : d;
layerCopyDescr.dstArray = hTexRefArray;
for (i = 0; i < num_layers; ++i)
{
layer = ((num_layers == 6) ? CU_CUBEMAP_FACE_POSITIVE_X + i : i);
layerCopyDescr.dstXInBytes = 0;
layerCopyDescr.dstY = 0;
layerCopyDescr.dstZ = i;
layerCopyDescr.srcArray = hLayres[i];
if (result) result = LogCUDADriverCall(cuMemcpy3D(&layerCopyDescr),
FUN_NAME(": cuMemcpy3D_tex3D"), __FILE_LINE__);
}
// Finally bind the 3D array with texture reference...
if (result) LogCUDADriverCall(cuTexRefSetArray(hTexRef, hTexRefArray, CU_TRSA_OVERRIDE_FORMAT),
FUN_NAME(": cuTexRefSetArray_tex3D"), __FILE_LINE__);
if (hLayres)
free(hLayres);
if (result)
current->m_oTextureManager.m_cuTextureRes[current->m_oTextureManager.m_nTexCount++] = hTexResource;
}
//...
I've checked it with cube maps only for now but it should work just fine with 3D texture as well.
I'm not real familiar with CUDA directly but I do have some experience in OpenGL and DirectX and I am also familiar with 3D Graphics Rendering APIs, Libraries and Pipelines and having the ability to setup and use those APIs.
When I look at your question(s):
How to deal with OpenGL cube map textures in CUDA?
And you proceed to explain it by this:
When one want to use OpenGL textures in CUDA kernel one of the things to do is to retrieve a CUDA array from registered image and mapped resource, in this case a texture. In driver API it is done by cuGraphicsSubResourceGetMappedArray call, which in a case of 2D texture is not a problem. But when talking about aforementioned cube map, third parameter of this function requires a face enum (like CU_CUBEMAP_FACE_POSITIVE_X). Thus some questions arise - when one passes such an enum, then the returned texture array will contain only data of that particular face, right? Then how to use cube texture as a whole, to perform cube mapping, likewise:
color = texCube(cubeMap, x, y, z);
Or is it impossible to do so in CUDA kernal and one need to use 2D textures with proper calculations and sampling in user code?
I went to CUDA's website for their API SDK & Programming Documentations. And found the function in question cuGraphicsSubResourceGetMappedArray()
CUresult cuGraphicsSubResourceGetMappedArray ( CUarray* pArray,
CUgraphicsResource resource,
unsigned int arrayIndex,
unsigned int mipLevel )
Get an array through which to access a subresource of a mapped graphics resource.
Parameters
pArray - Returned array through which a subresource of resource may be accessed
resource - Mapped resource to access
arrayIndex - Array index for array textures or cubemap face index as defined by CUarray_cubemap_face for cubemap textures for the subresource to access
mipLevel - Mipmap level for the subresource to access
Returns
CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED,
CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_VALUE,
CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_NOT_MAPPED,
CUDA_ERROR_NOT_MAPPED_AS_ARRAY
Description
Returns in *pArray an array through which the subresource of the mapped graphics resource resource which corresponds to array index arrayIndex and mipmap level mipLevel may be accessed. The value set in *pArray may change every time that resource is mapped.
If resource is not a texture then it cannot be accessed via an array and CUDA_ERROR_NOT_MAPPED_AS_ARRAY is returned. If arrayIndex is not a valid array index for resource then CUDA_ERROR_INVALID_VALUE is returned. If mipLevel is not a valid mipmap level for resource then CUDA_ERROR_INVALID_VALUE is returned. If resource is not mapped then CUDA_ERROR_NOT_MAPPED is returned.
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cuGraphicsResourceGetMappedPointer
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4ic22V4Dz
Follow us: #GPUComputing on Twitter | NVIDIA on Facebook
This function method was found in NVidia CUDA's DriverAPI and not in their RuntimeAPI. When understanding hardware with CUDA capability is that there is a difference between the Host and Device programmable pipelines which can be found here: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#axzz4ic6tFjXR
2. Heterogeneous Computing
CUDA programming involves running code on two different platforms concurrently: a host system with one or more CPUs and one or more CUDA-enabled NVIDIA GPU devices.
While NVIDIA GPUs are frequently associated with graphics, they are also powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can leverage parallel execution.
However, the device is based on a distinctly different design from the host system, and it's important to understand those differences and how they determine the performance of CUDA applications in order to use CUDA effectively.
2.1. Differences between Host and Device
The primary differences are in threading model and in separate physical memories:
Threading resources - Execution pipelines on host systems can support a limited number of concurrent threads. Servers that have four hex-core processors today can run only 24 threads concurrently (or 48 if the CPUs support Hyper-Threading.) By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (see Features and Specifications of the CUDA C Programming Guide) On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.
Threads - Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off CPU execution channels to provide multithreading capability. Context switches (when two threads are swapped) are therefore slow and expensive. By comparison, threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work (in warps of 32 threads each). If the GPU must wait on one warp of threads, it simply begins executing work on another. Because separate registers are allocated to all active threads, no swapping of registers or other state need occur when switching among GPU threads. Resources stay allocated to each thread until it completes its execution. In short, CPU cores are designed to minimize latency for one or two threads at a time each, whereas GPUs are designed to handle a large number of concurrent, lightweight threads in order to maximize throughput.
RAM - The host system and the device each have their own distinct attached physical memories. As the host and device memories are separated by the PCI Express (PCIe) bus, items in the host memory must occasionally be communicated across the bus to the device memory or vice versa as described in What Runs on a CUDA-Enabled Device?
These are the primary hardware differences between CPU hosts and GPU devices with respect to parallel programming. Other differences are discussed as they arise elsewhere in this document. Applications composed with these differences in mind can treat the host and device together as a cohesive heterogeneous system wherein each processing unit is leveraged to do the kind of work it does best: sequential work on the host and parallel work on the device.
Read more at: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#ixzz4ic8ch2fq
Follow us: #GPUComputing on Twitter | NVIDIA on Facebook
Now knowing that there are two different APIs for CUDAs API Libraries we have to understand the difference between the two found here: Difference Between the driver and runtime APIs
1. Difference between the driver and runtime APIs
The driver and runtime APIs are very similar and can for the most part be used interchangeably. However, there are some key differences worth noting between the two.
Complexity vs. control
The runtime API eases device code management by providing implicit initialization, context management, and module management. This leads to simpler code, but it also lacks the level of control that the driver API has.
In comparison, the driver API offers more fine-grained control, especially over contexts and module loading. Kernel launches are much more complex to implement, as the execution configuration and kernel parameters must be specified with explicit function calls. However, unlike the runtime, where all the kernels are automatically loaded during initialization and stay loaded for as long as the program runs, with the driver API it is possible to only keep the modules that are currently needed loaded, or even dynamically reload modules. The driver API is also language-independent as it only deals with cubin objects.
Context management
Context management can be done through the driver API, but is not exposed in the runtime API. Instead, the runtime API decides itself which context to use for a thread: if a context has been made current to the calling thread through the driver API, the runtime will use that, but if there is no such context, it uses a "primary context." Primary contexts are created as needed, one per device per process, are reference-counted, and are then destroyed when there are no more references to them. Within one process, all users of the runtime API will share the primary context, unless a context has been made current to each thread. The context that the runtime uses, i.e, either the current context or primary context, can be synchronized with cudaDeviceSynchronize(), and destroyed with cudaDeviceReset().
Using the runtime API with primary contexts has its tradeoffs, however. It can cause trouble for users writing plug-ins for larger software packages, for example, because if all plug-ins run in the same process, they will all share a context but will likely have no way to communicate with each other. So, if one of them calls cudaDeviceReset() after finishing all its CUDA work, the other plug-ins will fail because the context they were using was destroyed without their knowledge. To avoid this issue, CUDA clients can use the driver API to create and set the current context, and then use the runtime API to work with it. However, contexts may consume significant resources, such as device memory, extra host threads, and performance costs of context switching on the device. This runtime-driver context sharing is important when using the driver API in conjunction with libraries built on the runtime API, such as cuBLAS or cuFFT.
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4icCoAXb7
Follow us: #GPUComputing on Twitter | NVIDIA on Facebook
Since this happens to be found in the DriverAPI it has more flexibility of control towards the programmer but also requires more responsibility to manage where the RuntimeAPI library does things more automatic but gives you less control.
This is apparent since you mentioned that you are working with their Kernels but from the description of their implementation of the function
CUresult cuGraphicsSubResourceGetMappedArray ( CUarray* pArray,
CUgraphicsResource resource,
unsigned int arrayIndex,
unsigned int mipLevel )
The documentation is telling me that the first parameter that this function takes is a returned array through which a subresource of resource may be accessed. The second parameter of this function is the mapped graphics resource itself. The third parameter in which I believe is the parameter that you had in question where it is an enumerated type to a face and you then asked: When one passes such an enum, then the returned texture array will contain only data of that particular face, right? From what I gather and understand from the documentations is that this is an index value to an array of your cube map resource.
Which can be seen from their documentation:
arrayIndex - Array index for array textures or cubemap face index as defined by CUarray_cubemap_face for cubemap textures for the subresource to access
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4icHnwe9v
Follow us: #GPUComputing on Twitter | NVIDIA on Facebook
which happens to be an unsigned int or an index location into the textures that make up that cube map a typical cube map will have 6 faces or at most 12 if both inside and outside of the cube are mapped. So if we look at a cube map as well as textures and their relationship with pseudo code we can see that:
// Texture
struct Texture {
unsigned pixelsWidth;
unsigned pixelsHeight;
// Other Texture member variables or fields here.
};
// Only interested in the actual size of the texture `width by height`
// where these would be used to map this texture to one of the 6 faces
// of a cube:
struct CubeMap {
Texture face[6];
// face[0] = frontFace
// face[1] = backFace
// face[2] = leftFace
// face[3] = rightFace
// face[4] = topFace
// face[5] = bottomFace
};
The cubemap object has an array of textures that makes up its face and according to the documents the function that you have in question with its third parameter is asking you for an index into this texture array and the overall function will return this:
Returns in *pArray an array through which the subresource of the mapped graphics resource resource which corresponds to array index arrayIndex and mipmap level mipLevel may be accessed. The value set in *pArray may change every time that resource is mapped.
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4icKF1c00
Follow us: #GPUComputing on Twitter | NVIDIA on Facebook
I hope this helps to answer your question in regards to the use of the third parameter into the function you are trying to use from their API.
Edit
The OP had asked when passing this enum CU_CUBEMAP_FACE_POSITIVE_X to the third parameter of the above function call will it return only that face of the cube map which happens to be a texture. When looking at their documentation about this enumerated value or type found here: enum CUarray_cubemap_face
enum CUarray_cubemap_face - Array indices for cube faces
Values
CU_CUBEMAP_FACE_POSITIVE_X = 0x00
Positive X face of cubemap
CU_CUBEMAP_FACE_NEGATIVE_X = 0x01
Negative X face of cubemap
CU_CUBEMAP_FACE_POSITIVE_Y = 0x02
Positive Y face of cubemap
CU_CUBEMAP_FACE_NEGATIVE_Y = 0x03
Negative Y face of cubemap
CU_CUBEMAP_FACE_POSITIVE_Z = 0x04
Positive Z face of cubemap
CU_CUBEMAP_FACE_NEGATIVE_Z = 0x05
Negative Z face of cubemap
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4idOT67US
Follow us: #GPUComputing on Twitter | NVIDIA on Facebook
It appears to me that when using this method to query or get texture information that is stored into an array of a cube map, that the requirement of the third parameter being this enumerated value; is nothing more than the 0-index into that array. So by passing in CU_CUBEMAP_FACE_POSITIVE_X as the third parameter to me doesn't necessarily mean that you will only get back that particular face's texture. It appears to me that since this is the 0th index that it will return the entire array of textures. The old C style of passing around arrays as if they were pointers.
Currently I am loading an image in to memory on a 2nd thread, and then during the display loop (if there is a texture load required), load the texture.
I discovered that I could not load the texture on the 2nd thread because OpenGL didn't like that; perhaps this is possible but I did something wrong - so please correct me if this is actually possible.
On the other hand, if my failure was valid - how do I load a texture without disrupting the rendering loop? Currently the textures take around 1 second to load from memory, and although this isn't a major issue, it can be slightly irritating for the user.
You can load a texture from disk to memory on any thread you like, using any tool you wish for reading the files.
However, when you bind it to OpenGL, it's going to need to be handled on the same thread as the rendering for that OpenGL context. That being said, this discussion suggests that using a PBO in a second thread is an option, and can speed up the process.
You can certainly load the texture from disk into RAM in any number of threads you like, but OpenGL won't upload to VRAM in multiple threads for the reason mentioned in Reed's answer.
Given the loading from disk is the slowest part, thats the bit you'll probably want to thread. The loading thread(s) build up a queue of textures to be uploaded, then this queue is consumed by the thread that owns the GL context (mind your access to that queue by the various threads however). You could also consider a non-threaded approach of uploading N textures per frame, where N is a number that doesn't slow the rendering down too much.