Can the CPU access the back buffer used as a render target? - c++

I'm very new to DirectX 11 and have started studying it. I have seen the D3D11_USAGE enumeration in the D3D11_BUFFER_DESC structure and looked up what each value means.
D3D11_USAGE_DEFAULT means the GPU can read and write the resource, but no CPU access is allowed. This option is therefore used for the back buffer that serves as the render target.
So if I create a swap chain with DXGI_USAGE_RENDER_TARGET_OUTPUT as the buffer usage, get a back buffer from it, and create a render target from that back buffer, can the CPU access the back buffer or the render target?
typedef struct DXGI_SWAP_CHAIN_DESC
{
    DXGI_MODE_DESC BufferDesc;
    DXGI_SAMPLE_DESC SampleDesc;
    DXGI_USAGE BufferUsage;
    UINT BufferCount;
    HWND OutputWindow;
    BOOL Windowed;
    DXGI_SWAP_EFFECT SwapEffect;
    UINT Flags;
} DXGI_SWAP_CHAIN_DESC;
Based on the MSDN documentation (which says "Swap chains only support the DXGI_CPU_ACCESS_NONE value in the DXGI_CPU_ACCESS_FIELD part of DXGI_USAGE."), the CPU can't access it, but I couldn't find a clear statement. Am I right?
So, can the CPU access the back buffer used as a render target?

The CPU does not have access to the swap chain buffers, as per the MSDN documentation you cited.
I'm not sure why you would need to read from the back buffer in the first place. Reading data generated on the GPU back on the CPU is so slow that you're better off either doing all of your rendering on the CPU in the first place, or redesigning your algorithm so you don't need to access the data from the CPU. In general, the second is the best option: D3D is designed to push data from the CPU to the GPU, not to read data from the GPU back into the CPU.
In fact, if you look at the documentation for D3D11_USAGE, you'll find that the only way to create a resource with direct read access from the CPU is to create it as a staging resource (D3D11_USAGE_STAGING), which doesn't support creating views. Resources created with D3D11_USAGE_DYNAMIC can be written to by the CPU but not read; resources created with D3D11_USAGE_DEFAULT can be copied into through UpdateSubresource but otherwise cannot be touched by the CPU; and resources created with D3D11_USAGE_IMMUTABLE cannot be modified at all after creation.
A swap chain creates its buffers with the DXGI equivalent of D3D11_USAGE_DEFAULT, which denies the CPU access to the data. MSDN is merely telling you that this is the case, and that you should not put any CPU access flags into the BufferUsage member because they will either be ignored or rejected as erroneous.
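If you do find you need the back buffer's pixels on the CPU (for example for a screenshot), the usual pattern is to copy the back buffer into a staging texture and map that. A minimal sketch, assuming d3dDevice, d3dContext and swapChain already exist and that the back buffer is not multisampled:

ID3D11Texture2D* backBuffer = nullptr;
swapChain->GetBuffer(0, __uuidof(ID3D11Texture2D), reinterpret_cast<void**>(&backBuffer));

// Describe a CPU-readable staging copy with the same size and format.
D3D11_TEXTURE2D_DESC desc = {};
backBuffer->GetDesc(&desc);
desc.Usage = D3D11_USAGE_STAGING;            // CPU-accessible pool
desc.BindFlags = 0;                          // staging resources cannot be bound
desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
desc.MiscFlags = 0;

ID3D11Texture2D* staging = nullptr;
d3dDevice->CreateTexture2D(&desc, nullptr, &staging);

d3dContext->CopyResource(staging, backBuffer);   // GPU-side copy into the staging texture

D3D11_MAPPED_SUBRESOURCE mapped = {};
if (SUCCEEDED(d3dContext->Map(staging, 0, D3D11_MAP_READ, 0, &mapped)))
{
    // mapped.pData points at the pixels; each row is mapped.RowPitch bytes apart.
    d3dContext->Unmap(staging, 0);
}

staging->Release();
backBuffer->Release();

Note that the Map call stalls until the GPU has finished the copy, which is exactly the CPU/GPU synchronization cost warned about above.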

Related

How to update vertex buffer data frequently in DirectX 11?

I am trying to update my vertex buffer data with the Map function in DX. It does update the data once, but if I iterate over it the model disappears. I am actually trying to manipulate vertices in real time from user input, and to do so I have to update the vertex buffer every frame while a vertex is selected.
Perhaps this happens because the Map function disables GPU access to the vertices until the Unmap function is called. So if that access is blocked every frame, it kind of makes sense that it can't render the mesh. However, when I update the vertex every frame and then stop after some time, theoretically the mesh should show up again, but it doesn't.
I know that the proper way to update data every frame is to use constant buffers, but manipulating vertices with constant buffers might not be a good idea, and I don't think there is any other way to update the vertex data. I expect dynamic vertex buffers to be able to handle being updated every frame.
D3D11_MAPPED_SUBRESOURCE mappedResource;
ZeroMemory(&mappedResource, sizeof(D3D11_MAPPED_SUBRESOURCE));
// Disable GPU access to the vertex buffer data.
pRenderer->GetDeviceContext()->Map(pVBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResource);
// Update the vertex buffer here.
memcpy((Vertex*)mappedResource.pData + index, pData, sizeof(Vertex));
// Reenable GPU access to the vertex buffer data.
pRenderer->GetDeviceContext()->Unmap(pVBuffer, 0);
As has already been answered, the key issue is that you are using Discard (which means you won't be able to retrieve the existing contents from the GPU), so I thought I would add a little in terms of options.
The question I have is whether you require performance or the convenience of having the data in one location.
There are a few configurations you can try.
1. Set up your buffer to have both CPU read and write access. This does mean, though, that you will be pushing and pulling your buffer up and down the bus. It also causes performance issues on the GPU, such as stalls while it waits for the data to be moved back onto the GPU. I personally don't use this in my editor.
2. If memory is not an issue, keep a copy of your buffer on the CPU side; each frame, map with Discard and block-copy the data across. This is performant, but also memory intensive. You obviously have to manage the data partitioning and indexing into this space. I don't use this; I toyed with it, but it was too much effort!
3. Bite the bullet: map the buffer as in option 2, but write each vertex object into the mapped buffer individually. I do this, and unless the buffer is huge, I haven't had an issue with it in my own editor.
4. Use a compute shader to update the buffer: create a shader resource view and an unordered access view and pass the updates via a constant buffer. A bit of a sledgehammer to crack a walnut, and it still doesn't change the fact that you may need to pull the data back off the GPU, as in option 1.
There are some variations on managing the buffer you can also play with, such as interleaving (two copies, one in use on the GPU while the other is being written to). There are also some more ornate mechanisms, such as building the contents of the buffer on another thread and then flagging the update.
At the end of the day, DX11 doesn't offer the ability (someone may know better) to edit data in GPU memory directly, so there is a lot of shifting between CPU and GPU.
Good luck with whichever technique you choose.
Mapping a buffer with the D3D11_MAP_WRITE_DISCARD flag causes the entire buffer contents to become undefined; you cannot use it to update just a single vertex. Keep a copy of the buffer on the CPU side instead, and then update the entire buffer on the GPU side once per frame.
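A minimal sketch of that approach, assuming a dynamic vertex buffer and a std::vector<Vertex> kept on the CPU side (selectedIndex and editedVertex are illustrative names):

// The authoritative copy of the geometry lives on the CPU; edit it there...
std::vector<Vertex> cpuVertices;
cpuVertices[selectedIndex] = editedVertex;     // user edit for this frame

// ...then re-upload the whole buffer once per frame.
D3D11_MAPPED_SUBRESOURCE mapped = {};
ID3D11DeviceContext* ctx = pRenderer->GetDeviceContext();
if (SUCCEEDED(ctx->Map(pVBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
{
    // DISCARD hands back a fresh region, so the entire buffer must be rewritten.
    memcpy(mapped.pData, cpuVertices.data(), cpuVertices.size() * sizeof(Vertex));
    ctx->Unmap(pVBuffer, 0);
}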
If you develop for UWP, using Map/Unmap from multiple threads may result in synchronization problems; ID3D11DeviceContext methods are not thread safe: https://learn.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-intro.
If you update the buffer from one thread and render from another, you may get various errors. In this case you must use some synchronization mechanism, such as critical sections. An example is here: https://developernote.com/2015/11/synchronization-mechanism-in-directx-11-and-xaml-winrt-application/
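A minimal sketch of such a guard, assuming the update and render paths run on different threads and share the same immediate context (std::mutex stands in here for whatever critical section mechanism you prefer; all names are illustrative):

std::mutex contextMutex;   // serializes all use of the immediate context

void UpdateVertices(ID3D11DeviceContext* ctx, ID3D11Buffer* vb,
                    const std::vector<Vertex>& vertices)
{
    std::lock_guard<std::mutex> lock(contextMutex);
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(vb, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        memcpy(mapped.pData, vertices.data(), vertices.size() * sizeof(Vertex));
        ctx->Unmap(vb, 0);
    }
}

void RenderFrame(ID3D11DeviceContext* ctx)
{
    std::lock_guard<std::mutex> lock(contextMutex);
    // IASetVertexBuffers, draw calls, Present, etc. go here, all under the same lock.
}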

Device to device copy in Vulkan

I want to copy an image/buffer between two GPUs/physical devices in my Vulkan application (one VkInstance, two VkDevices). Is this possible without staging the image on the CPU, or is there a feature like CUDA p2p? How would this look?
If staging on the host is required, what would be the optimal method for this?
is there a feature like CUDA p2p?
Vulkan 1.1 supports the concept of device groups to cover this situation.
It allows you to treat a set of physical devices as a single logical device, and also lets you query how memory can be manipulated within the device group, as well as do things like allocate memory on a subset of devices. Check the specifications for the full set of functionality.
Is this possible without staging the image on the CPU
If your devices don't support the extension VK_KHR_device_group, then no. You must transfer the content through the CPU and system memory.
Since buffers are per-device, you would need two host-visible staging buffers, one for the read operation, and another for the write operation. You'll also need two queues, two command buffers, etc, etc...
You'll have to execute three operations with manual synchronization:
1. On the source GPU, execute a copy from the device-local buffer to the host-visible buffer for the same device.
2. On the CPU, copy from the source GPU's host-visible buffer to the target GPU's host-visible buffer.
3. On the target GPU, copy from the host-visible buffer to the device-local buffer.
Make sure to inspect your device queue family properties and, if possible, use a queue from a queue family that is marked as transfer capable but not graphics or compute capable. The fewer flags a Vulkan queue family has, the better suited it is to the operations it does have flags for. Most modern discrete GPUs have dedicated transfer queues, but again, queues are specific to devices, so you'll need to interact with one queue per device to execute the transfer.
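A minimal sketch of picking such a queue family on one physical device (the same selection would be repeated for the other device):

// Prefer a family that supports transfer but neither graphics nor compute;
// otherwise fall back to any transfer-capable family.
uint32_t findTransferQueueFamily(VkPhysicalDevice physicalDevice)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, families.data());

    uint32_t fallback = UINT32_MAX;
    for (uint32_t i = 0; i < count; ++i)
    {
        VkQueueFlags flags = families[i].queueFlags;
        if (!(flags & VK_QUEUE_TRANSFER_BIT))
            continue;
        if (!(flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)))
            return i;                 // dedicated transfer family
        if (fallback == UINT32_MAX)
            fallback = i;             // transfer-capable, but shared with graphics/compute
    }
    return fallback;                  // UINT32_MAX if nothing advertises transfer
}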
If staging on the host is required, what would be the optimal method for this?
Exactly how to execute this depends on your use case. If you want to execute the whole thing synchronously in a single thread, then you'll just be doing a bunch of submits and then waiting on fences. If you want to do it asynchronously in the background while you continue to render frames, then you'll still be doing the submits, but you'll have to do non-blocking checks on the fences to see when operations complete before you move on to the next part.
If you're transferring buffers, there's probably nothing to worry about in terms of optimal transfer, but if you're dealing with images then you get into the whole linear-versus-optimal image tiling mess. To avoid that, I'd suggest using host-visible buffers for staging regardless of whether you're transferring images or buffers, and therefore using vkCmdCopyImageToBuffer and vkCmdCopyBufferToImage to do the transfers between device-local and host-visible memory.
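For reference, a minimal sketch of the image-to-buffer half on the source device; sourceImage, stagingBuffer, commandBuffer, width and height are illustrative names, the image is assumed to already be in VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL, and the command buffer is assumed to be recording:

// Copy one mip level of a 2D color image into a host-visible staging buffer.
VkBufferImageCopy region = {};
region.bufferOffset = 0;
region.bufferRowLength = 0;        // 0 means rows are tightly packed
region.bufferImageHeight = 0;
region.imageSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
region.imageSubresource.mipLevel = 0;
region.imageSubresource.baseArrayLayer = 0;
region.imageSubresource.layerCount = 1;
region.imageOffset = {0, 0, 0};
region.imageExtent = {width, height, 1};

vkCmdCopyImageToBuffer(commandBuffer, sourceImage,
                       VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                       stagingBuffer, 1, &region);

The mirror-image call, vkCmdCopyBufferToImage, is recorded on the target device once the CPU has moved the bytes between the two staging buffers.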

In D3D12, can the render target view be any buffer?

In the samples I have looked at so far some of the commands are something like:
1. D3D12_DESCRIPTOR_HEAP_DESC with D3D12_DESCRIPTOR_HEAP_TYPE::D3D12_DESCRIPTOR_HEAP_TYPE_RTV
2. ID3D12Device::CreateDescriptorHeap
3. D3D12_CPU_DESCRIPTOR_HANDLE with ID3D12DescriptorHeap::GetCPUDescriptorHandleForHeapStart (and maybe ID3D12Device::GetDescriptorHandleIncrementSize)
4. (Optional?) ID3D12Device::CreateRenderTargetView
4.1. (Optional?) IDXGISwapChain3::GetBuffer
5. Schedule rendering work on a command list, notably OMSetRenderTargets and DrawInstanced, and then close the command list.
6. ID3D12CommandQueue::ExecuteCommandLists
6.1. (Optional?) IDXGISwapChain3::Present
7. Schedule a signal with ID3D12CommandQueue::Signal on a fence
8. Wait for the GPU to finish with ID3D12Fence::SetEventOnCompletion and WaitForSingleObjectEx
If possible, how can step 4.1 be replaced with a buffer of choice? I.e., how can one create an ID3D12Resource*, render to it, and then read from it into, say, a std::vector? (I assume that if this is possible, step 6.1 can be ignored, since there is no swap-chain render target view to present. Perhaps step 4 is unnecessary as well in this case? Maybe only OMSetRenderTargets matters?)
It depends on the video memory architecture as to where exactly the render target can be located. On some systems, it's in dedicated video memory that only the video card can access. In some systems it's in video memory shared across the bus that both CPU and GPU can access. In unified memory architectures, everything is in system memory.
Therefore, you have restrictions on where exactly the render target can be located. This is why you have to use D3D12_HEAP_TYPE_DEFAULT and specify D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET when creating the ID3D12Resource you plan to bind as a render target (this is implicitly done by DXGI when you create the swap chain render target as well).
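A minimal sketch of creating such a texture yourself (the size, format and the rtvHandle descriptor are illustrative; error handling omitted):

// A committed texture in the DEFAULT heap that can be bound as a render target.
D3D12_HEAP_PROPERTIES heapProps = {};
heapProps.Type = D3D12_HEAP_TYPE_DEFAULT;            // GPU-local memory

D3D12_RESOURCE_DESC desc = {};
desc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
desc.Width = 1280;
desc.Height = 720;
desc.DepthOrArraySize = 1;
desc.MipLevels = 1;
desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
desc.SampleDesc.Count = 1;
desc.Layout = D3D12_TEXTURE_LAYOUT_UNKNOWN;
desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET;

D3D12_CLEAR_VALUE clearValue = {};
clearValue.Format = desc.Format;                     // optimized clear color (left at zero)

ID3D12Resource* renderTarget = nullptr;
device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &desc,
                                D3D12_RESOURCE_STATE_RENDER_TARGET, &clearValue,
                                IID_PPV_ARGS(&renderTarget));

// The RTV for OMSetRenderTargets is then created the usual way.
device->CreateRenderTargetView(renderTarget, nullptr, rtvHandle);

Reading the result back on the CPU then goes through a separate D3D12_HEAP_TYPE_READBACK buffer and a CopyTextureRegion on the command list, since the DEFAULT-heap texture itself is not mappable.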
Generally speaking you can't and shouldn't use the low-level DXGI surface creation APIs to create Direct3D resources. They mostly exist for system use rather than by applications.
Unless you happen to be on a UMA system, you should minimize CPU access to the render target, as it will require expensive copies otherwise. Even on UMA systems, de-tiling is also required to get the results into a linear form.
Direct3D 12 also offers the "placed resource" methods, which can provide more control over where exactly the memory is allocated (or, more specifically, where the virtual memory addresses are allocated), but you still have to abide by the underlying architecture's limitations. Depending on the memory architecture, you can "alias" multiple different ID3D12Resource instances all using the same memory (such as a render target being aliased as an unordered access resource), but you are responsible for inserting the required resource barriers into the command list (and testing it) to make sure it works reliably on all DX12 hardware. See MSDN.
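When two placed resources do share memory like that, the switch between them is announced with an aliasing barrier. A minimal sketch, assuming renderTarget and uavResource were created with CreatePlacedResource over overlapping ranges of the same heap:

// Tell the GPU the render target is done and the aliased UAV resource is
// about to become the active occupant of that memory.
D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_ALIASING;
barrier.Aliasing.pResourceBefore = renderTarget;
barrier.Aliasing.pResourceAfter = uavResource;
commandList->ResourceBarrier(1, &barrier);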
You do not have to Present your render target if you don't need the user to see the result.
Memory Management Strategies
UMA Optimizations: CPU Accessible Textures and Standard Swizzle
Getting Started with Direct3D 12
If you are new to DirectX 12, you should see the DirectX Tool Kit for DirectX 12 tutorials. If you aren't already familiar with DirectX 11, you should wait on DirectX 12 and start with DirectX Tool Kit for DirectX 11.

The fastest way to copy to and from a Texture2D

For some reason I have to copy from a texture to a buffer and then later load it back into a texture.
The source texture is the one coming from the decoder; the target texture is the one which will be rendered. The easiest way to do that (as I understand it) is the following:
Decoder texture (ID3D11Texture2D)
Use a temp texture (Usage = D3D11_USAGE_STAGING)
CopyResource to the temp texture
Map
memcpy_s to buffer
Unmap
On the other side it goes backwards:
Use temp texture (Usage = D3D11_USAGE_STAGING)
Map
memcpy_s from buffer
Unmap
CopyResource to renderer texture
It works fine; however, I have a feeling I'm not doing it as efficiently as possible (aside from the fact that I'm copying data back and forth).
Do I have to use staging textures? Can I tweak the decoder/renderer texture flags (BindFlags?) or the D3D11_MAP value passed to Map to skip the copy through the staging texture?
EDIT001:
Ok, here goes the case, with technical details. There is a decoder; essentially it is the Intel Media SDK decoder, which decodes (pun intended) data provided from outside the decoding class. It receives a buffer, does its magic (asynchronously) and returns (by means of SyncOperation, if I recall the method name right) a surface, which is actually, under the hood, a DX texture managed by the Intel allocator. I receive and copy the texture synchronously, but I guess with a little effort I could do it asynchronously. The surface originates from a pool, so working with the texture does not stop the decoder from carrying on with its work. The copied data resides in a struct which is kept in a ring buffer, from which the video renderer is fed. That's it; to my (admittedly limited) understanding, there is no harm to GPU parallelism.
If you have to read and write back, there is no fast way; you break GPU/CPU parallelism by forcing a sync point, and you will create many idle bubbles on both the CPU and the GPU.
Only the staging pool is accessible to the CPU, so yes, the temporary resources for the back-and-forth are necessary.
For performance, you should consider:
Adapting your technique to be GPU-only.
If reading back is the only way, limiting it to the portions that are dirty or necessary.
Working with a couple of textures across frames, so the CPU works on a version that is a few frames behind, to protect parallelism (see the sketch below).
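A minimal sketch of that last idea, assuming a small ring of staging textures (created with D3D11_USAGE_STAGING and D3D11_CPU_ACCESS_READ) so that the Map call reads a copy requested a couple of frames earlier while the GPU keeps working; all names are illustrative:

static const int kLatency = 3;            // frames between issuing the copy and reading it
ID3D11Texture2D* staging[kLatency];       // pre-created staging textures
UINT frame = 0;

void OnDecodedFrame(ID3D11DeviceContext* ctx, ID3D11Texture2D* decodedTex)
{
    // Kick off this frame's copy into one slot of the ring...
    ctx->CopyResource(staging[frame % kLatency], decodedTex);

    // ...and read back the copy that was issued a couple of frames ago.
    if (frame >= kLatency - 1)
    {
        UINT readSlot = (frame + 1) % kLatency;   // oldest outstanding copy
        D3D11_MAPPED_SUBRESOURCE mapped = {};
        if (SUCCEEDED(ctx->Map(staging[readSlot], 0, D3D11_MAP_READ, 0, &mapped)))
        {
            // Consume mapped.pData here (rows are mapped.RowPitch bytes apart).
            ctx->Unmap(staging[readSlot], 0);
        }
    }
    ++frame;
}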

CUDA cube map textures

How to deal with OpenGL cube map textures in CUDA?
When one wants to use OpenGL textures in a CUDA kernel, one of the things to do is to retrieve a CUDA array from the registered and mapped resource, in this case a texture. In the driver API this is done with a cuGraphicsSubResourceGetMappedArray call, which in the case of a 2D texture is not a problem. But when talking about the aforementioned cube map, the third parameter of this function requires a face enum (like CU_CUBEMAP_FACE_POSITIVE_X). Thus some questions arise: when one passes such an enum, will the returned texture array contain only data of that particular face? Then how does one use the cube texture as a whole, to perform cube mapping, like so:
color = texCube(cubeMap, x, y, z);
Or is it impossible to do so in a CUDA kernel, and does one need to use 2D textures with proper calculations and sampling in user code?
OK, I managed to solve the problem myself, though the solution isn't as simple as calling another CUDA function.
To bind a CUDA texture reference to any texture, be it one obtained from OpenGL or D3D, one has to provide a CUDA array that is mapped to the resource, using cuGraphicsSubResourceGetMappedArray to retrieve it. As I mentioned in the question, it is simple in the case of a one- or two-dimensional texture. But with the other available types it is more complicated.
At any time we need a CUDA array that the reference is bound to. The same goes for the cube map texture, but in that case the array has to be a 3D one. The problem is that the CUDA driver API provides only the aforementioned function to retrieve a single layer from such a texture resource and map it to a single two-dimensional array. To get what we want, we have to build the 3D array containing all the layers (or faces, in the case of a cube map) ourselves.
First of all, we have to get the arrays for each layer/face using the above function. The next step is to create the 3D array with a call to cuArray3DCreate, fed with the proper set of parameters (size/number of layers, level of detail, data format, number of channels per texel and some flags). Then we have to copy the layers' arrays into the 3D one with a series of calls to cuMemcpy3D, one for each layer/face array.
Finally, we set our target CUDA texture reference with cuTexRefSetArray, fed with the 3D array we created and copied into. Inside the device code we declare a reference with the proper texture type and mode (float4 and cube map) and sample it with texCubemap.
Below is a fragment of the function which does all that, available in full length in the CIRT repository (cirt_server.c file, function cirtTexImage3D).
//...
if (result)
{
    // Create a 3D array...
    CUDA_ARRAY3D_DESCRIPTOR layeredTextureDescr;
    layeredTextureDescr.Width = w;
    layeredTextureDescr.Height = h;
    layeredTextureDescr.Depth = d;
    layeredTextureDescr.Format = map_type_to_format(type);
    layeredTextureDescr.NumChannels = format == CIRT_RGB ? CIRT_RGBA : format;
    layeredTextureDescr.Flags = map_target_to_flags(target);

    if (result) result = LogCUDADriverCall(cuArray3DCreate(&hTexRefArray, &layeredTextureDescr),
        FUN_NAME(": cuArray3DCreate_tex3D"), __FILE_LINE__);

    // Copy the acquired layer/face arrays into the collective 3D one...
    CUDA_MEMCPY3D layerCopyDescr;
    layerCopyDescr.srcMemoryType = CU_MEMORYTYPE_ARRAY;
    layerCopyDescr.srcXInBytes = 0;
    layerCopyDescr.srcZ = 0;
    layerCopyDescr.srcY = 0;
    layerCopyDescr.srcLOD = 0;
    layerCopyDescr.dstMemoryType = CU_MEMORYTYPE_ARRAY;
    layerCopyDescr.dstLOD = 0;
    layerCopyDescr.WidthInBytes = layeredTextureDescr.NumChannels * w;
    layerCopyDescr.Height = h;
    layerCopyDescr.Depth = target == CIRT_TEXTURE_CUBE_MAP ? 1 : d;
    layerCopyDescr.dstArray = hTexRefArray;

    for (i = 0; i < num_layers; ++i)
    {
        layer = ((num_layers == 6) ? CU_CUBEMAP_FACE_POSITIVE_X + i : i);
        layerCopyDescr.dstXInBytes = 0;
        layerCopyDescr.dstY = 0;
        layerCopyDescr.dstZ = i;
        layerCopyDescr.srcArray = hLayres[i];
        if (result) result = LogCUDADriverCall(cuMemcpy3D(&layerCopyDescr),
            FUN_NAME(": cuMemcpy3D_tex3D"), __FILE_LINE__);
    }

    // Finally bind the 3D array with the texture reference...
    if (result) LogCUDADriverCall(cuTexRefSetArray(hTexRef, hTexRefArray, CU_TRSA_OVERRIDE_FORMAT),
        FUN_NAME(": cuTexRefSetArray_tex3D"), __FILE_LINE__);

    if (hLayres)
        free(hLayres);
    if (result)
        current->m_oTextureManager.m_cuTextureRes[current->m_oTextureManager.m_nTexCount++] = hTexResource;
}
//...
I've checked it with cube maps only for now, but it should work just fine with 3D textures as well.
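For completeness, a minimal sketch of the device-side part mentioned above, using the legacy texture-reference API that matches the driver-API binding in the snippet (the kernel and variable names are illustrative):

// Module-scope cubemap texture reference; the host binds it with
// cuTexRefSetArray (after looking it up with cuModuleGetTexRef) to the
// cubemap-flagged 3D array built above.
texture<float4, cudaTextureTypeCubemap, cudaReadModeElementType> cubeTexRef;

__global__ void shadeWithCubeMap(float4* out, const float3* dirs, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float3 d = dirs[i];
        // texCubemap selects the face and the 2D coordinates from the direction vector.
        out[i] = texCubemap(cubeTexRef, d.x, d.y, d.z);
    }
}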
I'm not really familiar with CUDA directly, but I do have some experience with OpenGL and DirectX, and I am familiar with 3D graphics rendering APIs, libraries and pipelines, and with setting them up and using them.
When I look at your question(s):
How to deal with OpenGL cube map textures in CUDA?
And you proceed to explain it by this:
When one wants to use OpenGL textures in a CUDA kernel, one of the things to do is to retrieve a CUDA array from the registered and mapped resource, in this case a texture. In the driver API this is done with a cuGraphicsSubResourceGetMappedArray call, which in the case of a 2D texture is not a problem. But when talking about the aforementioned cube map, the third parameter of this function requires a face enum (like CU_CUBEMAP_FACE_POSITIVE_X). Thus some questions arise: when one passes such an enum, will the returned texture array contain only data of that particular face? Then how does one use the cube texture as a whole, to perform cube mapping, like so:
color = texCube(cubeMap, x, y, z);
Or is it impossible to do so in a CUDA kernel, and does one need to use 2D textures with proper calculations and sampling in user code?
I went to CUDA's website for their API SDK and programming documentation and found the function in question, cuGraphicsSubResourceGetMappedArray():
CUresult cuGraphicsSubResourceGetMappedArray ( CUarray* pArray,
CUgraphicsResource resource,
unsigned int arrayIndex,
unsigned int mipLevel )
Get an array through which to access a subresource of a mapped graphics resource.
Parameters
pArray - Returned array through which a subresource of resource may be accessed
resource - Mapped resource to access
arrayIndex - Array index for array textures or cubemap face index as defined by CUarray_cubemap_face for cubemap textures for the subresource to access
mipLevel - Mipmap level for the subresource to access
Returns
CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED,
CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_VALUE,
CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_NOT_MAPPED,
CUDA_ERROR_NOT_MAPPED_AS_ARRAY
Description
Returns in *pArray an array through which the subresource of the mapped graphics resource resource which corresponds to array index arrayIndex and mipmap level mipLevel may be accessed. The value set in *pArray may change every time that resource is mapped.
If resource is not a texture then it cannot be accessed via an array and CUDA_ERROR_NOT_MAPPED_AS_ARRAY is returned. If arrayIndex is not a valid array index for resource then CUDA_ERROR_INVALID_VALUE is returned. If mipLevel is not a valid mipmap level for resource then CUDA_ERROR_INVALID_VALUE is returned. If resource is not mapped then CUDA_ERROR_NOT_MAPPED is returned.
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cuGraphicsResourceGetMappedPointer
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4ic22V4Dz
This function is found in NVIDIA CUDA's driver API, not in the runtime API. When working with CUDA-capable hardware, it helps to understand that there is a difference between the host and the device programmable pipelines, which is described here: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#axzz4ic6tFjXR
2. Heterogeneous Computing
CUDA programming involves running code on two different platforms concurrently: a host system with one or more CPUs and one or more CUDA-enabled NVIDIA GPU devices.
While NVIDIA GPUs are frequently associated with graphics, they are also powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can leverage parallel execution.
However, the device is based on a distinctly different design from the host system, and it's important to understand those differences and how they determine the performance of CUDA applications in order to use CUDA effectively.
2.1. Differences between Host and Device
The primary differences are in threading model and in separate physical memories:
Threading resources - Execution pipelines on host systems can support a limited number of concurrent threads. Servers that have four hex-core processors today can run only 24 threads concurrently (or 48 if the CPUs support Hyper-Threading.) By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (see Features and Specifications of the CUDA C Programming Guide) On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.
Threads - Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off CPU execution channels to provide multithreading capability. Context switches (when two threads are swapped) are therefore slow and expensive. By comparison, threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work (in warps of 32 threads each). If the GPU must wait on one warp of threads, it simply begins executing work on another. Because separate registers are allocated to all active threads, no swapping of registers or other state need occur when switching among GPU threads. Resources stay allocated to each thread until it completes its execution. In short, CPU cores are designed to minimize latency for one or two threads at a time each, whereas GPUs are designed to handle a large number of concurrent, lightweight threads in order to maximize throughput.
RAM - The host system and the device each have their own distinct attached physical memories. As the host and device memories are separated by the PCI Express (PCIe) bus, items in the host memory must occasionally be communicated across the bus to the device memory or vice versa as described in What Runs on a CUDA-Enabled Device?
These are the primary hardware differences between CPU hosts and GPU devices with respect to parallel programming. Other differences are discussed as they arise elsewhere in this document. Applications composed with these differences in mind can treat the host and device together as a cohesive heterogeneous system wherein each processing unit is leveraged to do the kind of work it does best: sequential work on the host and parallel work on the device.
Read more at: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#ixzz4ic8ch2fq
Now, knowing that there are two different API libraries for CUDA, we have to understand the difference between the two, described here: Difference between the driver and runtime APIs
1. Difference between the driver and runtime APIs
The driver and runtime APIs are very similar and can for the most part be used interchangeably. However, there are some key differences worth noting between the two.
Complexity vs. control
The runtime API eases device code management by providing implicit initialization, context management, and module management. This leads to simpler code, but it also lacks the level of control that the driver API has.
In comparison, the driver API offers more fine-grained control, especially over contexts and module loading. Kernel launches are much more complex to implement, as the execution configuration and kernel parameters must be specified with explicit function calls. However, unlike the runtime, where all the kernels are automatically loaded during initialization and stay loaded for as long as the program runs, with the driver API it is possible to only keep the modules that are currently needed loaded, or even dynamically reload modules. The driver API is also language-independent as it only deals with cubin objects.
Context management
Context management can be done through the driver API, but is not exposed in the runtime API. Instead, the runtime API decides itself which context to use for a thread: if a context has been made current to the calling thread through the driver API, the runtime will use that, but if there is no such context, it uses a "primary context." Primary contexts are created as needed, one per device per process, are reference-counted, and are then destroyed when there are no more references to them. Within one process, all users of the runtime API will share the primary context, unless a context has been made current to each thread. The context that the runtime uses, i.e, either the current context or primary context, can be synchronized with cudaDeviceSynchronize(), and destroyed with cudaDeviceReset().
Using the runtime API with primary contexts has its tradeoffs, however. It can cause trouble for users writing plug-ins for larger software packages, for example, because if all plug-ins run in the same process, they will all share a context but will likely have no way to communicate with each other. So, if one of them calls cudaDeviceReset() after finishing all its CUDA work, the other plug-ins will fail because the context they were using was destroyed without their knowledge. To avoid this issue, CUDA clients can use the driver API to create and set the current context, and then use the runtime API to work with it. However, contexts may consume significant resources, such as device memory, extra host threads, and performance costs of context switching on the device. This runtime-driver context sharing is important when using the driver API in conjunction with libraries built on the runtime API, such as cuBLAS or cuFFT.
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4icCoAXb7
Since this function happens to live in the driver API, it gives the programmer more flexibility and control, but also requires more responsibility to manage, whereas the runtime API library does more things automatically but gives you less control.
This is apparent since you mentioned that you are working with kernels. Looking at the description of the function:
CUresult cuGraphicsSubResourceGetMappedArray ( CUarray* pArray,
CUgraphicsResource resource,
unsigned int arrayIndex,
unsigned int mipLevel )
The documentation tells me that the first parameter this function takes is the returned array through which a subresource of resource may be accessed. The second parameter is the mapped graphics resource itself. The third parameter, I believe, is the one you had in question: it is an enumerated value for a face, and you asked whether, when one passes such an enum, the returned texture array will contain only data of that particular face. From what I gather and understand from the documentation, it is an index value into an array of your cube map resource.
Which can be seen from their documentation:
arrayIndex - Array index for array textures or cubemap face index as defined by CUarray_cubemap_face for cubemap textures for the subresource to access
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4icHnwe9v
which happens to be an unsigned int, an index into the textures that make up that cube map. A typical cube map will have 6 faces, or at most 12 if both the inside and the outside of the cube are mapped. So if we look at a cube map, and at textures and their relationship, in pseudocode, we can see that:
// Texture
struct Texture {
    unsigned pixelsWidth;
    unsigned pixelsHeight;
    // Other Texture member variables or fields here.
};

// Only interested in the actual size of the texture `width by height`
// where these would be used to map this texture to one of the 6 faces
// of a cube:
struct CubeMap {
    Texture face[6];
    // face[0] = frontFace
    // face[1] = backFace
    // face[2] = leftFace
    // face[3] = rightFace
    // face[4] = topFace
    // face[5] = bottomFace
};
The CubeMap object has an array of textures that makes up its faces, and according to the documentation, the third parameter of the function in question asks for an index into this texture array; the overall function will then return this:
Returns in *pArray an array through which the subresource of the mapped graphics resource resource which corresponds to array index arrayIndex and mipmap level mipLevel may be accessed. The value set in *pArray may change every time that resource is mapped.
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4icKF1c00
I hope this helps to answer your question regarding the use of the third parameter of the function you are trying to use from their API.
Edit
The OP asked whether passing the enum CU_CUBEMAP_FACE_POSITIVE_X as the third parameter of the above function call will return only that face of the cube map, which happens to be a texture. Looking at their documentation about this enumerated value or type, found here: enum CUarray_cubemap_face
enum CUarray_cubemap_face - Array indices for cube faces
Values
    CU_CUBEMAP_FACE_POSITIVE_X = 0x00    Positive X face of cubemap
    CU_CUBEMAP_FACE_NEGATIVE_X = 0x01    Negative X face of cubemap
    CU_CUBEMAP_FACE_POSITIVE_Y = 0x02    Positive Y face of cubemap
    CU_CUBEMAP_FACE_NEGATIVE_Y = 0x03    Negative Y face of cubemap
    CU_CUBEMAP_FACE_POSITIVE_Z = 0x04    Positive Z face of cubemap
    CU_CUBEMAP_FACE_NEGATIVE_Z = 0x05    Negative Z face of cubemap
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4idOT67US
It appears to me that when using this method to query or get texture information stored in a cube map's array, the requirement that the third parameter be this enumerated value is nothing more than a zero-based index into that array. So passing in CU_CUBEMAP_FACE_POSITIVE_X as the third parameter doesn't necessarily mean, to me, that you will only get back that particular face's texture; it appears to me that, since this is the 0th index, it will return the entire array of textures, in the old C style of passing around arrays as if they were pointers.