Vulkan RT BLAS build sometimes fails. No validation layer message - C++

I have a problem when creating a BLAS in Vulkan with ray tracing. Often, though not always, when I submit a command buffer containing vkCmdBuildAccelerationStructuresKHR on the compute queue, the VkDevice ends up in VK_ERROR_DEVICE_LOST. vkQueueSubmit returns VK_SUCCESS, but when I then wait for the submitted work to finish, vkDeviceWaitIdle returns VK_ERROR_DEVICE_LOST. All the buffers involved are allocated without errors, and their device addresses can be obtained. I use the VMA (Vulkan Memory Allocator) library to manage allocations. The buffers were created with VK_SHARING_MODE_EXCLUSIVE, but they are only used in the command buffer of the compute queue. The real problem is that the validation layers do not report any error messages.
The code for creating the vertex buffer:
VkBufferCreateInfo vkVertexBufferCreateInfo{};
vkVertexBufferCreateInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
vkVertexBufferCreateInfo.size = vertexSize;
vkVertexBufferCreateInfo.usage = VK_BUFFER_USAGE_ACCELERATION_STRUCTURE_BUILD_INPUT_READ_ONLY_BIT_KHR
| VK_BUFFER_USAGE_TRANSFER_DST_BIT | VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT
| VK_BUFFER_USAGE_VERTEX_BUFFER_BIT | VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
vkVertexBufferCreateInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
vkVertexBufferCreateInfo.pQueueFamilyIndices = nullptr;
vkVertexBufferCreateInfo.queueFamilyIndexCount = 0;
VmaAllocationCreateInfo vmaVertexBufferAllocationCreateInfo{};
vmaVertexBufferAllocationCreateInfo.flags = VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT;
vmaVertexBufferAllocationCreateInfo.requiredFlags = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;
The code for creating the index buffer:
VkBufferCreateInfo vkIndexBufferCreateInfo{};
vkIndexBufferCreateInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
vkIndexBufferCreateInfo.size = faceSize;
vkIndexBufferCreateInfo.usage = VK_BUFFER_USAGE_ACCELERATION_STRUCTURE_BUILD_INPUT_READ_ONLY_BIT_KHR
| VK_BUFFER_USAGE_TRANSFER_DST_BIT | VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT
| VK_BUFFER_USAGE_INDEX_BUFFER_BIT | VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
vkIndexBufferCreateInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
vkIndexBufferCreateInfo.pQueueFamilyIndices = nullptr;
vkIndexBufferCreateInfo.queueFamilyIndexCount = 0;
VmaAllocationCreateInfo vmaIndexBufferAllocationCreateInfo = {};
vmaIndexBufferAllocationCreateInfo.flags = VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT;
vmaIndexBufferAllocationCreateInfo.requiredFlags = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;
The code to create the struct for the geometry info:
// Query the 64-bit vertex/index buffer device address value through which buffer memory
// can be accessed in a shader
std::optional<VkDeviceAddress> vertexBufferAddress = geometry.getVertexBuffer().getBufferDeviceAddress();
if (vertexBufferAddress.has_value() == false)
{
OV_LOG_ERROR("Failed to retrieve the device address of the vertex buffer. The geometry is probably not uploaded.");
return false;
}
std::optional<VkDeviceAddress> faceBufferAddress = geometry.getFaceBuffer().getBufferDeviceAddress();
if (faceBufferAddress.has_value() == false)
{
OV_LOG_ERROR("Failed to retrieve the device address of the face buffer. The geometry is probably not uploaded.");
return false;
}
VkDeviceOrHostAddressConstKHR vertexDeviceOrHostAddressConst = {};
vertexDeviceOrHostAddressConst.deviceAddress = vertexBufferAddress.value();
VkDeviceOrHostAddressConstKHR faceDeviceOrHostAddressConst = {};
faceDeviceOrHostAddressConst.deviceAddress = faceBufferAddress.value();
// Structure specifying a triangle geometry in a bottom-level acceleration structure
VkAccelerationStructureGeometryTrianglesDataKHR accelerationStructureGeometryTrianglesData = {};
accelerationStructureGeometryTrianglesData.sType =
VK_STRUCTURE_TYPE_ACCELERATION_STRUCTURE_GEOMETRY_TRIANGLES_DATA_KHR;
accelerationStructureGeometryTrianglesData.pNext = NULL;
// Vertex glm::vec3
accelerationStructureGeometryTrianglesData.vertexFormat = VK_FORMAT_R32G32B32_SFLOAT;
accelerationStructureGeometryTrianglesData.vertexData = vertexDeviceOrHostAddressConst;
// sizeof(float) * 3 => vertex
// sizeof(uint32_t) * 3 => normal / uv / tangent
accelerationStructureGeometryTrianglesData.vertexStride = sizeof(Ov::Geometry::VertexData);
// # vertices = vertex buffer size bytes / vertex stride
accelerationStructureGeometryTrianglesData.maxVertex = geometry.getNrOfVertices();
accelerationStructureGeometryTrianglesData.indexType = VK_INDEX_TYPE_UINT32;
accelerationStructureGeometryTrianglesData.indexData = faceDeviceOrHostAddressConst;
// transformData is a device or host address to memory containing an optional reference to
// a VkTransformMatrixKHR structure
accelerationStructureGeometryTrianglesData.transformData = transformData;
// Union specifying acceleration structure geometry data
VkAccelerationStructureGeometryDataKHR accelerationStructureGeometryData = {};
accelerationStructureGeometryData.triangles = accelerationStructureGeometryTrianglesData;
// Structure specifying geometries to be built into an acceleration structure
VkAccelerationStructureGeometryKHR& accelerationStructureGeometry = reserved->geometriesAS.emplace_back();
accelerationStructureGeometry.sType = VK_STRUCTURE_TYPE_ACCELERATION_STRUCTURE_GEOMETRY_KHR;
accelerationStructureGeometry.pNext = NULL;
accelerationStructureGeometry.geometryType = VK_GEOMETRY_TYPE_TRIANGLES_KHR;
accelerationStructureGeometry.geometry = accelerationStructureGeometryData;
accelerationStructureGeometry.flags = geometryFlags;
// Structure specifying build offsets and counts for acceleration structure builds
VkAccelerationStructureBuildRangeInfoKHR& accelerationStructureBuildRangeInfoKHR = reserved->geometriesBuildRangeAS.emplace_back();
// primitiveCount defines the number of primitives for a corresponding acceleration structure geometry.
accelerationStructureBuildRangeInfoKHR.primitiveCount = geometry.getNrOfFaces();
accelerationStructureBuildRangeInfoKHR.primitiveOffset = 0;
accelerationStructureBuildRangeInfoKHR.firstVertex = 0;
accelerationStructureBuildRangeInfoKHR.transformOffset = 0;
Here is the code for building the BLAS:
// Structure specifying the geometry data used to build an acceleration structure.
reserved->accelerationStructureBuildGeometryInfo.sType =
VK_STRUCTURE_TYPE_ACCELERATION_STRUCTURE_BUILD_GEOMETRY_INFO_KHR;
reserved->accelerationStructureBuildGeometryInfo.pNext = NULL;
reserved->accelerationStructureBuildGeometryInfo.type = type;
reserved->accelerationStructureBuildGeometryInfo.flags = flags;
// VK_BUILD_ACCELERATION_STRUCTURE_MODE_BUILD_KHR => specifies that the destination acceleration
// structure will be built using the specified geometries.
reserved->accelerationStructureBuildGeometryInfo.mode = VK_BUILD_ACCELERATION_STRUCTURE_MODE_BUILD_KHR;
reserved->accelerationStructureBuildGeometryInfo.srcAccelerationStructure = VK_NULL_HANDLE;
reserved->accelerationStructureBuildGeometryInfo.dstAccelerationStructure = VK_NULL_HANDLE;
reserved->accelerationStructureBuildGeometryInfo.geometryCount = nrOfgeometriesStructuresAS;
// The index of each element of the pGeometries or ppGeometries members of VkAccelerationStructureBuildGeometryInfoKHR
// is used as the geometry index during ray traversal. The geometry index is available in ray shaders via the
// RayGeometryIndexKHR built-in, and is used to determine the hit and intersection shaders executed
// during traversal. The geometry index is available to ray queries via the OpRayQueryGetIntersectionGeometryIndexKHR instruction.
reserved->accelerationStructureBuildGeometryInfo.pGeometries = geometriesStructuresAS.data();
reserved->accelerationStructureBuildGeometryInfo.ppGeometries = NULL;
reserved->accelerationStructureBuildGeometryInfo.scratchData = {};
// Structure specifying build sizes for an acceleration structure
VkAccelerationStructureBuildSizesInfoKHR accelerationStructureBuildSizesInfo = {};
accelerationStructureBuildSizesInfo.sType = VK_STRUCTURE_TYPE_ACCELERATION_STRUCTURE_BUILD_SIZES_INFO_KHR;
accelerationStructureBuildSizesInfo.pNext = NULL;
accelerationStructureBuildSizesInfo.accelerationStructureSize = 0;
accelerationStructureBuildSizesInfo.updateScratchSize = 0;
accelerationStructureBuildSizesInfo.buildScratchSize = 0;
// Retrieve the required size for an acceleration structure
// VK_ACCELERATION_STRUCTURE_BUILD_TYPE_DEVICE_KHR => requests the memory requirement for operations
// performed by the device.
PFN_vkGetAccelerationStructureBuildSizesKHR pvkGetAccelerationStructureBuildSizesKHR =
(PFN_vkGetAccelerationStructureBuildSizesKHR)vkGetDeviceProcAddr(logicalDevice.get().getVkDevice(), "vkGetAccelerationStructureBuildSizesKHR");
pvkGetAccelerationStructureBuildSizesKHR(logicalDevice.get().getVkDevice(),
VK_ACCELERATION_STRUCTURE_BUILD_TYPE_HOST_KHR,
&reserved->accelerationStructureBuildGeometryInfo,
&reserved->accelerationStructureBuildGeometryInfo.geometryCount,
&accelerationStructureBuildSizesInfo);
////////////////////
// Scratch buffer //
////////////////////
#pragma region ScratchBuffer
///////////////////////////
// Create scratch buffer //
///////////////////////////
// Create info buffer
VkBufferCreateInfo vkScratchBufferCreateInfo{};
vkScratchBufferCreateInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
// The second field of the struct is size, which specifies the size of the buffer in bytes
vkScratchBufferCreateInfo.size = accelerationStructureBuildSizesInfo.buildScratchSize;
// The third field is usage, which indicates for which purposes the data in the buffer
// is going to be used. It is possible to specify multiple purposes using a bitwise or.
// VK_BUFFER_USAGE_ACCELERATION_STRUCTURE_BUILD_INPUT_READ_ONLY_BIT_KHR => specifies that the buffer is suitable for
// use as a read-only input to an acceleration structure build.
// VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT => specifies that the buffer can be used to retrieve a buffer device address
// via vkGetBufferDeviceAddress and use that address to access the buffer’s memory from a shader.
vkScratchBufferCreateInfo.usage = VK_BUFFER_USAGE_ACCELERATION_STRUCTURE_BUILD_INPUT_READ_ONLY_BIT_KHR
| VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT | VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
// Buffers can also be owned by a specific queue family or be shared between multiple
// families at the same time.
// VK_SHARING_MODE_EXCLUSIVE specifies that access to any range of the buffer will be
// exclusive to a single queue family at a time.
vkScratchBufferCreateInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
// From which queue family the buffer will be accessed.
vkScratchBufferCreateInfo.pQueueFamilyIndices = NULL;
vkScratchBufferCreateInfo.queueFamilyIndexCount = 0;
// Create allocation info
VmaAllocationCreateInfo vmaScratchBufferAllocationCreateInfo = {};
// VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT => Allocation strategy that chooses the smallest
// possible free range for the allocation, to minimize memory usage and fragmentation, possibly
// at the expense of allocation time.
vmaScratchBufferAllocationCreateInfo.flags = VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT;
// Flags that must be set in a Memory Type chosen for an allocation.
vmaScratchBufferAllocationCreateInfo.requiredFlags = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;
if (!reserved->scratchBuffer.create(vkScratchBufferCreateInfo, vmaScratchBufferAllocationCreateInfo))
{
OV_LOG_ERROR("Failed to create the scratch buffer for the BLAS.");
this->free();
return false;
}
////////////////////////
// Set scratch buffer //
////////////////////////
std::optional<VkDeviceAddress> deviceAddress = reserved->scratchBuffer.getBufferDeviceAddress();
if (deviceAddress.has_value() == false)
{
OV_LOG_ERROR("Fail to retrieve the scratch buffer device address.");
this->free();
return false;
}
VkDeviceOrHostAddressKHR scratchDeviceOrHostAddress = {};
scratchDeviceOrHostAddress.deviceAddress = deviceAddress.value();
// ScratchData is the device or host address to memory that will be used as scratch memory for the build.
reserved->accelerationStructureBuildGeometryInfo.scratchData = scratchDeviceOrHostAddress;
#pragma endregion
/////////////////
// BLAS buffer //
/////////////////
#pragma region BLASBuffer
// Create BLASBuffer
// Create info buffer
VkBufferCreateInfo vkBLASBufferCreateInfo{};
vkBLASBufferCreateInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
// The second field of the struct is size, which specifies the size of the buffer in bytes
vkBLASBufferCreateInfo.size = accelerationStructureBuildSizesInfo.accelerationStructureSize;
// The third field is usage, which indicates for which purposes the data in the buffer
// is going to be used. It is possible to specify multiple purposes using a bitwise or.
// VK_BUFFER_USAGE_ACCELERATION_STRUCTURE_STORAGE_BIT_KHR specifies that the buffer is suitable
// for storage space for a VkAccelerationStructureKHR.
vkBLASBufferCreateInfo.usage = VK_BUFFER_USAGE_ACCELERATION_STRUCTURE_STORAGE_BIT_KHR |
VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT;
// Buffers can also be owned by a specific queue family or be shared between multiple
// families at the same time.
// VK_SHARING_MODE_EXCLUSIVE specifies that access to any range of the buffer will be
// exclusive to a single queue family at a time.
vkBLASBufferCreateInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
// From which queue family the buffer will be accessed.
vkBLASBufferCreateInfo.pQueueFamilyIndices = NULL;
vkBLASBufferCreateInfo.queueFamilyIndexCount = 0;
// Create allocation info
VmaAllocationCreateInfo vmaBLASBufferAllocationCreateInfo = {};
// VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT => Allocation strategy that chooses smallest possible free range for the allocation
// to minimize memory usage and fragmentation, possibly at the expense of allocation time.
vmaBLASBufferAllocationCreateInfo.flags = VMA_ALLOCATION_CREATE_STRATEGY_MIN_MEMORY_BIT;
// Flags that must be set in a Memory Type chosen for an allocation.
vmaBLASBufferAllocationCreateInfo.requiredFlags = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;
if (!reserved->accelerationStructureBuffer.create(vkBLASBufferCreateInfo, vmaBLASBufferAllocationCreateInfo))
{
OV_LOG_ERROR("Failed to create the acceleration structure buffer for the BLAS.");
this->free();
return false;
}
#pragma endregion
//////////////
// Build AS //
//////////////
#pragma region BuildAS
// Structure specifying the parameters of a newly created acceleration structure object
VkAccelerationStructureCreateInfoKHR accelerationStructureCreateInfo = {};
accelerationStructureCreateInfo.sType = VK_STRUCTURE_TYPE_ACCELERATION_STRUCTURE_CREATE_INFO_KHR;
accelerationStructureCreateInfo.pNext = NULL;
accelerationStructureCreateInfo.createFlags = 0;
accelerationStructureCreateInfo.buffer = reserved->accelerationStructureBuffer.getVkBuffer();
accelerationStructureCreateInfo.offset = 0;
accelerationStructureCreateInfo.size = accelerationStructureBuildSizesInfo.accelerationStructureSize;
accelerationStructureCreateInfo.type = type;
accelerationStructureCreateInfo.deviceAddress = 0;
// Create a new acceleration structure object
PFN_vkCreateAccelerationStructureKHR pvkCreateAccelerationStructureKHR =
(PFN_vkCreateAccelerationStructureKHR)vkGetDeviceProcAddr(logicalDevice.get().getVkDevice(), "vkCreateAccelerationStructureKHR");
if (pvkCreateAccelerationStructureKHR(logicalDevice.get().getVkDevice(), &accelerationStructureCreateInfo, NULL,
&reserved->accelerationStructure) != VK_SUCCESS)
{
OV_LOG_ERROR("Fail to create AS, id: %d.", this->Ov::Object::getId());
this->free();
return false;
}
// dstAccelerationStructure is a pointer to the target acceleration structure for the build.
reserved->accelerationStructureBuildGeometryInfo.dstAccelerationStructure = reserved->accelerationStructure;

After a good night's sleep I found the answer to the problem. The error was caused by an incorrect parameter passed to pvkGetAccelerationStructureBuildSizesKHR(): as pMaxPrimitiveCounts I was passing a pointer to the number of geometries, but it must point to an array holding the number of primitives for each geometry.
Vulkan spec:
If pBuildInfo->geometryCount is not 0, pMaxPrimitiveCounts must be a valid pointer to an array of pBuildInfo->geometryCount uint32_t values
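To make the fix concrete, here is a minimal sketch of building the per-geometry primitive-count array that pMaxPrimitiveCounts must point to. GeometryInput and buildMaxPrimitiveCounts are hypothetical names standing in for my geometry class; only the shape of the array matters.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-geometry record: only the face count matters here
// (one triangle primitive per face).
struct GeometryInput {
    uint32_t nrOfFaces;
};

// Build the array that vkGetAccelerationStructureBuildSizesKHR expects as
// pMaxPrimitiveCounts: one uint32_t per geometry, each holding that
// geometry's primitive count (not the number of geometries).
std::vector<uint32_t> buildMaxPrimitiveCounts(const std::vector<GeometryInput>& geometries)
{
    std::vector<uint32_t> counts;
    counts.reserve(geometries.size());
    for (const GeometryInput& g : geometries)
        counts.push_back(g.nrOfFaces);
    return counts;
}
```

The query then becomes, schematically, pvkGetAccelerationStructureBuildSizesKHR(device, VK_ACCELERATION_STRUCTURE_BUILD_TYPE_DEVICE_KHR, &buildGeometryInfo, counts.data(), &buildSizesInfo), with counts.size() equal to geometryCount. Note also that the comment in the code above describes a query for device builds, so the second argument presumably should be VK_ACCELERATION_STRUCTURE_BUILD_TYPE_DEVICE_KHR rather than the ..._HOST_KHR value actually passed.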


Intel oneAPI video decoding memory leak when using C++/CLI

I am trying to use Intel oneAPI/oneVPL to decode a stream I receive from an RTSP camera in C#, but when I run the code I get an enormous memory leak: around 1-200 MB per run, which happens roughly once every second.
When I've collected a GoP from the camera where I know the first data is a keyframe, I pass it as a byte array to my CLI and C++ code.
There I expect it to decode all the frames and return decoded images. It receives 30 frames and returns 16 decoded images, but has a memory leak.
I've tried the Visual Studio memory profiler, and all I can tell from it is that unmanaged memory is my problem. I've tried overriding the "new" and "delete" operators inside videoHandler.cpp to track and compare all allocations and deallocations, and as far as I can tell everything is handled correctly there. I cannot see any classes that get instantiated and not cleaned up. I think my issue is in the CLI class videoHandlerWrapper.cpp. Am I missing something obvious?
videoHandlerWrapper.cpp
array<imgFrameWrapper^>^ videoHandlerWrapper::decode(array<System::Byte>^ byteArray)
{
array<imgFrameWrapper^>^ returnFrames = gcnew array<imgFrameWrapper^>(30);
{
std::vector<imgFrame> frames(30); //Output from decoding process. imgFrame implements a destructor that frees the data when exiting scope
std::vector<unsigned char> bytes(byteArray->Length); //Input for decoding process
Marshal::Copy(byteArray, 0, IntPtr((unsigned char*)(&((bytes)[0]))), byteArray->Length); //Copy from managed (C#) to unmanaged (C++)
int status = _pVideoHandler->decode(bytes, frames); //Decode
for (size_t i = 0; i < frames.size(); i++)
{
if (frames[i].size > 0)
returnFrames[i] = gcnew imgFrameWrapper(frames[i].size, frames[i].bytes);
}
}
//PrintMemoryUsage();
return returnFrames;
}
videoHandler.cpp
#define BITSTREAM_BUFFER_SIZE 2000000 //TODO Maybe higher or lower bitstream buffer. Thorough testing has been done at 2000000
int videoHandler::decode(std::vector<unsigned char> bytes, std::vector<imgFrame> &frameData)
{
int result = -1;
bool isStillGoing = true;
mfxBitstream bitstream = { 0 };
mfxSession session = NULL;
mfxStatus sts = MFX_ERR_NONE;
mfxSurfaceArray* outSurfaces = nullptr;
mfxU32 framenum = 0;
mfxU32 numVPPCh = 0;
mfxVideoChannelParam* mfxVPPChParams = nullptr;
void* accelHandle = NULL;
mfxVideoParam mfxDecParams = {};
mfxVersion version = { 0, 1 };
//variables used only in 2.x version
mfxConfig cfg = NULL;
mfxLoader loader = NULL;
mfxVariant inCodec = {};
std::vector<mfxU8> input_buffer;
// Initialize VPL session for any implementation of HEVC/H265 decode
loader = MFXLoad();
VERIFY(NULL != loader, "MFXLoad failed -- is implementation in path?");
cfg = MFXCreateConfig(loader);
VERIFY(NULL != cfg, "MFXCreateConfig failed");
inCodec.Type = MFX_VARIANT_TYPE_U32;
inCodec.Data.U32 = MFX_CODEC_AVC;
sts = MFXSetConfigFilterProperty(
cfg,
(mfxU8*)"mfxImplDescription.mfxDecoderDescription.decoder.CodecID",
inCodec);
VERIFY(MFX_ERR_NONE == sts, "MFXSetConfigFilterProperty failed for decoder CodecID");
sts = MFXCreateSession(loader, 0, &session);
VERIFY(MFX_ERR_NONE == sts, "Not able to create VPL session");
// Print info about implementation loaded
version = ShowImplInfo(session);
//VERIFY(version.Major > 1, "Sample requires 2.x API implementation, exiting");
if (version.Major == 1) {
mfxVariant ImplValueSW;
ImplValueSW.Type = MFX_VARIANT_TYPE_U32;
ImplValueSW.Data.U32 = MFX_IMPL_TYPE_SOFTWARE;
MFXSetConfigFilterProperty(cfg, (mfxU8*)"mfxImplDescription.Impl", ImplValueSW);
sts = MFXCreateSession(loader, 0, &session);
VERIFY(MFX_ERR_NONE == sts, "Not able to create VPL session");
}
// Convenience function to initialize available accelerator(s)
accelHandle = InitAcceleratorHandle(session);
bitstream.MaxLength = BITSTREAM_BUFFER_SIZE;
bitstream.Data = (mfxU8*)calloc(bytes.size(), sizeof(mfxU8));
VERIFY(bitstream.Data, "Not able to allocate input buffer");
bitstream.CodecId = MFX_CODEC_AVC;
std::copy(bytes.begin(), bytes.end(), bitstream.Data);
bitstream.DataLength = static_cast<mfxU32>(bytes.size());
memset(&mfxDecParams, 0, sizeof(mfxDecParams));
mfxDecParams.mfx.CodecId = MFX_CODEC_AVC;
mfxDecParams.IOPattern = MFX_IOPATTERN_OUT_SYSTEM_MEMORY;
sts = MFXVideoDECODE_DecodeHeader(session, &bitstream, &mfxDecParams);
VERIFY(MFX_ERR_NONE == sts, "Error decoding header\n");
numVPPCh = 1;
mfxVPPChParams = new mfxVideoChannelParam[numVPPCh];
for (mfxU32 i = 0; i < numVPPCh; i++) {
mfxVPPChParams[i] = {};
}
//mfxVPPChParams[0].VPP.FourCC = mfxDecParams.mfx.FrameInfo.FourCC;
mfxVPPChParams[0].VPP.FourCC = MFX_FOURCC_BGRA;
mfxVPPChParams[0].VPP.ChromaFormat = MFX_CHROMAFORMAT_YUV420;
mfxVPPChParams[0].VPP.PicStruct = MFX_PICSTRUCT_PROGRESSIVE;
mfxVPPChParams[0].VPP.FrameRateExtN = 30;
mfxVPPChParams[0].VPP.FrameRateExtD = 1;
mfxVPPChParams[0].VPP.CropW = 1920;
mfxVPPChParams[0].VPP.CropH = 1080;
//Set value directly if input and output is the same.
mfxVPPChParams[0].VPP.Width = 1920;
mfxVPPChParams[0].VPP.Height = 1080;
//// USED TO RESIZE. IF INPUT IS THE SAME AS OUTPUT THIS WILL MAKE IT SHIFT A BIT. 1920x1080 becomes 1920x1088.
//mfxVPPChParams[0].VPP.Width = ALIGN16(mfxVPPChParams[0].VPP.CropW);
//mfxVPPChParams[0].VPP.Height = ALIGN16(mfxVPPChParams[0].VPP.CropH);
mfxVPPChParams[0].VPP.ChannelId = 1;
mfxVPPChParams[0].Protected = 0;
mfxVPPChParams[0].IOPattern = MFX_IOPATTERN_IN_SYSTEM_MEMORY | MFX_IOPATTERN_OUT_SYSTEM_MEMORY;
mfxVPPChParams[0].ExtParam = NULL;
mfxVPPChParams[0].NumExtParam = 0;
sts = MFXVideoDECODE_VPP_Init(session, &mfxDecParams, &mfxVPPChParams, numVPPCh); //This causes a MINOR memory leak!
outSurfaces = new mfxSurfaceArray;
while (isStillGoing == true) {
sts = MFXVideoDECODE_VPP_DecodeFrameAsync(session,
&bitstream,
NULL,
0,
&outSurfaces); //Big memory leak. 100MB pr run in the while loop.
switch (sts) {
case MFX_ERR_NONE:
// decode output
if (framenum >= 30)
{
isStillGoing = false;
break;
}
sts = WriteRawFrameToByte(outSurfaces->Surfaces[1], &frameData[framenum]);
VERIFY(MFX_ERR_NONE == sts, "Could not write 1st vpp output");
framenum++;
break;
case MFX_ERR_MORE_DATA:
// The function requires more bitstream at input before decoding can proceed
isStillGoing = false;
break;
case MFX_ERR_MORE_SURFACE:
// The function requires more frame surface at output before decoding can proceed.
// This applies to external memory allocations and should not be expected for
// a simple internal allocation case like this
break;
case MFX_ERR_DEVICE_LOST:
// For non-CPU implementations,
// Cleanup if device is lost
break;
case MFX_WRN_DEVICE_BUSY:
// For non-CPU implementations,
// Wait a few milliseconds then try again
break;
case MFX_WRN_VIDEO_PARAM_CHANGED:
// The decoder detected a new sequence header in the bitstream.
// Video parameters may have changed.
// In external memory allocation case, might need to reallocate the output surface
break;
case MFX_ERR_INCOMPATIBLE_VIDEO_PARAM:
// The function detected that video parameters provided by the application
// are incompatible with initialization parameters.
// The application should close the component and then reinitialize it
break;
case MFX_ERR_REALLOC_SURFACE:
// Bigger surface_work required. May be returned only if
// mfxInfoMFX::EnableReallocRequest was set to ON during initialization.
// This applies to external memory allocations and should not be expected for
// a simple internal allocation case like this
break;
default:
printf("unknown status %d\n", sts);
isStillGoing = false;
break;
}
}
sts = MFXVideoDECODE_VPP_Close(session); // Helps massively! Halves the memory leak speed. Closes internal structures and tables.
VERIFY(MFX_ERR_NONE == sts, "Error closing VPP session\n");
result = 0;
end:
printf("Decode and VPP processed %d frames\n", framenum);
// Clean up resources - It is recommended to close components first, before
// releasing allocated surfaces, since some surfaces may still be locked by
// internal resources.
if (mfxVPPChParams)
delete[] mfxVPPChParams;
if (outSurfaces)
delete outSurfaces;
if (bitstream.Data)
free(bitstream.Data);
if (accelHandle)
FreeAcceleratorHandle(accelHandle);
if (loader)
MFXUnload(loader);
return result;
}
imgFrameWrapper.h
public ref class imgFrameWrapper
{
private:
size_t size;
array<System::Byte>^ bytes;
public:
imgFrameWrapper(size_t u_size, unsigned char* u_bytes);
~imgFrameWrapper();
!imgFrameWrapper();
size_t get_size();
array<System::Byte>^ get_bytes();
};
imgFrameWrapper.cpp
imgFrameWrapper::imgFrameWrapper(size_t u_size, unsigned char* u_bytes)
{
size = u_size;
bytes = gcnew array<System::Byte>(size);
Marshal::Copy((IntPtr)u_bytes, bytes, 0, size);
}
imgFrameWrapper::~imgFrameWrapper()
{
}
imgFrameWrapper::!imgFrameWrapper()
{
}
size_t imgFrameWrapper::get_size()
{
return size;
}
array<System::Byte>^ imgFrameWrapper::get_bytes()
{
return bytes;
}
imgFrame.h
struct imgFrame
{
int size;
unsigned char* bytes;
~imgFrame()
{
if (bytes)
delete[] bytes;
}
};
The MFXVideoDECODE_VPP_DecodeFrameAsync() function creates internal memory surfaces for the processing.
You should release these surfaces.
Please check this link, which mentions it:
https://spec.oneapi.com/onevpl/latest/API_ref/VPL_structs_decode_vpp.html#_CPPv415mfxSurfaceArray
mfxStatus (*Release)(struct mfxSurfaceArray *surface_array)
Decrements the internal reference counter of the surface. (*Release) should be
called after using the (*AddRef) function to add a surface or when allocation
logic requires it.
And please check this sample.
https://github.com/oneapi-src/oneVPL/blob/master/examples/hello-decvpp/src/hello-decvpp.cpp
In particular, see the WriteRawFrame_InternalMem() function in https://github.com/oneapi-src/oneVPL/blob/17968d8d2299352f5a9e09388d24e81064c81c87/examples/util/util/util.h
It shows how to release surfaces.
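The ref-counting contract can be sketched with a stand-in type (the real mfxSurfaceArray lives in the oneVPL headers and exposes Release as a C function pointer): each surface array handed back by MFXVideoDECODE_VPP_DecodeFrameAsync should be released once its frame has been consumed.

```cpp
// Stand-in mimicking the mfxSurfaceArray ref-counting contract described in
// the oneVPL spec: Release decrements an internal reference counter so the
// runtime can recycle its internally allocated surfaces. The real struct
// exposes Release as a function pointer taking the array itself.
struct FakeSurfaceArray {
    int refCount = 1;
    int Release() { return --refCount; }  // returns the remaining count in this sketch
};

// In the decode loop above this corresponds to, roughly:
//
//     case MFX_ERR_NONE:
//         sts = WriteRawFrameToByte(outSurfaces->Surfaces[1], &frameData[framenum]);
//         ...
//         outSurfaces->Release(outSurfaces);  // shape of the real oneVPL call
//         break;
```

Without the release, every decoded frame keeps alive the surfaces the runtime allocated for it, which is consistent with the large per-iteration growth observed in the while loop.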

DirectX 11 Shader Reflection: Need help setting Constant Buffer variables by name

I'm creating a system that allows constant buffer variables to be manipulated by name, using their byte offsets and byte sizes obtained via shader reflection. I can bind the buffers to the device context just fine, but none of the cubes in my 3D scene are showing up. I believe this is because something is wrong with how I'm mapping data to the constant buffers. I cast the struct, be it a float4 or a matrix, to a void pointer, and then copy that data into another structure that holds the variable's name. Once the shader needs its buffers updated before a draw call, I map the data of every structure in the list during the Map/Unmap call, iterating with a pointer. Also, there seems to be a crash whenever the program calls the destructor on one of the shader constant structures. Below is the code I've written so far.
I'm mapping the buffer data through this algorithm here:
void DynamicConstantBuffer::UpdateChanges(ID3D11DeviceContext* pDeviceContext)
{
D3D11_MAPPED_SUBRESOURCE mappedResource;
HRESULT hr = pDeviceContext->Map(m_Buffer.Get(), 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResource);
if (FAILED(hr)) return;
// Set mapped data
for (const auto& constant : m_BufferConstants)
{
char* startPosition = static_cast<char*>(mappedResource.pData) + constant.desc.StartOffset;
memcpy(startPosition, constant.pData, sizeof(constant.pData));
}
// Copy memory and unmap
pDeviceContext->Unmap(m_Buffer.Get(), 0);
}
I'm initializing the Constant Buffer here:
BOOL DynamicConstantBuffer::Initialize(UINT nBufferSlot, ID3D11Device* pDevice, ID3D11ShaderReflection* pShaderReflection)
{
ID3D11ShaderReflectionConstantBuffer* pReflectionBuffer = NULL;
D3D11_SHADER_BUFFER_DESC cbShaderDesc = {};
// Fetch constant buffer description
if (!(pReflectionBuffer = pShaderReflection->GetConstantBufferByIndex(nBufferSlot)))
return FALSE;
// Get description
pReflectionBuffer->GetDesc(&cbShaderDesc);
m_BufferSize = cbShaderDesc.Size;
// Create constant buffer on gpu end
D3D11_BUFFER_DESC cbDescription = {};
cbDescription.Usage = D3D11_USAGE_DYNAMIC;
cbDescription.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
cbDescription.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
cbDescription.ByteWidth = cbShaderDesc.Size;
if (FAILED(pDevice->CreateBuffer(&cbDescription, NULL, m_Buffer.GetAddressOf())))
return FALSE;
// Poll shader variables
for (UINT i = 0; i < cbShaderDesc.Variables; i++)
{
ID3D11ShaderReflectionVariable* pVariable = NULL;
pVariable = pReflectionBuffer->GetVariableByIndex(i);
// Get variable description
D3D11_SHADER_VARIABLE_DESC variableDesc = {};
pVariable->GetDesc(&variableDesc);
// Push variable back into list of variables
m_BufferConstants.push_back(ShaderConstant(variableDesc));
}
return TRUE;
}
Here's the method that sets a variable within the constant buffer:
BOOL DynamicConstantBuffer::SetConstantVariable(const std::string& varName, const void* varData)
{
for (auto& v : m_BufferConstants)
{
if (v.desc.Name == varName)
{
memcpy(v.pData, varData, sizeof(varData));
bBufferDirty = TRUE;
return TRUE;
}
}
// No variable to assign :(
return FALSE;
}
Here's the class layout for ShaderConstant:
class ShaderConstant
{
public:
ShaderConstant(D3D11_SHADER_VARIABLE_DESC& desc)
{
this->desc = desc;
pData = new char[desc.Size];
_size = desc.Size;
};
~ShaderConstant()
{
if (!pData)
return;
delete[] pData;
pData = NULL;
}
D3D11_SHADER_VARIABLE_DESC desc;
void* pData;
size_t _size;
};
Any help at all would be appreciated. Thank you.

Is it possible to wait for a transfer from the staging buffer to complete without calling vkQueueWaitIdle

The following piece of code shows how I transfer vertex buffer data from the staging buffer to a device-local buffer:
bool Vulkan::UpdateVertexBuffer(std::vector<VERTEX>& data, VULKAN_BUFFER& vertex_buffer)
{
std::memcpy(this->staging_buffer.pointer, &data[0], vertex_buffer.size);
size_t flush_size = static_cast<size_t>(vertex_buffer.size);
unsigned int multiple = static_cast<unsigned int>(flush_size / this->physical_device.properties.limits.nonCoherentAtomSize);
flush_size = this->physical_device.properties.limits.nonCoherentAtomSize * ((uint64_t)multiple + 1);
VkMappedMemoryRange flush_range = {};
flush_range.sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
flush_range.pNext = nullptr;
flush_range.memory = this->staging_buffer.memory;
flush_range.offset = 0;
flush_range.size = flush_size;
vkFlushMappedMemoryRanges(this->device, 1, &flush_range);
VkResult result = vkWaitForFences(this->device, 1, &this->transfer.fence, VK_FALSE, 1000000000);
if(result != VK_SUCCESS) {
#if defined(_DEBUG)
std::cout << "UpdateVertexBuffer => vkWaitForFences : Timeout" << std::endl;
#endif
return false;
}
vkResetFences(this->device, 1, &this->transfer.fence);
VkCommandBufferBeginInfo command_buffer_begin_info = {};
command_buffer_begin_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
command_buffer_begin_info.pNext = nullptr;
command_buffer_begin_info.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
command_buffer_begin_info.pInheritanceInfo = nullptr;
vkBeginCommandBuffer(this->transfer.command_buffer, &command_buffer_begin_info);
VkBufferCopy buffer_copy_info = {};
buffer_copy_info.srcOffset = 0;
buffer_copy_info.dstOffset = 0;
buffer_copy_info.size = vertex_buffer.size;
vkCmdCopyBuffer(this->transfer.command_buffer, this->staging_buffer.handle, vertex_buffer.handle, 1, &buffer_copy_info);
VkBufferMemoryBarrier buffer_memory_barrier = {};
buffer_memory_barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
buffer_memory_barrier.pNext = nullptr;
buffer_memory_barrier.srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT;
buffer_memory_barrier.dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;
buffer_memory_barrier.srcQueueFamilyIndex = this->queue_stack[this->transfer_stack_index].index;
buffer_memory_barrier.dstQueueFamilyIndex = this->queue_stack[this->graphics_stack_index].index;
buffer_memory_barrier.buffer = vertex_buffer.handle;
buffer_memory_barrier.offset = 0;
buffer_memory_barrier.size = VK_WHOLE_SIZE;
vkCmdPipelineBarrier(this->transfer.command_buffer, VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_VERTEX_INPUT_BIT, 0, 0, nullptr, 1, &buffer_memory_barrier, 0, nullptr);
vkEndCommandBuffer(this->transfer.command_buffer);
VkSubmitInfo submit_info = {};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.pNext = nullptr;
submit_info.waitSemaphoreCount = 0;
submit_info.pWaitSemaphores = nullptr;
submit_info.pWaitDstStageMask = nullptr;
submit_info.commandBufferCount = 1;
submit_info.pCommandBuffers = &this->transfer.command_buffer;
submit_info.signalSemaphoreCount = 0;
submit_info.pSignalSemaphores = nullptr;
result = vkQueueSubmit(this->queue_stack[this->transfer_stack_index].handle, 1, &submit_info, this->transfer.fence);
if(result != VK_SUCCESS) {
#if defined(_DEBUG)
std::cout << "UpdateVertexBuffer => vkQueueSubmit : Failed" << std::endl;
#endif
return false;
}
#if defined(_DEBUG)
std::cout << "UpdateVertexBuffer : Success" << std::endl;
#endif
return true;
}
It works perfectly without any validation layer warning. But when I call it twice, both buffers contain the same data, taken from the second call. For example:
UpdateVertexBuffer(cube_data, cube_buffer);
UpdateVertexBuffer(prism_data, prism_buffer);
This results in a prism inside both cube_buffer and prism_buffer. To fix this, I can simply wait for a few milliseconds between the two calls:
UpdateVertexBuffer(cube_data, cube_buffer);
std::this_thread::sleep_for(std::chrono::milliseconds(100));
UpdateVertexBuffer(prism_data, prism_buffer);
or, preferably, I can replace the fence with a call to
vkQueueWaitIdle(this->queue_stack[this->transfer_stack_index].handle);
In my opinion this will result in a performance loss, and the fence is supposed to be the optimal way to wait for the transfer operation to complete. So why is my first buffer filled with the second buffer's data when I'm using a fence? And is there a way to do this properly without resorting to vkQueueWaitIdle?
Thanks for your help.
You wait for the fence of the previous upload after you have already written the new data to the staging buffer. That's too late; the fence is there to prevent you from overwriting memory that is still being read.
But really, your problem is that your design is wrong. Your design is such that sequential updates all use the same memory. They shouldn't. Instead, sequential updates should use different regions of the same memory, so that they cannot overlap. That way, you can perform the transfers and not have to wait on fences at all (or at least, not until next frame).
Basically, you should treat your staging buffer like a ring buffer. Every operation that wants to do some staged transfer work should "allocate" X bytes of memory from the staging ring buffer. The staging buffer system allocates memory sequentially, wrapping around if there is insufficient space. But it also remembers where the last memory region is that it synchronized with. If you try to stage too much work, then it has to synchronize.
Also, one of the purposes behind mapping memory is that you can write directly to that memory, rather than writing to some other CPU memory and copying it in. So instead of passing in a VULKAN_BUFFER (whatever that is), the process that generated that data should have fetched a pointer to a region of the active staging buffer and written its data into that.
Oh, and one more thing: never, ever create a command buffer and immediately submit it. Just don't do it. There's a reason why vkQueueSubmit can take multiple command buffers, and multiple batches of command buffers. For any one queue, you should never be submitting more than once (or maybe twice) per frame.

nvEncRegisterResource() fails with -23

I've hit a complete brick wall in my attempt to use NVEnc to stream OpenGL frames as H264. I've been at this particular issue for close to 8 hours without any progress.
The problem is the call to nvEncRegisterResource(), which invariably fails with code -23 (enum value NV_ENC_ERR_RESOURCE_REGISTER_FAILED, documented as "failed to register the resource" - thanks NVidia).
I'm trying to follow a procedure outlined in this document from the University of Oslo (page 54, "OpenGL interop"), so I know for a fact that this is supposed to work, though unfortunately said document does not provide the code itself.
The idea is fairly straightforward:
map the texture produced by the OpenGL frame buffer object into CUDA;
copy the texture into a (previously allocated) CUDA buffer;
map that buffer as an NVEnc input resource
use that input resource as the source for the encoding
As I said, the problem is step (3). Here are the relevant code snippets (I'm omitting error handling for brevity.)
// Round up width and height
priv->encWidth = (_resolution.w + 31) & ~31, priv->encHeight = (_resolution.h + 31) & ~31;
// Allocate CUDA "pitched" memory to match the input texture (YUV, one byte per component)
cuErr = cudaMallocPitch(&priv->cudaMemPtr, &priv->cudaMemPitch, 3 * priv->encWidth, priv->encHeight);
This should allocate on-device CUDA memory (the "pitched" variety, though I've tried non-pitched too, without any change in the outcome.)
// Register the CUDA buffer as an input resource
NV_ENC_REGISTER_RESOURCE regResParams = { 0 };
regResParams.version = NV_ENC_REGISTER_RESOURCE_VER;
regResParams.resourceType = NV_ENC_INPUT_RESOURCE_TYPE_CUDADEVICEPTR;
regResParams.width = priv->encWidth;
regResParams.height = priv->encHeight;
regResParams.bufferFormat = NV_ENC_BUFFER_FORMAT_YUV444_PL;
regResParams.resourceToRegister = priv->cudaMemPtr;
regResParams.pitch = priv->cudaMemPitch;
encStat = nvEncApi.nvEncRegisterResource(priv->nvEncoder, &regResParams);
// ^^^ FAILS
priv->nvEncInpRes = regResParams.registeredResource;
This is the brick wall. No matter what I try, nvEncRegisterResource() fails.
I should note that I rather think (though I may be wrong) that I've done all the required initializations. Here is the code that creates and activates the CUDA context:
// Pop the current context
cuRes = cuCtxPopCurrent(&priv->cuOldCtx);
// Create a context for the device
priv->cuCtx = nullptr;
cuRes = cuCtxCreate(&priv->cuCtx, CU_CTX_SCHED_BLOCKING_SYNC, priv->cudaDevice);
// Push our context
cuRes = cuCtxPushCurrent(priv->cuCtx);
.. followed by the creation of the encoding session:
// Create an NV Encoder session
NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS nvEncSessParams = { 0 };
nvEncSessParams.apiVersion = NVENCAPI_VERSION;
nvEncSessParams.version = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
nvEncSessParams.deviceType = NV_ENC_DEVICE_TYPE_CUDA;
nvEncSessParams.device = priv->cuCtx; // nullptr
auto encStat = nvEncApi.nvEncOpenEncodeSessionEx(&nvEncSessParams, &priv->nvEncoder);
And finally, the code initializing the encoder:
// Configure the encoder via preset
NV_ENC_PRESET_CONFIG presetConfig = { 0 };
GUID codecGUID = NV_ENC_CODEC_H264_GUID;
GUID presetGUID = NV_ENC_PRESET_LOW_LATENCY_DEFAULT_GUID;
presetConfig.version = NV_ENC_PRESET_CONFIG_VER;
presetConfig.presetCfg.version = NV_ENC_CONFIG_VER;
encStat = nvEncApi.nvEncGetEncodePresetConfig(priv->nvEncoder, codecGUID, presetGUID, &presetConfig);
NV_ENC_INITIALIZE_PARAMS initParams = { 0 };
initParams.version = NV_ENC_INITIALIZE_PARAMS_VER;
initParams.encodeGUID = codecGUID;
initParams.encodeWidth = priv->encWidth;
initParams.encodeHeight = priv->encHeight;
initParams.darWidth = 1;
initParams.darHeight = 1;
initParams.frameRateNum = 25; // TODO: make this configurable
initParams.frameRateDen = 1; // ditto
// .max_surface_count = (num_mbs >= 8160) ? 32 : 48;
// .buffer_delay ? necessary
initParams.enableEncodeAsync = 0;
initParams.enablePTD = 1;
initParams.presetGUID = presetGUID;
memcpy(&priv->nvEncConfig, &presetConfig.presetCfg, sizeof(priv->nvEncConfig));
initParams.encodeConfig = &priv->nvEncConfig;
encStat = nvEncApi.nvEncInitializeEncoder(priv->nvEncoder, &initParams);
All the above initializations report success.
I'd be extremely grateful to anyone who can get me past this hurdle.
EDIT: here is the complete code to reproduce the problem. The only observable difference from the original code is that cuCtxPopCurrent() returns an error (which can be ignored) here - probably my original program creates such a context as a side effect of using OpenGL. Otherwise, the code behaves exactly as the original does.
I've built the code with Visual Studio 2013. You must link the following library file (adapt path if not on C:): C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v7.5\lib\Win32\cuda.lib
You must also make sure that C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include\ (or similar) is in the include path.
NEW EDIT: modified the code to only use the CUDA driver interface, instead of mixing with the runtime API. Still the same error code.
#ifdef _WIN32
#include <Windows.h>
#endif
#include <cassert>
#include <GL/gl.h>
#include <iostream>
#include <string>
#include <stdexcept>
#include <string>
#include <cuda.h>
//#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <nvEncodeAPI.h>
// NV Encoder API ---------------------------------------------------
#if defined(_WIN32)
#define LOAD_FUNC(l, s) GetProcAddress(l, s)
#define DL_CLOSE_FUNC(l) FreeLibrary(l)
#else
#define LOAD_FUNC(l, s) dlsym(l, s)
#define DL_CLOSE_FUNC(l) dlclose(l)
#endif
typedef NVENCSTATUS(NVENCAPI* PNVENCODEAPICREATEINSTANCE)(NV_ENCODE_API_FUNCTION_LIST *functionList);
struct NVEncAPI : public NV_ENCODE_API_FUNCTION_LIST {
public:
// ~NVEncAPI() { cleanup(); }
void init() {
#if defined(_WIN32)
if (sizeof(void*) == 8) {
nvEncLib = LoadLibrary(TEXT("nvEncodeAPI64.dll"));
}
else {
nvEncLib = LoadLibrary(TEXT("nvEncodeAPI.dll"));
}
if (nvEncLib == NULL) throw std::runtime_error("Failed to load NVidia Encoder library: " + std::to_string(GetLastError()));
#else
nvEncLib = dlopen("libnvidia-encode.so.1", RTLD_LAZY);
if (nvEncLib == nullptr)
throw std::runtime_error("Failed to load NVidia Encoder library: " + std::string(dlerror()));
#endif
auto nvEncodeAPICreateInstance = (PNVENCODEAPICREATEINSTANCE) LOAD_FUNC(nvEncLib, "NvEncodeAPICreateInstance");
version = NV_ENCODE_API_FUNCTION_LIST_VER;
NVENCSTATUS encStat = nvEncodeAPICreateInstance(static_cast<NV_ENCODE_API_FUNCTION_LIST *>(this));
}
void cleanup() {
#if defined(_WIN32)
if (nvEncLib != NULL) {
FreeLibrary(nvEncLib);
nvEncLib = NULL;
}
#else
if (nvEncLib != nullptr) {
dlclose(nvEncLib);
nvEncLib = nullptr;
}
#endif
}
private:
#if defined(_WIN32)
HMODULE nvEncLib;
#else
void* nvEncLib;
#endif
bool init_done;
};
static NVEncAPI nvEncApi;
// Encoder class ----------------------------------------------------
class Encoder {
public:
typedef unsigned int uint_t;
struct Size { uint_t w, h; };
Encoder() {
CUresult cuRes = cuInit(0);
nvEncApi.init();
}
void init(const Size & resolution, uint_t texture) {
NVENCSTATUS encStat;
CUresult cuRes;
texSize = resolution;
yuvTex = texture;
// Purely for information
int devCount = 0;
cuRes = cuDeviceGetCount(&devCount);
// Initialize NVEnc
initEncodeSession(); // start an encoding session
initEncoder();
// Register the YUV texture as a CUDA graphics resource
// CODE COMMENTED OUT AS THE INPUT TEXTURE IS NOT NEEDED YET (TO MY UNDERSTANDING) AT SETUP TIME
//cudaGraphicsGLRegisterImage(&priv->cudaInpTexRes, priv->yuvTex, GL_TEXTURE_2D, cudaGraphicsRegisterFlagsReadOnly);
// Allocate CUDA "pitched" memory to match the input texture (YUV, one byte per component)
encWidth = (texSize.w + 31) & ~31, encHeight = (texSize.h + 31) & ~31;
cuRes = cuMemAllocPitch(&cuDevPtr, &cuMemPitch, 4 * encWidth, encHeight, 16);
// Register the CUDA buffer as an input resource
NV_ENC_REGISTER_RESOURCE regResParams = { 0 };
regResParams.version = NV_ENC_REGISTER_RESOURCE_VER;
regResParams.resourceType = NV_ENC_INPUT_RESOURCE_TYPE_CUDADEVICEPTR;
regResParams.width = encWidth;
regResParams.height = encHeight;
regResParams.bufferFormat = NV_ENC_BUFFER_FORMAT_YUV444_PL;
regResParams.resourceToRegister = (void*) cuDevPtr;
regResParams.pitch = cuMemPitch;
encStat = nvEncApi.nvEncRegisterResource(nvEncoder, &regResParams);
assert(encStat == NV_ENC_SUCCESS); // THIS IS THE POINT OF FAILURE
nvEncInpRes = regResParams.registeredResource;
}
void cleanup() { /* OMITTED */ }
void encode() {
// THE FOLLOWING CODE WAS NEVER REACHED YET BECAUSE OF THE ISSUE.
// INCLUDED HERE FOR REFERENCE.
CUresult cuRes;
NVENCSTATUS encStat;
cuRes = cuGraphicsResourceSetMapFlags(cuInpTexRes, CU_GRAPHICS_MAP_RESOURCE_FLAGS_READ_ONLY);
cuRes = cuGraphicsMapResources(1, &cuInpTexRes, 0);
CUarray mappedArray;
cuRes = cuGraphicsSubResourceGetMappedArray(&mappedArray, cuInpTexRes, 0, 0);
cuRes = cuMemcpyDtoA(mappedArray, 0, cuDevPtr, 4 * encWidth * encHeight);
NV_ENC_MAP_INPUT_RESOURCE mapInputResParams = { 0 };
mapInputResParams.version = NV_ENC_MAP_INPUT_RESOURCE_VER;
mapInputResParams.registeredResource = nvEncInpRes;
encStat = nvEncApi.nvEncMapInputResource(nvEncoder, &mapInputResParams);
// TODO: encode...
cuRes = cuGraphicsUnmapResources(1, &cuInpTexRes, 0);
}
private:
struct PrivateData;
void initEncodeSession() {
CUresult cuRes;
NVENCSTATUS encStat;
// Pop the current context
cuRes = cuCtxPopCurrent(&cuOldCtx); // THIS IS ALLOWED TO FAIL (it doesn't matter here)
// Create a context for the device
cuCtx = nullptr;
cuRes = cuCtxCreate(&cuCtx, CU_CTX_SCHED_BLOCKING_SYNC, 0);
// Push our context
cuRes = cuCtxPushCurrent(cuCtx);
// Create an NV Encoder session
NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS nvEncSessParams = { 0 };
nvEncSessParams.apiVersion = NVENCAPI_VERSION;
nvEncSessParams.version = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
nvEncSessParams.deviceType = NV_ENC_DEVICE_TYPE_CUDA;
nvEncSessParams.device = cuCtx;
encStat = nvEncApi.nvEncOpenEncodeSessionEx(&nvEncSessParams, &nvEncoder);
}
void initEncoder()
{
NVENCSTATUS encStat;
// Configure the encoder via preset
NV_ENC_PRESET_CONFIG presetConfig = { 0 };
GUID codecGUID = NV_ENC_CODEC_H264_GUID;
GUID presetGUID = NV_ENC_PRESET_LOW_LATENCY_DEFAULT_GUID;
presetConfig.version = NV_ENC_PRESET_CONFIG_VER;
presetConfig.presetCfg.version = NV_ENC_CONFIG_VER;
encStat = nvEncApi.nvEncGetEncodePresetConfig(nvEncoder, codecGUID, presetGUID, &presetConfig);
NV_ENC_INITIALIZE_PARAMS initParams = { 0 };
initParams.version = NV_ENC_INITIALIZE_PARAMS_VER;
initParams.encodeGUID = codecGUID;
initParams.encodeWidth = texSize.w;
initParams.encodeHeight = texSize.h;
initParams.darWidth = texSize.w;
initParams.darHeight = texSize.h;
initParams.frameRateNum = 25;
initParams.frameRateDen = 1;
initParams.enableEncodeAsync = 0;
initParams.enablePTD = 1;
initParams.presetGUID = presetGUID;
memcpy(&nvEncConfig, &presetConfig.presetCfg, sizeof(nvEncConfig));
initParams.encodeConfig = &nvEncConfig;
encStat = nvEncApi.nvEncInitializeEncoder(nvEncoder, &initParams);
}
//void cleanupEncodeSession();
//void cleanupEncoder;
Size texSize;
GLuint yuvTex;
uint_t encWidth, encHeight;
CUdeviceptr cuDevPtr;
size_t cuMemPitch;
NV_ENC_CONFIG nvEncConfig;
NV_ENC_INPUT_PTR nvEncInpBuf;
NV_ENC_REGISTERED_PTR nvEncInpRes;
CUdevice cuDevice;
CUcontext cuCtx, cuOldCtx;
void *nvEncoder;
CUgraphicsResource cuInpTexRes;
};
int main(int argc, char *argv[])
{
Encoder encoder;
encoder.init({1920, 1080}, 0); // OMITTED THE TEXTURE AS IT IS NOT NEEDED TO REPRODUCE THE ISSUE
return 0;
}
After comparing the NVidia sample NvEncoderCudaInterop with my minimal code, I finally found the item that makes the difference between success and failure: it's the pitch parameter of the NV_ENC_REGISTER_RESOURCE structure passed to nvEncRegisterResource().
I haven't seen it documented anywhere, but there's a hard limit on that value, which I've determined experimentally to be at 2560. Anything above that will result in NV_ENC_ERR_RESOURCE_REGISTER_FAILED.
It does not appear to matter that the pitch I was passing was calculated by another API call, cuMemAllocPitch().
(Another thing that was missing from my code was "locking" and unlocking the CUDA context to the current thread via cuCtxPushCurrent() and cuCtxPopCurrent(). Done in the sample via a RAII class.)
EDIT:
I have worked around the problem by doing something for which I had another reason: using NV12 as input format for the encoder instead of YUV444.
With NV12, the pitch parameter drops below the 2560 limit because the byte size per row is equal to the width, so in my case 1920 bytes.
This was necessary (at the time) because my graphics card was a GTX 760 with a "Kepler" GPU, which (as I was initially unaware) only supports NV12 as input format for NVEnc. I have since upgraded to a GTX 970, but as I just found out, the 2560 limit is still there.
This makes me wonder just how exactly one is expected to use NVEnc with YUV444. The only possibility that comes to my mind is to use non-pitched memory, which seems bizarre. I'd appreciate comments from people who've actually used NVEnc with YUV444.
EDIT #2 - PENDING FURTHER UPDATE:
New information has surfaced in the form of another SO question: NVencs Output Bitstream is not readable
It is quite possible that my answer so far was wrong. It seems now that the pitch should not only be set when registering the CUDA resource, but also when actually sending it to the encoder via nvEncEncodePicture(). I cannot check this right now, but I will next time I work on that project.

Is it possible to cache mapped regions returned from MapViewOfFile?

Good afternoon. It is well known that when dealing with large files that cannot be mapped into a single view in Win32, one must write code that carefully maps and unmaps file regions as they are needed.
I created and tested a cMemoryMappedFile class that deals with large files
that cannot be mapped to one view in Win32. I tested the class and found
that while it functions OK, it takes a long time (i.e. 3 seconds) for
random access. This is because the class has to unmap and map a file
region for every random access. I was wondering if it is possible to
cache the mapped regions returned from MapViewOfFile to speed up random access.
Yesterday, I noticed that UnmapViewOfFile invalidates a previously
mapped region returned from MapViewOfFile. Does anyone have ideas
about how to speed up random access through caching or other methods?
Currently the viewport is 128 KB. I believe that if I enlarge the
viewport it will reduce the number of calls to UnmapViewOfFile
and MapViewOfFile. However, I was wondering if I could use other
methods. Please look at the method
char* cMemoryMappedFile::GetPointer(int, bool) to see how the
viewport is shifted with the file mapping. Thank you.
The pastebin url for the class is
I am adding the source code here in case no one can access the url.
// cMemoryMappedFile.Cpp
#include "cException.h"
#include "cMemoryMappedFile.h"
#define BUFFER_SIZE 10
#define MEM_BLOCK_SIZE (65536 * 2)
/**
\class cMemoryMappedFile
\brief Encapsulation of the Windows Memory Management API.
The cMemoryMapped class makes some memory mapping operations easier.
*/
/**
\brief Constructor for cMemoryMappedFile object.
\param FileSize Size of file.
\param OpenMode File open mode
\param AccessModes File access mode
\param ShareMode File sharing mode
\param Flags File attributes and flags
\param Security Security Attributes
\param Template Extended attributes to apply to a newly created file
*/
cMemoryMappedFile::cMemoryMappedFile(long FileSize_, OpenModes OpenMode_,AccessModes AccessMode_,
ShareModes ShareMode_,long Flags_,void *Security_,FILEHANDLE Template_) {
FileSize = FileSize_;
char buffer[BUFFER_SIZE];
DWORD dwRetVal = 0;
UINT uRetVal = 0;
DWORD dwPtr = 0;
BOOL isSetEndOfFile = FALSE;
LARGE_INTEGER Distance_;
DWORD ErrorCode = 0;
char lpTempPathBuffer[MAX_PATH];
PreviousNCopy = 0;
PreviousN = 0;
// Gets the temp path env string (no guarantee it's a valid path).
dwRetVal = GetTempPath(MAX_PATH, // length of the buffer
lpTempPathBuffer); // buffer for path
if (dwRetVal > MAX_PATH || (dwRetVal == 0))
{
throw cException(ERR_MEMORYMAPPING,"");
}
// Generates a temporary file name.
uRetVal = GetTempFileName(lpTempPathBuffer, // directory for tmp files
TEXT("DEMO"), // temp file name prefix
0, // create unique name
TempFileName); // buffer for name
if (uRetVal == 0)
{
throw cException(ERR_MEMORYMAPPING,lpTempPathBuffer);
}
// Creates the new file
hFile = CreateFile((LPTSTR) TempFileName, // file name
AccessMode_, // open for write
0, // do not share
(SECURITY_ATTRIBUTES *) Security_, // default security
OpenMode_, // CREATE_ALWAYS,
Flags_,// normal file
Template_); // no template
if (hFile == INVALID_HANDLE_VALUE)
{
throw cException(ERR_MEMORYMAPPING,TempFileName);
}
Distance_.LowPart = (ULONG)FileSize_;
Distance_.HighPart = 0; // (ULONG)(FileSize_ >> 32);
dwPtr = ::SetFilePointer(hFile,Distance_.LowPart,
&(Distance_.HighPart), FileBegin);
if (dwPtr == INVALID_SET_FILE_POINTER){
throw cException(ERR_MEMORYMAPPING,TempFileName);
}
isSetEndOfFile = SetEndOfFile(hFile);
if (!isSetEndOfFile){
ErrorCode = GetLastError();
throw cException(ERR_MEMORYMAPPING,TempFileName);
}
hMapping=::CreateFileMapping(hFile,(SECURITY_ATTRIBUTES *)Security_,PAGE_READWRITE,0,0,0);
if (hMapping==NULL)
throw cException(ERR_MEMORYMAPPING,TempFileName);
MapPtr = 0;
adjustedptr = 0;
prevadjustedptr = adjustedptr;
FilePath=new char[strlen(TempFileName)+1];
strcpy(FilePath,TempFileName);
}
char * cMemoryMappedFile::GetPointer(int n, bool Caching){
unsigned int baseoff;
if( n < MEM_BLOCK_SIZE / 2)
{
baseoff = 0;
}
else
{
baseoff = ((n + MEM_BLOCK_SIZE / 4) &
(~(MEM_BLOCK_SIZE / 2 - 1))) - MEM_BLOCK_SIZE / 2;
}
// the correct memory mapped view is already mapped in
if (adjustedptr != 0 && mappedoffset == baseoff && Caching)
return adjustedptr;
else if (Caching)
{
/*
retrieve adjustedptr from cache
*/
}
// get a new memory mapped viewport
else{
if (MapPtr){
UnmapViewOfFile(MapPtr);
PreviousNCopy = PreviousN;
prevadjustedptr = adjustedptr;
}
PreviousN = n;
mappedlength = min(FileSize - baseoff, MEM_BLOCK_SIZE);
// MapViewOfFile should be aligned to 64K boundary
MapPtr = (char*)::MapViewOfFile( hMapping,
FILE_MAP_WRITE | FILE_MAP_READ, 0,
baseoff, mappedlength);
mappedoffset = baseoff;
adjustedptr = MapPtr - mappedoffset;
printf("Value: %u n: %u\n",adjustedptr[n],n);
/*
cache PreviousNCopy,PreviousN, prevadjustedptr[PreviousNCopy]
*/
}
return adjustedptr;
}
You could have a "free list" style cache --- when the user of your class asks to unmap a region you don't really, you just add it to the list. When they ask to map a new region then you reuse an existing mapping if possible, otherwise you create a new mapping, deleting the least-recently-used one from the cache if you've got too many mappings open, or where the mapped size of the cached mappings is too large.