NVencs Output Bitstream is not readable - c++

I have one question related to Nvidias NVenc API. I want to use the API to encode some OpenGL graphics. My problem is, that the API reports no error throughout the whole program, everything seems to be fine. But the generated output is not readable by, e.g. VLC. If I try to play the generated file, VLC would flash a black screen for about 0.5s, then ends the playback.
The Video has the length of 0, the size of the Vid seems rather small, too.
Resolution is 1280*720 and the size of 5secs recording is only 700kb. Is this realistic?
The flow of the application is as following:
Render to secondary Framebuffer
Download Framebuffer to one of two PBOs (glReadPixels())
Map the PBO of the previous frame, to get a pointer understandable by Cuda.
Call a simple CudaKernel converting OpenGLs RGBA to ARGB which should be understandable by NVenc according to this(p.18). The kernel reads the content of the PBO and writes the converted content into a CudaArray (created with cudaMalloc) which is registered as InputResource with NVenc.
The content of the converted Array gets encoded. A completion event plus the corresponding output bitstream buffer get queued.
A secondary thread listens on the queued output events, if one event is signaled, the Output Bitstream gets mapped and written to hdd.
The initializion of NVenc-Encoder:
InitParams* ip = new InitParams();
m_initParams = ip;
memset(ip, 0, sizeof(InitParams));
ip->version = NV_ENC_INITIALIZE_PARAMS_VER;
ip->encodeGUID = m_encoderGuid; //Used Codec
ip->encodeWidth = width; // Frame Width
ip->encodeHeight = height; // Frame Height
ip->maxEncodeWidth = 0; // Zero means no dynamic res changes
ip->maxEncodeHeight = 0;
ip->darWidth = width; // Aspect Ratio
ip->darHeight = height;
ip->frameRateNum = 60; // 60 fps
ip->frameRateDen = 1;
ip->reportSliceOffsets = 0; // According to programming guide
ip->enableSubFrameWrite = 0;
ip->presetGUID = m_presetGuid; // Used Preset for Encoder Config
NV_ENC_PRESET_CONFIG presetCfg; // Load the Preset Config
memset(&presetCfg, 0, sizeof(NV_ENC_PRESET_CONFIG));
presetCfg.version = NV_ENC_PRESET_CONFIG_VER;
presetCfg.presetCfg.version = NV_ENC_CONFIG_VER;
CheckApiError(m_apiFunctions.nvEncGetEncodePresetConfig(m_Encoder,
m_encoderGuid, m_presetGuid, &presetCfg));
memcpy(&m_encodingConfig, &presetCfg.presetCfg, sizeof(NV_ENC_CONFIG));
// And add information about Bitrate etc
m_encodingConfig.rcParams.averageBitRate = 500000;
m_encodingConfig.rcParams.maxBitRate = 600000;
m_encodingConfig.rcParams.rateControlMode = NV_ENC_PARAMS_RC_MODE::NV_ENC_PARAMS_RC_CBR;
ip->encodeConfig = &m_encodingConfig;
ip->enableEncodeAsync = 1; // Async Encoding
ip->enablePTD = 1; // Encoder handles picture ordering
Registration of CudaResource
m_cuContext->SetCurrent(); // Make the clients cuCtx current
NV_ENC_REGISTER_RESOURCE res;
memset(&res, 0, sizeof(NV_ENC_REGISTER_RESOURCE));
NV_ENC_REGISTERED_PTR resPtr; // handle to the cuda resource for future use
res.bufferFormat = m_inputFormat; // Format is ARGB
res.height = m_height;
res.width = m_width;
// NOTE: I've set the pitch to the width of the frame, because the resource is a non-pitched
//cudaArray. Is this correct? Pitch = 0 would produce no output.
res.pitch = pitch;
res.resourceToRegister = (void*) (uintptr_t) resourceToRegister; //CUdevptr to resource
res.resourceType =
NV_ENC_INPUT_RESOURCE_TYPE::NV_ENC_INPUT_RESOURCE_TYPE_CUDADEVICEPTR;
res.version = NV_ENC_REGISTER_RESOURCE_VER;
CheckApiError(m_apiFunctions.nvEncRegisterResource(m_Encoder, &res));
m_registeredInputResources.push_back(res.registeredResource);
Encoding
m_cuContext->SetCurrent(); // Make Clients context current
MapInputResource(id); //Map the CudaInputResource
NV_ENC_PIC_PARAMS temp;
memset(&temp, 0, sizeof(NV_ENC_PIC_PARAMS));
temp.version = NV_ENC_PIC_PARAMS_VER;
unsigned int currentBufferAndEvent = m_counter % m_registeredEvents.size(); //Counter is inc'ed in every Frame
temp.bufferFmt = m_currentlyMappedInputBuffer.mappedBufferFmt;
temp.inputBuffer = m_currentlyMappedInputBuffer.mappedResource; //got set by MapInputResource
temp.completionEvent = m_registeredEvents[currentBufferAndEvent];
temp.outputBitstream = m_registeredOutputBuffers[currentBufferAndEvent];
temp.inputWidth = m_width;
temp.inputHeight = m_height;
temp.inputPitch = m_width;
temp.inputTimeStamp = m_counter;
temp.pictureStruct = NV_ENC_PIC_STRUCT_FRAME; // According to samples
temp.qpDeltaMap = NULL;
temp.qpDeltaMapSize = 0;
EventWithId latestEvent(currentBufferAndEvent,
m_registeredEvents[currentBufferAndEvent]);
PushBackEncodeEvent(latestEvent); // Store the Event with its ID in a Queue
CheckApiError(m_apiFunctions.nvEncEncodePicture(m_Encoder, &temp));
m_counter++;
UnmapInputResource(id); // Unmap
Every little hint, where to look at, is very much appreciated. I'm running out of ideas what might be wrong.
Thanks a lot!

With the help of hall822 from the nvidia forums I managed to solve the issue.
The primary error was that I registered my cuda resource with a pitch equal to the size of the frame. I'm using a Framebuffer-Renderbuffer to draw my content into. The data of this is a plain, unpitched array. My first thought, giving a pitch equal to zero, failed. The encoder did nothing. The next idea was to set it to the width of the frame, a quarter of the image was encoded.
// NOTE: I've set the pitch to the width of the frame, because the resource is a non-pitched
//cudaArray. Is this correct? Pitch = 0 would produce no output.
res.pitch = pitch;
To answer this question: Yes, it is correct. But the pitch is measured in byte. So because I'm encoding RGBA-Frames, the correct pitch has to be FRAME_WIDTH * 4.
The second error was that my color channels were not right (See point 4 in my opening post). The NVidia enum says that the encoder expects the channels in ARGB format but actually ment is BGRA, so the alpha channel which is always 255 polluted the blue channel.
Edit: This may be due to the fact that NVidia is using little endian internally. I'm writing
my pixel data to a byte array, choosing an other type like int32 may allow one to pass actual ARGB data.

Related

Getting audio sound level from FLTP audio stream

I need to get audio level or even better, EQ data from NDI audio stream in C++. Here's the struct of a audio packet:
// This describes an audio frame.
typedef struct NDIlib_audio_frame_v3_t {
// The sample-rate of this buffer.
int sample_rate;
// The number of audio channels.
int no_channels;
// The number of audio samples per channel.
int no_samples;
// The timecode of this frame in 100-nanosecond intervals.
int64_t timecode;
// What FourCC describing the type of data for this frame.
NDIlib_FourCC_audio_type_e FourCC;
// The audio data.
uint8_t* p_data;
union {
// If the FourCC is not a compressed type and the audio format is planar, then this will be the
// stride in bytes for a single channel.
int channel_stride_in_bytes;
// If the FourCC is a compressed type, then this will be the size of the p_data buffer in bytes.
int data_size_in_bytes;
};
// Per frame metadata for this frame. This is a NULL terminated UTF8 string that should be in XML format.
// If you do not want any metadata then you may specify NULL here.
const char* p_metadata;
// This is only valid when receiving a frame and is specified as a 100-nanosecond time that was the exact
// moment that the frame was submitted by the sending side and is generated by the SDK. If this value is
// NDIlib_recv_timestamp_undefined then this value is not available and is NDIlib_recv_timestamp_undefined.
int64_t timestamp;
#if NDILIB_CPP_DEFAULT_CONSTRUCTORS
NDIlib_audio_frame_v3_t(
int sample_rate_ = 48000, int no_channels_ = 2, int no_samples_ = 0,
int64_t timecode_ = NDIlib_send_timecode_synthesize,
NDIlib_FourCC_audio_type_e FourCC_ = NDIlib_FourCC_audio_type_FLTP,
uint8_t* p_data_ = NULL, int channel_stride_in_bytes_ = 0,
const char* p_metadata_ = NULL,
int64_t timestamp_ = 0
);
#endif // NDILIB_CPP_DEFAULT_CONSTRUCTORS
} NDIlib_audio_frame_v3_t;
Problem is that unlike video frames I have absolutely no idea how binary audio is packed and there's much less information about it online. The best information I found so far is this project:
https://github.com/gavinnn101/fishing_assistant/blob/7f5fcd73de1e39336226b5969cd1c5ca84c8058b/fishing_main.py#L124
It uses PyAudio however which I'm not familiar with and they use 16 bit audio format while mine seems to be 32bit and I can't figure out the struct.unpack stuff either because "%dh"%(count) is telling it some number then h for short which I don't understand how it would interpret.
Is there any C++ library that can take pointer to the data and type then has functions to extract sound level, sound level at certain hertz etc?
Or just some good information on how I would extract this myself? :)
I've searched the web a lot while finding very little. I've placed a breakpoint when the audio frame is populated but given up once I realize there's too many variables to think of that I don't have a clue about like sample rate, channels, sample count etc.
Got it working using
// This function calculates the RMS value of an audio frame
float calculateRMS(const NDIlib_audio_frame_v2_t& frame)
{
// Calculate the number of samples in the frame
int numSamples = frame.no_samples * frame.no_channels;
// Get a pointer to the start of the audio data
const float* data = frame.p_data;
// Calculate the sum of the squares of the samples
float sumSquares = 0.0f;
for (int i = 0; i < numSamples; ++i)
{
float sample = data[i];
sumSquares += sample * sample;
}
// Calculate the RMS value and return it
return std::sqrt(sumSquares / numSamples);
}
called as
// Keep receiving audio frames and printing their RMS values
NDIlib_audio_frame_v2_t audioFrame;
while (true)
{
// Wait for the next audio frame to be received
if (NDIlib_recv_capture_v2(pNDI_recv, NULL, &audioFrame, NULL, 0) != NDIlib_frame_type_audio)
continue;
// Print the RMS value of the audio frame
std::cout << "RMS: " << calculateRMS(audioFrame) << std::endl;
NDIlib_recv_free_audio_v2(pNDI_recv, &audioFrame);
}
Shoutout to chatGPT for explaining and feeding me with possible solutions until I managed to get a working solution :--)

DirectShow: Rendering for YV12

I am trying to understand how to work the rendering for YV12 format. For example, I took a simple sample. See this graph:
The webcam creates frames by size 640x480 in RGB24 or MJPEG . After it the LAV decoder transforms the frames to YV12 and sends them to DS renderer (EVR or VMR9).
The decoder changes the frame width (stride) 640 on 1024. Hence, the output size of frame will be 1.5*1024*640=737280. The normal size for YV12 is 1.5*640*480=460800. I know the stride can be more than the width of real frame (https://learn.microsoft.com/en-us/windows/desktop/medfound/image-stride). My first question - why did the renderer select that value (1024) than another? Can I get it programmatically?
When I replace the LAV decoder with my filter for transformation RGB24/YV12 (https://gist.github.com/thedeemon/8052fb98f8ba154510d7), the renderer shows me a shifted image, though all parameters are the same, as for the first graph:
Why? I noted that VIDEOINFOHEADER2 had the set interlacing flag dwInterlaceFlags. Therefore my next question: Do I have to add interlacing into my filter for normal work of renderer?
My first question - why did the renderer select that value (1024) than another? Can I get it programmatically?
Video renderer is using Direct3D texture as a carrier for the image. When texture is mapped into system memory to enable CPU's write access such extended stride could be applied because of specifics of implementation of video hardware. You get the value 1024 via dynamic media type negotiation as described in Handling Format Changes from the Video Renderer.
Your transformation filter has to handle such updates if you want it to be able to connect to video renderer directly.
You are generally not interested in getting this extended stride value otherwise because the one you get via media type update is the one to be used and you have to accept it.
When I replace the LAV decoder with my filter for transformation RGB24/YV12, the renderer shows me a shifted image, though all parameters are the same, as for the first graph...
Why?
Your filter does not handle stride update right.
...I noted that VIDEOINFOHEADER2 had the set interlacing flag dwInterlaceFlags. Therefore my next question: Do I have to add interlacing into my filter for normal work of renderer?
You don't have interlaced video here. The problem is unrelated to interlaced video.
My solution:
I must right copy a YV12 frame into a video buffer by its three surfaces: Y = 4x4, U = 1x2, V = 1x2. Here is a code for the frame size 640x480:
CVideoGrabberFilter::Transform(IMediaSample *pIn, IMediaSample *pOut)
{
BYTE* pSrcBuf = 0;
pIn->GetPointer(&pSrcBuf);
BYTE* pDstBuf = 0;
pOut->GetPointer(&pDstBuf);
SIZE size;
size.cx = 640;
size.cy = 480;
int nLen = pOut->GetActualDataLength();
BYTE* pDstTmp = new BYTE[nLen];
YV12ConverterFromRGB24(pSrcBufEnd, pDstTmp, size.cx, size.cy);
BYTE* pDst = pDstTmp;
int stride = 1024; //the real video stride for 640x480. For other resolutions you need to use pOut->GetMediaType() for the stride defining.
//Y
for (int y = 0; y < size.cy; ++y)
{
memcpy(pDstBuf, pDst, size.cx);
pDst += size.cx;
pDstBuf += stride;
}
stride /= 2;
size.cy /= 2;
size.cx /= 2;
//U and V
for (int y = 0; y < size.cy; y++ )
{
memcpy(pDstBuf, pDst, size.cx );
pDst += size.cx;
pDstBuf += stride;
memcpy(pDstBuf, pDst, size.cx);
pDst += size.cx;
pDstBuf += stride;
}
delete[] pDstTmp;
}

OpenGL to FFMpeg encode

I have a opengl buffer that I need to forward directly to ffmpeg to do the nvenc based h264 encoding.
My current way of doing this is glReadPixels to get the pixels out of the frame buffer and then passing that pointer into ffmpeg such that it can encode the frame into H264 packets for RTSP. However, this is bad because I have to copy bytes out of the GPU ram into CPU ram, to only copy them back into the GPU for encoding.
If you look at the date of posting versus the date of this answer you'll notice I spent much time working on this. (It was my full time job the past 4 weeks).
Since I had such a difficult time getting this to work I will write up a short guide to hopefully help out whomever finds this.
Outline
The basic flow I have is OGL Frame buffer object color attachement (texture) → nvenc (nvidia encoder)
Things to note
Some things to note:
1) The nvidia encoder can accept YUV or RGB type images.
2) FFMPEG 4.0 and under cannot pass RGB images to nvenc.
3) FFMPEG was updated to accept RGB as input, per my issues.
There are a couple different things to know about:
1) AVHWDeviceContext- Think of this as ffmpegs device abstraction layer.
2) AVHWFramesContext- Think of this as ffmpegs hardware frame abstraction layer.
3) cuMemcpy2D- The required method to copy a cuda mapped OGL texture into a cuda buffer created by ffmpeg.
Comprehensiveness
This guide is in addition to standard software encoding guidelines. This is NOT complete code, and should only be used in addition to the standard flow.
Code details
Setup
You will need to first get your gpu name, to do this I found some code (I cannot remember where I got it from) that made some cuda calls and got the GPU name:
int getDeviceName(std::string& gpuName)
{
//Setup the cuda context for hardware encoding with ffmpeg
NV_ENC_BUFFER_FORMAT eFormat = NV_ENC_BUFFER_FORMAT_IYUV;
int iGpu = 0;
CUresult res;
ck(cuInit(0));
int nGpu = 0;
ck(cuDeviceGetCount(&nGpu));
if (iGpu < 0 || iGpu >= nGpu)
{
std::cout << "GPU ordinal out of range. Should be within [" << 0 << ", "
<< nGpu - 1 << "]" << std::endl;
return 1;
}
CUdevice cuDevice = 0;
ck(cuDeviceGet(&cuDevice, iGpu));
char szDeviceName[80];
ck(cuDeviceGetName(szDeviceName, sizeof(szDeviceName), cuDevice));
gpuName = szDeviceName;
epLog::msg(epMSG_STATUS, "epVideoEncode:H264Encoder", "...using device \"%s\"", szDeviceName);
return 0;
}
Next you will need to setup your hwdevice and hwframe contexts:
getDeviceName(gpuName);
ret = av_hwdevice_ctx_create(&m_avBufferRefDevice, AV_HWDEVICE_TYPE_CUDA, gpuName.c_str(), NULL, NULL);
if (ret < 0)
{
return -1;
}
//Example of casts needed to get down to the cuda context
AVHWDeviceContext* hwDevContext = (AVHWDeviceContext*)(m_avBufferRefDevice->data);
AVCUDADeviceContext* cudaDevCtx = (AVCUDADeviceContext*)(hwDevContext->hwctx);
m_cuContext = &(cudaDevCtx->cuda_ctx);
//Create the hwframe_context
// This is an abstraction of a cuda buffer for us. This enables us to, with one call, setup the cuda buffer and ready it for input
m_avBufferRefFrame = av_hwframe_ctx_alloc(m_avBufferRefDevice);
//Setup some values before initialization
AVHWFramesContext* frameCtxPtr = (AVHWFramesContext*)(m_avBufferRefFrame->data);
frameCtxPtr->width = width;
frameCtxPtr->height = height;
frameCtxPtr->sw_format = AV_PIX_FMT_0BGR32; // There are only certain supported types here, we need to conform to these types
frameCtxPtr->format = AV_PIX_FMT_CUDA;
frameCtxPtr->device_ref = m_avBufferRefDevice;
frameCtxPtr->device_ctx = (AVHWDeviceContext*)m_avBufferRefDevice->data;
//Initialization - This must be done to actually allocate the cuda buffer.
// NOTE: This call will only work for our input format if the FFMPEG library is >4.0 version..
ret = av_hwframe_ctx_init(m_avBufferRefFrame);
if (ret < 0) {
return -1;
}
//Cast the OGL texture/buffer to cuda ptr
CUresult res;
CUcontext oldCtx;
m_inputTexture = texture;
res = cuCtxPopCurrent(&oldCtx); // THIS IS ALLOWED TO FAIL
res = cuCtxPushCurrent(*m_cuContext);
res = cuGraphicsGLRegisterImage(&cuInpTexRes, m_inputTexture, GL_TEXTURE_2D, CU_GRAPHICS_REGISTER_FLAGS_READ_ONLY);
res = cuCtxPopCurrent(&oldCtx); // THIS IS ALLOWED TO FAIL
//Assign some hardware accel specific data to AvCodecContext
c->hw_device_ctx = m_avBufferRefDevice;//This must be done BEFORE avcodec_open2()
c->pix_fmt = AV_PIX_FMT_CUDA; //Since this is a cuda buffer, although its really opengl with a cuda ptr
c->hw_frames_ctx = m_avBufferRefFrame;
c->codec_type = AVMEDIA_TYPE_VIDEO;
c->sw_pix_fmt = AV_PIX_FMT_0BGR32;
// Setup some cuda stuff for memcpy-ing later
m_memCpyStruct.srcXInBytes = 0;
m_memCpyStruct.srcY = 0;
m_memCpyStruct.srcMemoryType = CUmemorytype::CU_MEMORYTYPE_ARRAY;
m_memCpyStruct.dstXInBytes = 0;
m_memCpyStruct.dstY = 0;
m_memCpyStruct.dstMemoryType = CUmemorytype::CU_MEMORYTYPE_DEVICE;
Keep in mind, although there is a lot done above, the code shown is IN ADDITION to the standard software encoding code. Make sure to include all those calls/object initialization as well.
Unlike the software version, all that is needed for the input AVFrame object is to get the buffer AFTER your alloc call:
// allocate RGB video frame buffer
ret = av_hwframe_get_buffer(m_avBufferRefFrame, rgb_frame, 0); // 0 is for flags, not used at the moment
Notice it takes in the hwframe_context as an argument, this is how it knows what device, size, format, etc to allocate for on the gpu.
Call to encode each frame
Now we are setup, and are ready to encode. Before each encode we need to copy the frame from the texture to a cuda buffer. We do this by mapping a cuda array to the texture then copying that array to a cuDeviceptr (which was allocated by the av_hwframe_get_buffer call above):
//Perform cuda mem copy for input buffer
CUresult cuRes;
CUarray mappedArray;
CUcontext oldCtx;
//Get context
cuRes = cuCtxPopCurrent(&oldCtx); // THIS IS ALLOWED TO FAIL
cuRes = cuCtxPushCurrent(*m_cuContext);
//Get Texture
cuRes = cuGraphicsResourceSetMapFlags(cuInpTexRes, CU_GRAPHICS_MAP_RESOURCE_FLAGS_READ_ONLY);
cuRes = cuGraphicsMapResources(1, &cuInpTexRes, 0);
//Map texture to cuda array
cuRes = cuGraphicsSubResourceGetMappedArray(&mappedArray, cuInpTexRes, 0, 0); // Nvidia says its good practice to remap each iteration as OGL can move things around
//Release texture
cuRes = cuGraphicsUnmapResources(1, &cuInpTexRes, 0);
//Setup for memcopy
m_memCpyStruct.srcArray = mappedArray;
m_memCpyStruct.dstDevice = (CUdeviceptr)rgb_frame->data[0]; // Make sure to copy devptr as it could change, upon resize
m_memCpyStruct.dstPitch = rgb_frame->linesize[0]; // Linesize is generated by hwframe_context
m_memCpyStruct.WidthInBytes = rgb_frame->width * 4; //* 4 needed for each pixel
m_memCpyStruct.Height = rgb_frame->height; //Vanilla height for frame
//Do memcpy
cuRes = cuMemcpy2D(&m_memCpyStruct);
//release context
cuRes = cuCtxPopCurrent(&oldCtx); // THIS IS ALLOWED TO FAIL
Now we can simply call send_frame and it all works!
ret = avcodec_send_frame(c, rgb_frame);
Note: I left most of my code out, since it is not for the public. I may have some details incorrect, this is how I was able to make sense of all the data I gathered over the past month...feel free to correct anything that is incorrect. Also, fun fact, during all this my computer crashed an I lost all my initial investigation (everything I didnt check into source control), which includes all the various example code I found around the internet. So if you see something an its yours, call it out please. This can help others come to the conclusion that I came to.
Shoutout
Big shout out to BtbN at https://webchat.freenode.net/ #ffmpeg, I wouldnt have gotten any of this without their help.
First thing to check is that it may be "bad" but is it running fast enough anyway? It's always nice to be more efficient but if it works, don't break it.
If there really is a performance problem...
1 Use FFMPEG software encoding only, without hardware assistance. Then you'll only be copying from GPU to CPU once. (If the video encoder is on the GPU and you're sending packets out via RTSP, there's a second GPU to CPU after encoding.)
2 Look for an NVIDIA (I assume that's the GPU given you talk about nvenc) GL extension to texture formats and/or commands that will perform on GPU H264 encoding directly to OpenGL buffers.

Error runtime update of DXT compressed textures with Directx11

Context:
I'm developing a native C++ Unity 5 plugin that reads in DXT compressed texture data and uploads it to the GPU for further use in Unity. The aim is to create an fast image-sequence player, updating image data on-the-fly. The textures are compressed with an offline console application.
Unity can work with different graphics engines, I'm aiming towards DirectX11 and OpenGL 3.3+.
Problem:
The DirectX runtime texture update code, through a mapped subresource, gives different outputs on different graphics drivers. Updating a texture through such a mapped resource means mapping a pointer to the texture data and memcpy'ing the data from the RAM buffer to the mapped GPU buffer. Doing so, different drivers seem to expect different parameters for the row pitch value when copying bytes. I never had problems on the several Nvidia GPU's I tested on, but AMD and Intel GPU seems to act differently and I get distorted output as shown underneath. Furthermore, I'm working with DXT1 pixel data (0.5bpp) and DXT5 data (1bpp). I can't seem to get the correct pitch parameter for these DXT textures.
Code:
The following initialisation code for generating the d3d11 texture and filling it with initial texture data - e.g. the first frame of an image sequence - works perfect on all drivers. The player pointer points to a custom class that handles all file reads and contains getters for the current loaded DXT compressed frame, it's dimensions, etc...
if (s_DeviceType == kUnityGfxRendererD3D11)
{
HRESULT hr;
DXGI_FORMAT format = (compression_type == DxtCompressionType::DXT_TYPE_DXT1_NO_ALPHA) ? DXGI_FORMAT_BC1_UNORM : DXGI_FORMAT_BC3_UNORM;
// Create texture
D3D11_TEXTURE2D_DESC desc;
desc.Width = w;
desc.Height = h;
desc.MipLevels = 1;
desc.ArraySize = 1;
desc.Format = format;
// no anti-aliasing
desc.SampleDesc.Count = 1;
desc.SampleDesc.Quality = 0;
desc.Usage = D3D11_USAGE_DYNAMIC;
desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
desc.MiscFlags = 0;
// Initial data: first frame
D3D11_SUBRESOURCE_DATA data;
data.pSysMem = player->getBufferPtr();
data.SysMemPitch = 16 * (player->getWidth() / 4);
data.SysMemSlicePitch = 0; // just a 2d texture, no depth
// Init with initial data
hr = g_D3D11Device->CreateTexture2D(&desc, &data, &dxt_d3d_tex);
if (SUCCEEDED(hr) && dxt_d3d_tex != 0)
{
DXT_VERBOSE("Succesfully created D3D Texture.");
DXT_VERBOSE("Creating D3D SRV.");
D3D11_SHADER_RESOURCE_VIEW_DESC SRVDesc;
memset(&SRVDesc, 0, sizeof(SRVDesc));
SRVDesc.Format = format;
SRVDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE2D;
SRVDesc.Texture2D.MipLevels = 1;
hr = g_D3D11Device->CreateShaderResourceView(dxt_d3d_tex, &SRVDesc, &textureView);
if (FAILED(hr))
{
dxt_d3d_tex->Release();
return hr;
}
DXT_VERBOSE("Succesfully created D3D SRV.");
}
else
{
DXT_ERROR("Error creating D3D texture.")
}
}
The following update code that runs for each new frame has the error somewhere. Please note the commented line containing method 1 using a simple memcpy without any rowpitch specified which works well on NVIDIA drivers.
You can see further in method 2 that I log the different row pitch values. For instace for a 1920x960 frame I get 1920 for the buffer stride, and 2048 for the runtime stride. This 128 pixels difference probably have to be padded (as can be seen in the example pic below) but I can't figure out how. When I just use the mappedResource.RowPitch without dividing it by 4 (done by the bitshift), Unity crashes.
ID3D11DeviceContext* ctx = NULL;
g_D3D11Device->GetImmediateContext(&ctx);
if (dxt_d3d_tex && bShouldUpload)
{
if (player->gather_stats) before_upload = ns();
D3D11_MAPPED_SUBRESOURCE mappedResource;
ctx->Map(dxt_d3d_tex, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResource);
/* 1: THIS CODE WORKS ON ALL NVIDIA DRIVERS BUT GENERATES DISTORTED OR NO OUTPUT ON AMD/INTEL: */
//memcpy(mappedResource.pData, player->getBufferPtr(), player->getBytesPerFrame());
/* 2: THIS CODE GENERATES OUTPUT BUT SEEMS TO NEED PADDING? */
BYTE* mappedData = reinterpret_cast<BYTE*>(mappedResource.pData);
BYTE* buffer = player->getBufferPtr();
UINT height = player->getHeight();
UINT buffer_stride = player->getBytesPerFrame() / player->getHeight();
UINT runtime_stride = mappedResource.RowPitch >> 2;
DXT_VERBOSE("Buffer stride: %d", buffer_stride);
DXT_VERBOSE("Runtime stride: %d", runtime_stride);
for (UINT i = 0; i < height; ++i)
{
memcpy(mappedData, buffer, buffer_stride);
mappedData += runtime_stride;
buffer += buffer_stride;
}
ctx->Unmap(dxt_d3d_tex, 0);
}
Example pic 1 - distorted ouput when using memcpy to copy whole buffer without using separate row pitch on AMD/INTEL (method 1)
Example pic 2 - better but still erroneous output when using above code with mappedResource.RowPitch on AMD/INTEL (method 2). The blue bars indicate zone of error, and need to disappear so all pixels align well and form one image.
Thanks for any pointers!
Best,
Vincent
The mapped data row pitch is in byte, when you divide by four, it is definitely an issue.
UINT runtime_stride = mappedResource.RowPitch >> 2;
...
mappedData += runtime_stride; // here you are only jumping one quarter of a row
It is the height count with a BC format that is divide by 4.
Also a BC1 format is 8 bytes per 4x4 block, so the line below should by 8 * and not 16 *, but as long as you handle row stride properly on your side, d3d will understand, you just waste half the memory here.
data.SysMemPitch = 16 * (player->getWidth() / 4);

NVIDIA CUDA Video Encoder (NVCUVENC) input from device texture array

I am modifying CUDA Video Encoder (NVCUVENC) encoding sample found in SDK samples pack so that the data comes not from external yuv files (as is done in the sample ) but from cudaArray which is filled from texture.
So the key API method that encodes the frame is:
int NVENCAPI NVEncodeFrame(NVEncoder hNVEncoder, NVVE_EncodeFrameParams *pFrmIn, unsigned long flag, void *pData);
If I get it right the param :
CUdeviceptr dptr_VideoFrame
is supposed to pass the data to encode.But I really haven't understood how to connect it with some texture data on GPU.The sample source code is very vague about it as it works with CPU yuv files input.
For example in main.cpp , lines 555 -560 there is following block:
// If dptrVideoFrame is NULL, then we assume that frames come from system memory, otherwise it comes from GPU memory
// VideoEncoder.cpp, EncodeFrame() will automatically copy it to GPU Device memory, if GPU device input is specified
if (pCudaEncoder->EncodeFrame(efparams, dptrVideoFrame, cuCtxLock) == false)
{
printf("\nEncodeFrame() failed to encode frame\n");
}
So ,from the comment, it seems like dptrVideoFrame should be filled with yuv data coming from device to encode the frame.But there is no place where it is explained how to do so.
UPDATE:
I would like to share some findings.First , I managed to encode data from Frame Buffer texture.The problem now is that the output video is a mess.
That is the desired result:
Here is what I do :
On OpenGL side I have 2 custom FBOs-first gets the scene rendered normally into it .Then the texture from the first FBO is used to render screen quad into second FBO doing RGB -> YUV conversion in the fragment shader.
The texture attached to second FBO is mapped then to CUDA resource.
Then I encode the current texture like this:
void CUDAEncoder::Encode(){
NVVE_EncodeFrameParams efparams;
efparams.Height = sEncoderParams.iOutputSize[1];
efparams.Width = sEncoderParams.iOutputSize[0];
efparams.Pitch = (sEncoderParams.nDeviceMemPitch ? sEncoderParams.nDeviceMemPitch : sEncoderParams.iOutputSize[0]);
efparams.PictureStruc = (NVVE_PicStruct)sEncoderParams.iPictureType;
efparams.SurfFmt = (NVVE_SurfaceFormat)sEncoderParams.iSurfaceFormat;
efparams.progressiveFrame = (sEncoderParams.iSurfaceFormat == 3) ? 1 : 0;
efparams.repeatFirstField = 0;
efparams.topfieldfirst = (sEncoderParams.iSurfaceFormat == 1) ? 1 : 0;
if(_curFrame > _framesTotal){
efparams.bLast=1;
}else{
efparams.bLast=0;
}
//----------- get cuda array from the texture resource -------------//
checkCudaErrorsDrv(cuGraphicsMapResources(1,&_cutexResource,NULL));
checkCudaErrorsDrv(cuGraphicsSubResourceGetMappedArray(&_cutexArray,_cutexResource,0,0));
/////////// copy data into dptrvideo frame //////////
// LUMA based on CUDA SDK sample//////////////
CUDA_MEMCPY2D pcopy;
memset((void *)&pcopy, 0, sizeof(pcopy));
pcopy.srcXInBytes = 0;
pcopy.srcY = 0;
pcopy.srcHost= NULL;
pcopy.srcDevice= 0;
pcopy.srcPitch =efparams.Width;
pcopy.srcArray= _cutexArray;///SOME DEVICE ARRAY!!!!!!!!!!!!! <--------- to figure out how to fill this.
/// destination //////
pcopy.dstXInBytes = 0;
pcopy.dstY = 0;
pcopy.dstHost = 0;
pcopy.dstArray = 0;
pcopy.dstDevice=dptrVideoFrame;
pcopy.dstPitch = sEncoderParams.nDeviceMemPitch;
pcopy.WidthInBytes = sEncoderParams.iInputSize[0];
pcopy.Height = sEncoderParams.iInputSize[1];
pcopy.srcMemoryType=CU_MEMORYTYPE_ARRAY;
pcopy.dstMemoryType=CU_MEMORYTYPE_DEVICE;
// CHROMA based on CUDA SDK sample/////
CUDA_MEMCPY2D pcChroma;
memset((void *)&pcChroma, 0, sizeof(pcChroma));
pcChroma.srcXInBytes = 0;
pcChroma.srcY = 0;// if I uncomment this line I get error from cuda for incorrect value.It does work in CUDA SDK original sample SAMPLE//sEncoderParams.iInputSize[1] << 1; // U/V chroma offset
pcChroma.srcHost = NULL;
pcChroma.srcDevice = 0;
pcChroma.srcArray = _cutexArray;
pcChroma.srcPitch = efparams.Width >> 1; // chroma is subsampled by 2 (but it has U/V are next to each other)
pcChroma.dstXInBytes = 0;
pcChroma.dstY = sEncoderParams.iInputSize[1] << 1; // chroma offset (srcY*srcPitch now points to the chroma planes)
pcChroma.dstHost = 0;
pcChroma.dstDevice = dptrVideoFrame;
pcChroma.dstArray = 0;
pcChroma.dstPitch = sEncoderParams.nDeviceMemPitch >> 1;
pcChroma.WidthInBytes = sEncoderParams.iInputSize[0] >> 1;
pcChroma.Height = sEncoderParams.iInputSize[1]; // U/V are sent together
pcChroma.srcMemoryType = CU_MEMORYTYPE_ARRAY;
pcChroma.dstMemoryType = CU_MEMORYTYPE_DEVICE;
checkCudaErrorsDrv(cuvidCtxLock(cuCtxLock, 0));
checkCudaErrorsDrv( cuMemcpy2D(&pcopy));
checkCudaErrorsDrv( cuMemcpy2D(&pcChroma));
checkCudaErrorsDrv(cuvidCtxUnlock(cuCtxLock, 0));
//=============================================
// If dptrVideoFrame is NULL, then we assume that frames come from system memory, otherwise it comes from GPU memory
// VideoEncoder.cpp, EncodeFrame() will automatically copy it to GPU Device memory, if GPU device input is specified
if (_encoder->EncodeFrame(efparams, dptrVideoFrame, cuCtxLock) == false)
{
printf("\nEncodeFrame() failed to encode frame\n");
}
checkCudaErrorsDrv(cuGraphicsUnmapResources(1, &_cutexResource, NULL));
// computeFPS();
if(_curFrame > _framesTotal){
_encoder->Stop();
exit(0);
}
_curFrame++;
}
I set Encoder params from the .cfg files included with CUDA SDK Encoder sample.So here I use 704x480-h264.cfg setup .I tried all of them and getting always similarly ugly result.
I suspect the problem is somewhere in CUDA_MEMCPY2D for luma and chroma objects params setup .May be wrong pitch , width ,height dimensions.I set the viewport the same size as the video (704,480) and compared params to those used in CUDA SDK sample but got no clue where the problem is.
Anyone ?
First: I messed around with Cuda Video Encoder, and had lots of troubles to. But it Looks to me as if you convert it to Yuv values, but as a one on one Pixel conversion (like AYUV 4:4:4). Afaik you need the correct kind of YUV with padding and compression (color values for more than one Pixel like 4:2:0). A good overview of YUV-alignments can be seen here:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd206750(v=vs.85).aspx
As far as I remember you have to use NV12 alignment for Cuda Encoder.
nvEncoder application is used for codec conversion,for processing over GPU its used cuda and communicating with hardware it use API of nvEncoder.
in this application logic is read yuv data in input buffer and store that content in memory and then start encoding the frames.
and parallel write the encoding frame in to output file.
Handling of input buffer is available in nvRead function and it is available in nvFileIO.h
any other help required leave a message here...