Obtaining decoder MFT for H.264 video - c++

I am trying to get a hardware decoder from Media Foundation. I know for sure my GPU supports NVDEC hardware decoding. I found an example on GitHub which gets the encoder (NVENC) without any problem, but when I switch the parameters to the decoder, I either get a bad HRESULT or a crash. I even tried getting a software decoder by changing the hardware flag, and I still get a bad HRESULT. Does anyone have an idea what is wrong? I can't think of anything else left for me to try or change.
HRESULT get_decoder(CComPtr<IMFTransform>& out_transform, CComPtr<IMFActivate>& out_activate,
                    CComPtr<IMFAttributes>& out_attributes)
{
    HRESULT hr = S_OK;

    // Find the decoder
    CComHeapPtr<IMFActivate*> activate_raw;
    uint32_t activateCount = 0;

    // Input & output types
    const MFT_REGISTER_TYPE_INFO in_info = { MFMediaType_Video, MFVideoFormat_H264 };
    const MFT_REGISTER_TYPE_INFO out_info = { MFMediaType_Video, MFVideoFormat_NV12 };

    // Get decoders matching the specified attributes
    if (FAILED(hr = MFTEnum2(MFT_CATEGORY_VIDEO_DECODER, MFT_ENUM_FLAG_SYNCMFT | MFT_ENUM_FLAG_SORTANDFILTER,
                             &in_info, &out_info, nullptr, &activate_raw, &activateCount)))
        return hr;

    // Choose the first returned decoder
    out_activate = activate_raw[0];

    // Memory management
    for (int i = 1; i < activateCount; i++)
        activate_raw[i]->Release();

    // Activate
    if (FAILED(hr = out_activate->ActivateObject(IID_PPV_ARGS(&out_transform))))
        return hr;

    // Get attributes
    if (FAILED(hr = out_transform->GetAttributes(&out_attributes)))
        return hr;

    std::cout << "- get_decoder() Found " << activateCount << " decoders" << std::endl;
    return hr;
}

There might be no dedicated decoder MFT for hardware decoding (even though some vendors supply one). Hardware video decoding, in contrast to encoding, is available via the DXVA 2 API and is, in turn, covered by the Microsoft H264 Video Decoder MFT.
This stock MFT is capable of decoding with hardware acceleration and is also compatible with both D3D9 and D3D11 enabled pipelines.
Microsoft H264 Video Decoder MFT
6 Attributes:
MFT_TRANSFORM_CLSID_Attribute: {62CE7E72-4C71-4D20-B15D-452831A87D9D} (Type VT_CLSID, CLSID_CMSH264DecoderMFT)
MF_TRANSFORM_FLAGS_Attribute: MFT_ENUM_FLAG_SYNCMFT
MFT_INPUT_TYPES_Attributes: MFVideoFormat_H264, MFVideoFormat_H264_ES
MFT_OUTPUT_TYPES_Attributes: MFVideoFormat_NV12, MFVideoFormat_YV12, MFVideoFormat_IYUV, MFVideoFormat_I420, MFVideoFormat_YUY2
Attributes
MF_SA_D3D_AWARE: 1 (Type VT_UI4)
MF_SA_D3D11_AWARE: 1 (Type VT_UI4)
CODECAPI_AVDecVideoThumbnailGenerationMode: 0 (Type VT_UI4)
CODECAPI_AVDecVideoMaxCodedWidth: 7680 (Type VT_UI4)
CODECAPI_AVDecVideoMaxCodedHeight: 4320 (Type VT_UI4)
CODECAPI_AVDecNumWorkerThreads: 4294967295 (Type VT_UI4, -1)
CODECAPI_AVDecVideoAcceleration_H264: 1 (Type VT_UI4)
...
From MSDN:
CODECAPI_AVDecVideoAcceleration_H264 Enables or disables hardware acceleration.
...
Maximum Resolution 4096 × 2304 pixels
The maximum guaranteed resolution for DXVA acceleration is 1920 × 1088 pixels; at higher resolutions, decoding is done with DXVA, if it is supported by the underlying hardware, otherwise, decoding is done with software.
...
DXVA The decoder supports DXVA version 2, but not DXVA version 1. DXVA decoding is supported only for Main-compatible Baseline, Main, and High profile bitstreams. (Main-compatible Baseline bitstreams are defined as profile_idc=66 and constrained_set1_flag=1.)
To decode with hardware acceleration, just use the Microsoft H264 Video Decoder MFT.
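For reference, a minimal sketch (my own, not the poster's code) of creating this stock decoder directly and checking its D3D11 awareness; error handling and MFStartup/CoInitialize are omitted:
#include <mfapi.h>
#include <mftransform.h>
#include <wmcodecdsp.h>   // CLSID_CMSH264DecoderMFT
#include <atlbase.h>

// Instantiate the Microsoft H264 Video Decoder MFT and query its D3D11 awareness.
// If MF_SA_D3D11_AWARE is 1, an IMFDXGIDeviceManager can be handed to the MFT via
// MFT_MESSAGE_SET_D3D_MANAGER to enable DXVA-accelerated decoding.
HRESULT create_h264_decoder(CComPtr<IMFTransform>& out_transform)
{
    CComPtr<IMFTransform> transform;
    HRESULT hr = transform.CoCreateInstance(CLSID_CMSH264DecoderMFT);
    if (FAILED(hr))
        return hr;

    CComPtr<IMFAttributes> attributes;
    if (SUCCEEDED(transform->GetAttributes(&attributes)))
    {
        UINT32 d3d11_aware = 0;
        attributes->GetUINT32(MF_SA_D3D11_AWARE, &d3d11_aware);
        // d3d11_aware is 1 for the stock decoder, per the attribute dump above
    }

    out_transform = transform;
    return S_OK;
}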

Related

Trying to use a MFT in Media Foundation Encoding

The idea is to use a Media Foundation Transform, such as the Video Stabilization MFT while transcoding a video with Media Foundation.
When not using an MFT, the code works fine.
Create IMFSourceReader for the source file - OK
Create IMFSinkWriter for the target file - OK
Add a stream to the writer describing the Video - OK
Add the audio stream - OK
Set input types for video and audio - OK
Loop to read samples and send them to the sink writer - OK (sketched just below)
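For context, a rough sketch of that working loop (synchronous reading assumed; reader, writer and the sinkStreamIndexFor mapping helper are placeholder names, not the actual code):
for (;;)
{
    DWORD streamIndex = 0, flags = 0;
    LONGLONG timestamp = 0;
    CComPtr<IMFSample> sample;
    HRESULT hr = reader->ReadSample(MF_SOURCE_READER_ANY_STREAM, 0,
                                    &streamIndex, &flags, &timestamp, &sample);
    if (FAILED(hr) || (flags & MF_SOURCE_READER_FLAG_ENDOFSTREAM))
        break;                      // stops at the first end-of-stream for brevity
    if (sample)
        writer->WriteSample(sinkStreamIndexFor(streamIndex), sample);
}
writer->Finalize();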
When using the MFT, these are the facts. To create the MFT (error checking removed):
CComPtr<IMFTransform> trs;
trs.CoCreateInstance(CLSID_CMSVideoDSPMFT);
std::vector<DWORD> iids;
std::vector<DWORD> oods;
DWORD is = 0, os = 0;
hr = trs->GetStreamCount(&is, &os);
iids.resize(is);
oods.resize(os);
hr = trs->GetStreamIDs(is, iids.data(), os, oods.data());
CComPtr<IMFMediaType> ptype;
CComPtr<IMFMediaType> ptype2;
MFCreateMediaType(&ptype);
MFCreateMediaType(&ptype2);
SourceVideoType->CopyAllItems(ptype);
SourceVideoType->CopyAllItems(ptype2);
ptype->SetUINT32(MF_VIDEODSP_MODE, MFVideoDSPMode_Stabilization);
// LogMediaType(ptype);
ptype2->SetUINT32(MF_VIDEODSP_MODE, MFVideoDSPMode_Stabilization);
// LogMediaType(ptype2);
hr = trs->SetInputType(iids[0], ptype, 0);
auto hr2 = trs->SetOutputType(oods[0], ptype2, 0);
if (SUCCEEDED(hr) && SUCCEEDED(hr2))
{
    VideoStabilizationMFT = trs;
}
This code works - the MFT is successfully configured. However, in my sample processing loop:
// pSample = sample obtained from the reader
CComPtr<IMFSample> pSample2;
LONGLONG dur = 0, tim = 0;
pSample->GetSampleDuration(&dur);
pSample->GetSampleTime(&tim);
trs->ProcessInput(0, pSample, 0);
MFT_OUTPUT_STREAM_INFO si = {};
trs->GetOutputStreamInfo(0, &si);
// Create pSample2 with a buffer large enough for the output
MFCreateSample(&pSample2);
CComPtr<IMFMediaBuffer> bb;
MFCreateMemoryBuffer(si.cbSize, &bb);
pSample2->AddBuffer(bb);
// Output descriptor wrapping pSample2
MFT_OUTPUT_DATA_BUFFER db = {};
db.pSample = pSample2;
DWORD st = 0;
hr = trs->ProcessOutput(0, 1, &db, &st);
This last call fails initially with MF_E_TRANSFORM_NEED_MORE_INPUT. I can understand that the MFT needs more than one sample to achieve stabilization, so I skip this sample for the writer.
When the call succeeds, I get a sample with no time or duration. Even if I set the time and duration manually, the sink writer fails with E_INVALIDARG.
What do I miss?
With this source code I provide, the sink writer returns S_OK :
VideoStabilizationMFT
If Microsoft is reading this, what are these GUIDs from CLSID_CMSVideoDSPMFT?
Guid : 44A4AB4B-1D0C-4181-9293-E2F37680672E : VT_UI4 = 4
Guid : 8252D735-8CB3-4A2E-A296-894E7B738059 : VT_R8 = 0.869565
Guid : 9B2DEAFE-37EC-468C-90FF-024E22BD6BC6 : VT_UI4 = 0
Guid : B0052692-FC62-4F21-A1DD-B9DFE1CEB9BF : VT_R8 = 0.050000
Guid : C8DA7888-14AA-43AE-BDF2-BF9CC48E12BE : VT_UI4 = 4
Guid : EF77D08F-7C9C-40F3-9127-96F760903367 : VT_UI4 = 0
Guid : F67575DF-EA5C-46DB-80C4-CEB7EF3A1701 : VT_UI4 = 1
Microsoft, are you serious ?
According to this documentation: https://learn.microsoft.com/en-us/windows/win32/medfound/video-stabilization-mft
On Win10:
The Video Stabilization MFT is MF_SA_D3D11_AWARE; the documentation does not mention this.
The Video Stabilization MFT can fall back to software mode; the documentation does not mention this either (see MF_SA_D3D11_AWARE).
The Video Stabilization MFT has dynamic format change, but MFT_SUPPORT_DYNAMIC_FORMAT_CHANGE is not present in IMFTransform::GetAttributes.
The Video Stabilization MFT implements IMFGetService/IMFRealTimeClientEx/IMFShutdown; this is not in the documentation.
The Video Stabilization MFT only handles MFVideoFormat_NV12; the documentation speaks of MEDIASUBTYPE_YUY2.
The documentation tells you to include Camerauicontrol.h, seriously...
Having said that, this MFT is really good at doing stabilization...
This is strange: you set the attribute on the IMFMediaType, not on the Video Stabilization MFT:
ptype->SetUINT32(MF_VIDEODSP_MODE, MFVideoDSPMode_Stabilization);
ptype2->SetUINT32(MF_VIDEODSP_MODE, MFVideoDSPMode_Stabilization);
Should be :
Call IMFTransform::GetAttributes on the video stabilization MFT to get an IMFAttributes pointer.
Call IMFAttributes::SetUINT32 to set the attribute.
MF_VIDEODSP_MODE attribute
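In code, that looks roughly like this (a sketch, reusing the trs transform created above):
CComPtr<IMFAttributes> mftAttributes;
hr = trs->GetAttributes(&mftAttributes);   // attributes of the MFT itself, not a media type
if (SUCCEEDED(hr))
    hr = mftAttributes->SetUINT32(MF_VIDEODSP_MODE, MFVideoDSPMode_Stabilization);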

Windows MFT (Media Foundation Transform) decoder not returning proper sample time or duration

To decode an H264 stream with a Windows Media Foundation Transform, the workflow is currently something like this:
IMFSample* sample;                        // created earlier, e.g. via MFCreateSample()
sample->SetSampleTime(time_in_ns);        // IMFSample times/durations are in 100-ns units
sample->SetSampleDuration(duration_in_ns);
sample->AddBuffer(buffer);

// Feed the IMFSample to the decoder
mDecoder->ProcessInput(0, sample, 0);

// Get output from the decoder.
/* create outputsample that will receive content */ { ... }
MFT_OUTPUT_DATA_BUFFER output = {0};
output.pSample = outputsample;
DWORD status = 0;
HRESULT hr = mDecoder->ProcessOutput(0, 1, &output, &status);
if (output.pEvents) {
    // We must release this, as per the IMFTransform::ProcessOutput()
    // MSDN documentation.
    output.pEvents->Release();
    output.pEvents = nullptr;
}
if (hr == MF_E_TRANSFORM_STREAM_CHANGE) {
    // Type change, probably a geometric aperture change.
    // Reconfigure the decoder output type (GetOutputAvailableType / SetOutputType).
} else if (hr == MF_E_TRANSFORM_NEED_MORE_INPUT) {
    // Not enough input to produce output.
} else if (!output.pSample) {
    return S_OK;
} else {
    // Process output
}
When we have fed all data to the MFT decoder, we must drain it:
mDecoder->ProcessMessage(MFT_MESSAGE_COMMAND_DRAIN, 0);
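Draining then means calling ProcessOutput repeatedly until the decoder reports MF_E_TRANSFORM_NEED_MORE_INPUT; roughly (a sketch reusing the output sample allocation shown above):
for (;;) {
    MFT_OUTPUT_DATA_BUFFER output = {};
    output.pSample = outputsample;        // pre-allocated output sample
    DWORD status = 0;
    HRESULT hr = mDecoder->ProcessOutput(0, 1, &output, &status);
    if (output.pEvents) { output.pEvents->Release(); output.pEvents = nullptr; }
    if (hr == MF_E_TRANSFORM_NEED_MORE_INPUT)
        break;                            // decoder is fully drained
    if (FAILED(hr))
        break;
    // forward output.pSample downstream
}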
Now, one thing with the WMF H264 decoder is that it will typically not output anything before having been fed over 30 compressed H264 frames, regardless of the size of the H264 sliding window. Latency is very high...
I'm encountering an issue that is very troublesome.
With a video made only of keyframes, which has only 15 frames, each being 2 s long, the first frame has a non-zero presentation time (this stream is from live content, so the first frame is typically at epoch time).
So without draining the decoder, nothing will come out of the decoder as it hasn't received enough frames.
However, once the decoder is drained, the decoded frame will come out. HOWEVER, the MFT decoder has set all durations to 33.6ms only and the presentation time of the first sample coming out is always 0.
The original duration and presentation time have been lost.
If you provide over 30 frames to the h264 decoder, then both duration and pts are valid...
I haven't yet found a way to get the WMF decoder to output samples with the proper value.
It appears that if you have to drain the decoder before it has output any samples by itself, then it's totally broken...
Has anyone experienced such problems? How did you get around it?
Thank you in advance
Edit: a sample of the video is available on http://people.mozilla.org/~jyavenard/mediatest/fragmented/1301869.mp4
Playing this video with Firefox will cause it to play extremely quickly due to the problems described above.
I'm not sure that your workflow is correct. I think you should do something like this:
do
{
    ...
    hr = mDecoder->ProcessInput(0, sample, 0);
    if (FAILED(hr))
        break;
    ...
    hr = mDecoder->ProcessOutput(0, 1, &output, &status);
    if (FAILED(hr) && hr != MF_E_TRANSFORM_NEED_MORE_INPUT)
        break;
}
while (hr == MF_E_TRANSFORM_NEED_MORE_INPUT);

if (SUCCEEDED(hr))
{
    // You have a valid decoded frame here
}
The idea is to keep calling ProcessInput/ProcessOutput while ProcessOutput returns MF_E_TRANSFORM_NEED_MORE_INPUT, which means the decoder needs more input. I think that with this loop you won't need to drain the decoder.

ALSA - samplerate conversion

I have a text-to-speech application that generates an audio stream (raw data) with a sample rate of 22 kHz.
I have a USB sound card that supports only 44 kHz.
With my asound.conf I can play WAV files containing 22 kHz and 44 kHz audio streams without problems in aplay.
My application uses the alsa-libs and sets the sample rate of the device.
In this case only 44 kHz will succeed, because the hardware supports only this sample rate. But now, when I write the generated audio stream to ALSA, it sounds wrong, because the sample rates don't match. The audio stream (raw data) doesn't contain any header information, so I think ALSA doesn't use any plugin to convert the sample rate; ALSA doesn't know that the stream has a different sample rate.
My question now is: what is the right way to tell ALSA that the generated audio stream has a different sample rate, so that the ALSA plugin converts it?
The following code works on the USB sound card only with sampleRate = 44100; otherwise an error occurs (-22, invalid parameters).
void initAlsa()
{
    const char* name = "default";
    alsaAudio = true;
    writeRiffAtClose = false;
    int err = snd_pcm_open(&alsaPlaybackHandle, name, SND_PCM_STREAM_PLAYBACK, 0);
    if (err < 0)
        throw TtsException({"Alsa: cannot open playback audio device ", name, " (", snd_strerror(err), ")"}, 0);
    sampleRate = 44100;
    err = snd_pcm_set_params(alsaPlaybackHandle,            // pcm            PCM handle
                             SND_PCM_FORMAT_S16_LE,         // format         required PCM format
                             SND_PCM_ACCESS_RW_INTERLEAVED, // access         required PCM access
                             2,                             // channels       required PCM channels (stereo)
                             sampleRate,                    // rate           required sample rate in Hz
                             1,                             // soft_resample  0 = disallow alsa-lib resampling, 1 = allow it
                             250000);                       // latency        required overall latency in us (0.25 s)
    if (err < 0)
        throw TtsException({"Alsa: cannot set parameters (", err, " = ", snd_strerror(err), ") on ", name}, 0);
    LOG_DEBUG("Alsa audio initialized");
}
The other way would be to manually convert the sample rate before I pass the data to ALSA, but I think: why not use the ALSA plugin?
I don't have the option of getting a 44 kHz audio stream from the TTS engine (it's another piece of software).
Or is there another way that I don't see?
Best regards.
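(For what it's worth, the usual way to let alsa-lib do the conversion is to open the device through the plug plugin and request the stream's native rate; the device name below is an assumption, adjust it for the USB card:)
snd_pcm_t* handle = NULL;
// "plug:hw:1,0" assumes the USB card is card 1, device 0; the plug plugin
// converts rate/format/channels to whatever the hardware actually supports.
int err = snd_pcm_open(&handle, "plug:hw:1,0", SND_PCM_STREAM_PLAYBACK, 0);
if (err == 0)
    err = snd_pcm_set_params(handle,
                             SND_PCM_FORMAT_S16_LE,
                             SND_PCM_ACCESS_RW_INTERLEAVED,
                             2,         // channels
                             22050,     // native rate of the generated TTS stream
                             1,         // allow alsa-lib resampling
                             250000);   // 0.25 s latency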

Media Foundation onReadSample wrong size of returned sample

I am working on translating a capture library from DirectShow to Media Foundation. The capture library seemed to work quite well, but I face a problem with an integrated webcam on a tablet running Windows 8 32-bit.
When enumerating the capture format (as explained in Media Foundation documentation), I got the following supported format for the camera:
0 : MFVideoFormat_NV12, resolution : 448x252, framerate : 30000x1001
1 : MFVideoFormat_YUY2, resolution : 448x252, framerate : 30000x1001
2 : MFVideoFormat_NV12, resolution : 640x360, framerate : 30000x1001
3 : MFVideoFormat_YUY2, resolution : 640x360, framerate : 30000x1001
4 : MFVideoFormat_NV12, resolution : 640x480, framerate : 30000x1001
5 : MFVideoFormat_YUY2, resolution : 640x480, framerate : 30000x1001
I then set the capture format, in this case the one at index 5, using the following function, as described in the example:
hr = pHandler->SetCurrentMediaType(pType);
This function executed without error. The camera should thus be configured to capture in YUY2 with a resolution of 640*480.
In the OnReadSample callback, I should receive a sample with a buffer of size:
640 * 480 * sizeof(unsigned char) * 2 = 614400 // YUY2 uses 2 bytes per pixel
However, I get a sample with a buffer of size 169344. Below is part of the callback function.
HRESULT SourceReader::OnReadSample(
    HRESULT hrStatus,
    DWORD dwStreamIndex,
    DWORD dwStreamFlags,
    LONGLONG llTimeStamp,
    IMFSample *pSample      // Can be NULL
    )
{
    EnterCriticalSection(&m_critsec);
    if (pSample)
    {
        DWORD expectedBufferSize = 640*480*1*2; // = 614400 (hard-coded for the example)
        IMFMediaBuffer* buffer = NULL;
        hr = pSample->ConvertToContiguousBuffer(&buffer);
        if (FAILED(hr))
        {
            //...
            goto done;
        }
        DWORD byteLength = 0;
        BYTE* pixels = NULL;
        hr = buffer->Lock(&pixels, NULL, &byteLength);
        // byteLength is 169344 instead of 614400
        if (byteLength > 0 && byteLength == expectedBufferSize)
        {
            // do something with the image, but we never get here because byteLength is wrong
        }
        //...
Any advice why I get a sample of size 169344 ?
Thanks in advance
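(For reference: 169344 = 448 * 252 * 3 / 2, i.e. exactly one NV12 frame at the 448x252 format listed at index 0, which suggests the requested 640x480 YUY2 type was never actually applied to the stream being read.)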
Thanks Mgetz for your answer.
I checked the value of MF_MT_INTERLACE_MODE of the media type and it appears that the video stream contains progressive frames. The value of MF_MT_INTERLACE_MODE returns MFVideoInterlace_Progressive.
hr = pHandler->SetCurrentMediaType(m_pType);
if (FAILED(hr))
{
    //...
}
else
{
    // get info about interlacing
    UINT32 interlaceFormat = MFVideoInterlace_Unknown;
    m_pType->GetUINT32(MF_MT_INTERLACE_MODE, &interlaceFormat);
    //...
So the video stream is not interlaced. I then checked, in the OnReadSample callback, the value of MFSampleExtension_Interlaced to see whether the sample is interlaced or not, and it appears that the sample is interlaced.
if (pSample && m_bCapture)
{
    // check if interlaced
    UINT32 isSampleInterlaced = 0;
    pSample->GetUINT32(MFSampleExtension_Interlaced, &isSampleInterlaced);
    if (isSampleInterlaced)
    {
        // enters here
    }
How is it possible that the stream is progressive and the sample is interlaced? I double-checked the value of MF_MT_INTERLACE_MODE in the OnReadSample callback as well, and it still gives me MFVideoInterlace_Progressive.
Concerning your first suggestion, I didn't find a way to force the MFT_INPUT_STREAM_WHOLE_SAMPLES flag on the input stream.
Thanks in advance
I still face the issue and I am now investigating on the different streams available.
According to the documentation, each media source provides a presentation descriptor from which we can get the streams available. To get the presentation descriptor, we have to call:
HRESULT hr = pSource->CreatePresentationDescriptor(&pPD);
I then request the streams available using the IMFPresentationDescriptor::GetStreamDescriptorCount function:
DWORD nbrStream;
pPD->GetStreamDescriptorCount(&nbrStream);
When requesting this information for the front webcam of an ACER tablet running Windows 8, I found that three streams are available. I looped over these streams, requested their media type handlers and checked the major type. All three streams have MFMediaType_Video as their major type, so all of them are video streams. When listing the media types available on the different streams, I found that all the streams support capture at 640x480 (some of the streams have more available media types).
I tested selecting each of the different streams with the appropriate format type (the framework did not return any error), but I still do not receive the correct sample in the callback function...
Any advice to progress on the issue?
Finally found the issue: I had to set the media type on the source reader directly, using SourceReader->SetCurrentMediaType(..). That did the trick!
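A rough sketch of that fix (pReader being the IMFSourceReader and pType the 640x480 YUY2 media type selected above):
hr = pReader->SetCurrentMediaType(
        (DWORD)MF_SOURCE_READER_FIRST_VIDEO_STREAM,
        NULL,       // reserved
        pType);     // the reader now delivers samples in this format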
Thanks for your help!
Without knowing what the input media type descriptor is we can largely only speculate, but the most likely answer is you are saying you can handle the stream even though MFT_INPUT_STREAM_WHOLE_SAMPLES is not set on the input stream.
The next most likely cause is interlacing, in which case each frame would be complete but not at the full resolution you are assuming. Regardless, you should verify the ENTIRE media type descriptor before accepting it.

Filling CMediaType and IMediaSample from AVPacket for h264 video

I have searched and found almost nothing, so I would really appreciate some help with my question.
I am writing a DirectShow source filter which uses libav to read H264 packets from a YouTube FLV file and send them downstream. But I can't find the appropriate libav structure fields to correctly implement the filter's GetMediaType() and FillBuffer(). Some libav fields are null. As a consequence, the H264 decoder crashes when it attempts to process the received data.
Where am I wrong? In working with libav or with the DirectShow interfaces? Maybe H264 requires additional processing when working with libav, or maybe I fill the reference time incorrectly? Does anyone have links useful for writing a DirectShow H264 source filter with libav?
Part of GetMediaType():
VIDEOINFOHEADER *pvi = (VIDEOINFOHEADER*) toMediaType->AllocFormatBuffer(sizeof(VIDEOINFOHEADER));
pvi->AvgTimePerFrame = UNITS_PER_SECOND / m_pFormatContext->streams[m_streamNo]->codec->sample_rate; //sample_rate is 0
pvi->dwBitRate = m_pFormatContext->bit_rate;
pvi->rcSource = videoRect;
pvi->rcTarget = videoRect;
//Bitmap
pvi->bmiHeader.biSize = sizeof(BITMAPINFOHEADER);
pvi->bmiHeader.biWidth = videoRect.right;
pvi->bmiHeader.biHeight = videoRect.bottom;
pvi->bmiHeader.biPlanes = 1;
pvi->bmiHeader.biBitCount = m_pFormatContext->streams[m_streamNo]->codec->bits_per_raw_sample;//or should here be bits_per_coded_sample
pvi->bmiHeader.biCompression = FOURCC_H264;
pvi->bmiHeader.biSizeImage = GetBitmapSize(&pvi->bmiHeader);
Part of FillBuffer():
//Get buffer pointer
BYTE* pBuffer = NULL;
if (pSamp->GetPointer(&pBuffer) < 0)
return S_FALSE;
//Get next packet
AVPacket* pPacket = m_mediaFile.getNextPacket();
if (pPacket->data == NULL)
return S_FALSE;
//Check packet and buffer size
if (pSamp->GetSize() < pPacket->size)
return S_FALSE;
//Copy from packet to sample buffer
memcpy(pBuffer, pPacket->data, pPacket->size);
//Set media sample time
REFERENCE_TIME start = m_mediaFile.timeStampToReferenceTime(pPacket->pts);
REFERENCE_TIME duration = m_mediaFile.timeStampToReferenceTime(pPacket->duration);
REFERENCE_TIME end = start + duration;
pSamp->SetTime(&start, &end);
pSamp->SetMediaTime(&start, &end);
P.S. I've debugged my filter with hax264 decoder and it crashes on call to libav deprecated function img_convert().
Here is the MSDN link you need to build a correct H.264 media type: H.264 Video Types
You have to fill the right fields with the right values.
The AM_MEDIA_TYPE should contain the right MEDIASUBTYPE for h264.
And these are plain wrong :
pvi->bmiHeader.biWidth = videoRect.right;
pvi->bmiHeader.biHeight = videoRect.bottom;
You should use a width/height that is independent of rcSource/rcTarget, since those rectangles are only indicators and may even be completely zero if you take them from another filter.
pvi->bmiHeader.biBitCount = m_pFormatContext->streams[m_streamNo]->codec->bits_per_raw_sample;//or should here be bits_per_coded_sample
This only makes sense if biWidth*biHeight*biBitCount/8 is the true size of the sample. I do not think it is...
pvi->bmiHeader.biCompression = FOURCC_H264;
This must also be passed in the AM_MEDIA_TYPE in the subtype parameter.
pvi->bmiHeader.biSizeImage = GetBitmapSize(&pvi->bmiHeader);
This fails because the FOURCC is unknown to the function, and the bit count is plainly wrong for this sample since it is not a full uncompressed frame.
You have to take a look at how the data stream is handled by the downstream h264 filter. This seems to be flawed.
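To make that concrete, here is a rough sketch of an H.264 media type along the lines of the MSDN article (codedWidth/codedHeight are placeholders for the coded frame size parsed from the stream; MEDIASUBTYPE_H264 comes from wmcodecdsp.h):
// Annex B H.264 elementary stream, VIDEOINFOHEADER2 format block
VIDEOINFOHEADER2* pvi2 =
    (VIDEOINFOHEADER2*)toMediaType->AllocFormatBuffer(sizeof(VIDEOINFOHEADER2));
ZeroMemory(pvi2, sizeof(VIDEOINFOHEADER2));
pvi2->bmiHeader.biSize        = sizeof(BITMAPINFOHEADER);
pvi2->bmiHeader.biWidth       = codedWidth;
pvi2->bmiHeader.biHeight      = codedHeight;
pvi2->bmiHeader.biCompression = MAKEFOURCC('H','2','6','4');
// biBitCount and biSizeImage can stay 0 for compressed video

toMediaType->SetType(&MEDIATYPE_Video);
toMediaType->SetSubtype(&MEDIASUBTYPE_H264);   // must match biCompression
toMediaType->SetFormatType(&FORMAT_VideoInfo2);
toMediaType->SetTemporalCompression(TRUE);
toMediaType->SetVariableSize();                // samples are not full uncompressed frames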