Windows Media Foundation: how to use IMFMediaSink::AddStreamSink (C++)

I am implementing a sample application using Windows Media Foundation.
I created an example application as described in the link below:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms703190(v=vs.85).aspx
In that example I added two video streams using MFCreateAggregateSource.
In the EVR renderer I can hear the audio of both videos, but I can only see one video: the reference stream, i.e. whichever stream is loaded first.
According to the link below,
https://msdn.microsoft.com/en-us/library/windows/desktop/aa965265(v=vs.85).aspx
the EVR media sink initially has one stream sink, which corresponds to the reference stream. To add new stream sinks, call IMFMediaSink::AddStreamSink.
In my application I am using MFCreateVideoRendererActivate.
How can I use IMFMediaSink::AddStreamSink to add streams to my EVR, so that I can see both video streams playing in one renderer?
**Update**
I modified the example code below, adding the else-if branch that creates the video renderer and attempts to add a second stream sink:
HRESULT CreateMediaSinkActivate(
    IMFStreamDescriptor *pSourceSD,   // Pointer to the stream descriptor.
    DWORD iStream,
    HWND hVideoWindow,                // Handle to the video clipping window.
    IMFActivate **ppActivate
    )
{
    IMFMediaTypeHandler *pHandler = NULL;
    IMFActivate *pActivate = NULL;

    // Get the media type handler for the stream.
    HRESULT hr = pSourceSD->GetMediaTypeHandler(&pHandler);
    if (FAILED(hr))
    {
        goto done;
    }

    // Get the major media type.
    GUID guidMajorType;
    hr = pHandler->GetMajorType(&guidMajorType);
    if (FAILED(hr))
    {
        goto done;
    }

    // Create an IMFActivate object for the renderer, based on the media type.
    if (MFMediaType_Audio == guidMajorType)
    {
        // Create the audio renderer.
        hr = MFCreateAudioRendererActivate(&pActivate);
    }
    else if (MFMediaType_Video == guidMajorType) // Added this else-if case.
    {
        // Create the video renderer.
        hr = MFCreateVideoRendererActivate(hVideoWindow, &pActivate);

        // Instantiate the EVR media sink from the activate and try to add a
        // stream sink for this stream. Note: the identifier passed to
        // AddStreamSink must not collide with an existing stream sink; the
        // EVR's reference stream already uses identifier 0.
        IMFMediaSink *pVideoSink = NULL;
        HRESULT hrMS = pActivate->ActivateObject(IID_IMFMediaSink, (void**)&pVideoSink);
        if (SUCCEEDED(hrMS))
        {
            IMFStreamSink *pStreamSink = NULL;
            hrMS = pVideoSink->AddStreamSink(iStream, NULL, &pStreamSink);
            if (SUCCEEDED(hrMS))
            {
                DWORD dwID = 10;
                hrMS = pStreamSink->GetIdentifier(&dwID);
                if (SUCCEEDED(hrMS))
                {
                    printf("\n%d", dwID);
                }
                SafeRelease(&pStreamSink);
            }
            SafeRelease(&pVideoSink); // Was leaked in the original version.
        }
    }
    else
    {
        // Unknown stream type.
        hr = E_FAIL;
        // Optionally, you could deselect this stream instead of failing.
    }
    if (FAILED(hr))
    {
        goto done;
    }

    // Return the IMFActivate pointer to the caller.
    *ppActivate = pActivate;
    (*ppActivate)->AddRef();

done:
    SafeRelease(&pHandler);
    SafeRelease(&pActivate);
    return hr;
}
The problem is that I still cannot see two video streams in the video window.

Related

Implement 'IMFTransform' to encode or decode H264 or AAC

Can the IMFTransform interface be implemented to encode or decode H264 or AAC data, or should I use FFmpeg or OpenH264?
When you encode or decode media, IMFTransform is the interface codecs expose in the Media Foundation API. That is, you don't implement it; you take advantage of the existing codec implementations available to you (you would implement it yourself only to extend the API by supplying an additional codec).
Stock Windows provides you with:
AAC Decoder - CLSID_CMSAACDecMFT
AAC Encoder - CLSID_AACMFTEncoder
H.264 Video Decoder - CLSID_CMSH264DecoderMFT, leverages DXVA hardware-assisted decoding wherever applicable
H.264 Video Encoder - CLSID_CMSH264EncoderMFT, software (fallback) encoder
Additional hardware-accelerated encoders might be provided with hardware drivers. All of the above are available in the form of IMFTransform and can be consumed directly or through higher-level Media Foundation APIs.
You can use the IMFTransform interface to decode and encode H264 and AAC. Refer to CLSID_CMSH264DecoderMFT and CLSID_CMSAACDecMFT to decode H264 and AAC, and CLSID_CMSH264EncoderMFT and CLSID_AACMFTEncoder to encode H264 and AAC.
Encoder example: initialise the encoder.
IUnknown *_transformUnk;
IMFTransform *_encoder;

HRESULT MediaEncoder::InitialiseEncoder(EncoderType encoder)
{
    HRESULT hr = S_OK;

    // Has the encoder been initialised already?
    if (!_isOpen)
    {
        _encoderType = encoder;

        // Initialise COM.
        CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);

        // Create a new close event handle.
        _hCloseEvent = CreateEvent(NULL, FALSE, FALSE, NULL);

        // If the event was not created.
        if (_hCloseEvent == NULL)
        {
            // Get the error value.
            hr = __HRESULT_FROM_WIN32(GetLastError());
        }

        // If the close event was created successfully.
        if (SUCCEEDED(hr))
        {
            // Start up the Media Foundation platform.
            hr = MFStartup(MF_VERSION);
            _isOpen = true;
        }

        if (SUCCEEDED(hr))
        {
            // Select the encoder.
            switch (encoder)
            {
            case Nequeo::Media::Foundation::EncoderType::H264:
                // Create the H264 encoder.
                hr = CreateEncoder(CLSID_CMSH264EncoderMFT);
                break;
            case Nequeo::Media::Foundation::EncoderType::AAC:
                // Create the AAC encoder.
                hr = CreateEncoder(CLSID_AACMFTEncoder);
                break;
            case Nequeo::Media::Foundation::EncoderType::MP3:
                // Create the MP3 encoder.
                hr = CreateEncoder(CLSID_MP3ACMCodecWrapper);
                break;
            default:
                // Unknown encoder type.
                hr = E_FAIL;
                break;
            }
        }

        if (SUCCEEDED(hr))
        {
            // Query for the IMFTransform interface.
            hr = _transformUnk->QueryInterface(IID_PPV_ARGS(&_encoder));

            // The encoder has been created.
            _created = true;
        }
    }

    // Return the result.
    return hr;
}

HRESULT MediaEncoder::CreateEncoder(const CLSID encoder)
{
    HRESULT hr = S_OK;

    // Create the encoder transform.
    hr = CoCreateInstance(encoder, NULL, CLSCTX_INPROC_SERVER, IID_IUnknown, (void**)&_transformUnk);

    // Return the result.
    return hr;
}
Decoder example: initialise the decoder.
IUnknown *_transformUnk;
IMFTransform *_decoder;

HRESULT MediaDecoder::InitialiseDecoder(DecoderType decoder)
{
    HRESULT hr = S_OK;

    // Has the decoder been initialised already?
    if (!_isOpen)
    {
        _decoderType = decoder;

        // Initialise COM.
        CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);

        // Create a new close event handle.
        _hCloseEvent = CreateEvent(NULL, FALSE, FALSE, NULL);

        // If the event was not created.
        if (_hCloseEvent == NULL)
        {
            // Get the error value.
            hr = __HRESULT_FROM_WIN32(GetLastError());
        }

        // If the close event was created successfully.
        if (SUCCEEDED(hr))
        {
            // Start up the Media Foundation platform.
            hr = MFStartup(MF_VERSION);
            _isOpen = true;
        }

        if (SUCCEEDED(hr))
        {
            // Select the decoder.
            switch (decoder)
            {
            case Nequeo::Media::Foundation::DecoderType::H264:
                // Create the H264 decoder.
                hr = CreateDecoder(CLSID_CMSH264DecoderMFT);
                break;
            case Nequeo::Media::Foundation::DecoderType::AAC:
                // Create the AAC decoder.
                hr = CreateDecoder(CLSID_CMSAACDecMFT);
                break;
            case Nequeo::Media::Foundation::DecoderType::MP3:
                // Create the MP3 decoder.
                hr = CreateDecoder(CLSID_CMP3DecMediaObject);
                break;
            case Nequeo::Media::Foundation::DecoderType::MPEG4:
                // Create the MPEG4 decoder.
                hr = CreateDecoder(CLSID_CMpeg4sDecMFT);
                break;
            default:
                // Unknown decoder type.
                hr = E_FAIL;
                break;
            }
        }

        if (SUCCEEDED(hr))
        {
            // Query for the IMFTransform interface.
            hr = _transformUnk->QueryInterface(IID_PPV_ARGS(&_decoder));

            // The decoder has been created.
            _created = true;
        }
    }

    // Return the result.
    return hr;
}

HRESULT MediaDecoder::CreateDecoder(const CLSID decoder)
{
    HRESULT hr = S_OK;

    // Create the decoder transform.
    hr = CoCreateInstance(decoder, NULL, CLSCTX_INPROC_SERVER, IID_IUnknown, (void**)&_transformUnk);

    // Return the result.
    return hr;
}
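Once a transform is created and its input and output media types have been negotiated (SetInputType/SetOutputType), you push data through it with ProcessInput and pull results with ProcessOutput. The following is a minimal sketch of one pass through a synchronous MFT; the helper name ProcessSample is illustrative, not part of the code above, and it assumes the MFT does not provide its own output samples (otherwise the MFCreateMemoryBuffer allocation is unnecessary).
// Sketch: push one input sample through an already-configured MFT and
// try to drain one output sample. Stream IDs are assumed to be 0.
HRESULT ProcessSample(IMFTransform *pTransform, IMFSample *pInput, IMFSample **ppOutput)
{
    *ppOutput = NULL;

    // Feed the input sample to input stream 0.
    HRESULT hr = pTransform->ProcessInput(0, pInput, 0);
    if (FAILED(hr)) return hr;

    // Ask how large the output buffer must be.
    MFT_OUTPUT_STREAM_INFO streamInfo = {0};
    hr = pTransform->GetOutputStreamInfo(0, &streamInfo);
    if (FAILED(hr)) return hr;

    // Allocate a sample and a media buffer for the output.
    IMFSample *pSample = NULL;
    IMFMediaBuffer *pBuffer = NULL;
    hr = MFCreateSample(&pSample);
    if (SUCCEEDED(hr)) hr = MFCreateMemoryBuffer(streamInfo.cbSize, &pBuffer);
    if (SUCCEEDED(hr)) hr = pSample->AddBuffer(pBuffer);

    if (SUCCEEDED(hr))
    {
        MFT_OUTPUT_DATA_BUFFER output = {0};
        output.dwStreamID = 0;
        output.pSample = pSample;

        DWORD dwStatus = 0;
        hr = pTransform->ProcessOutput(0, 1, &output, &dwStatus);
        if (output.pEvents) output.pEvents->Release();
        if (SUCCEEDED(hr))
        {
            *ppOutput = pSample;   // Caller releases.
            (*ppOutput)->AddRef();
        }
        else if (hr == MF_E_TRANSFORM_NEED_MORE_INPUT)
        {
            // The MFT needs more input before producing output; in a
            // streaming loop this is not a failure.
            hr = S_OK;
        }
    }

    SafeRelease(&pBuffer);
    SafeRelease(&pSample);
    return hr;
}
In a real streaming loop you would call this repeatedly, and keep calling ProcessOutput until it returns MF_E_TRANSFORM_NEED_MORE_INPUT, since one input sample can yield more than one output sample.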

DirectShow Memory Leak

I'm experiencing a significant memory leak in a camera application that uses an EVR (Enhanced Video Renderer). The leak happens when I prepare, run, stop, and unprepare a graph several times (several MB per cycle, depending on the video resolution).
I simplified the code as much as I could; you can download it from here:
https://daptech.box.com/s/6csm91vhcawiw18u42kq
The graph has a video capture filter, a SampleGrabber filter and a renderer. When connecting, it creates an AVI Decompressor filter to complete the graph.
Here are some excerpts...
Declarations:
CComPtr<IGraphBuilder> pGraphBuilder;
CComPtr<IBaseFilter> pSource; // The capture device filter
CComPtr<IBaseFilter> pSampleGrabberFilter; // Filter to capture data that flows through the graph
CComPtr<IBaseFilter> pRenderer; // EVR
Graph preparation (validation removed):
hr = m_p->pGraphBuilder.CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC);
hr = m_p->pSampleGrabberFilter.CoCreateInstance(CLSID_SampleGrabber, NULL, CLSCTX_INPROC);
CComQIPtr<ISampleGrabber, &IID_ISampleGrabber> pSampleGrabber(m_p->pSampleGrabberFilter);
hr = pSampleGrabber->SetBufferSamples(FALSE); // Don't copy image data for better performances
// Setup the callback
hr = pSampleGrabber->SetCallback(&m_p->cb, 1); // 0: SampleCB() is called, so that we directly access the buffer. 1: BufferCB() is called. The doc states that the function always makes a copy of the data.
CComPtr<IPin> pSampleGrabberInputPin( FindPin(m_p->pSampleGrabberFilter, PINDIR_INPUT, NULL) );
CComPtr<IPin> pSampleGrabberOutputPin( FindPin(m_p->pSampleGrabberFilter, PINDIR_OUTPUT, NULL) );
hr = m_p->pGraphBuilder->AddFilter(m_p->pSampleGrabberFilter, L"SampleGrabber");
hr = FindVideoCaptureDevice(m_p->pSource, (unsigned)iCameraIndex);
CComPtr<IPin> pSourceOutputPin( FindPin(m_p->pSource, PINDIR_OUTPUT, L"CAPTURE") );
hr = m_p->pGraphBuilder->AddFilter(m_p->pSource, L"Camera");
hr = m_p->pRenderer.CoCreateInstance(CLSID_EnhancedVideoRenderer, NULL, CLSCTX_INPROC);
hr = m_p->pGraphBuilder->AddFilter(m_p->pRenderer, L"Renderer");
hr = ConfigEVR(m_p->pRenderer, m_staticPreview.m_hWnd, uWidth, uHeight);
CComPtr<IPin> pRendererInputPin( FindPin(m_p->pRenderer, PINDIR_INPUT, NULL) );
// Now select the format:
if (!SelectFormat(pSourceOutputPin, uWidth, uHeight, bCompression))
{ MessageBox(L"No appropriate format found.\n"); return; }
// Connection:
hr = m_p->pGraphBuilder->Connect(pSourceOutputPin, pSampleGrabberInputPin);
hr = m_p->pGraphBuilder->Connect(pSampleGrabberOutputPin, pRendererInputPin);
This function is called to "unprepare" the graph:
void FreeDirectShowGraph()
{
    if (pSource) DisconnectDownstream(pSource);
    RemoveAndRelease(pRenderer);
    RemoveAndRelease(pSampleGrabberFilter);
    RemoveAndRelease(pSource);
    if (pGraphBuilder) pGraphBuilder.Release();
}
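(For reference, RemoveAndRelease is not shown in the post; a helper of this shape is assumed. The sketch below is an assumption about the elided helper, not the poster's actual code.)
// Hypothetical sketch of the elided RemoveAndRelease helper: remove the
// filter from the graph, then drop our own reference to it.
void RemoveAndRelease(CComPtr<IBaseFilter> &pFilter)
{
    if (pFilter)
    {
        if (pGraphBuilder)
            pGraphBuilder->RemoveFilter(pFilter); // Graph drops its reference.
        pFilter.Release();                        // Drop our reference.
    }
}
If the real helper releases the CComPtr without removing the filter from the graph first, the graph keeps the filter alive, which is one plausible source of a per-cycle leak.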
Any clue would be appreciated!

Capture preview to Enhanced video renderer

I am trying to render a preview from a capture card (720p, fed by a PS3) to the Enhanced Video Renderer.
Ideally, I would like something like this: (image from the original post omitted)
I used to do this:
hr = m_pCapture->RenderStream (&PIN_CATEGORY_CAPTURE, &MEDIATYPE_Video, m_pSrcFilter, NULL, NULL);
But I find that it only renders to an old default renderer, which is not adequate for stretching the image to 1080p (the image becomes pixelated). [http://msdn.microsoft.com/en-us/library/aa930715.aspx]
I want to use the Enhanced Video Renderer as the sink, but I have no idea how. I viewed the tutorials here: http://msdn.microsoft.com/en-us/library/windows/desktop/ff625867%28v=vs.85%29.aspx
and tried to put my code in, but it would not render.
Here is a snippet of the code that sets the source. Assume that setResolution sets the AM_MEDIA_TYPE format and that getVideoSourceByKeyword gets the AVerMedia capture card device.
HRESULT DShowPlayer::SetPreviewDevice(PCWSTR keyname)
{
    IBaseFilter *pSource = NULL;

    // Create a new filter graph. (This also closes the old one, if any.)
    HRESULT hr = CoCreateInstance(CLSID_CaptureGraphBuilder2, NULL,
        CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&m_pCapture));
    if (FAILED(hr))
    {
        goto done;
    }

    hr = InitializeGraph();
    if (FAILED(hr))
    {
        goto done;
    }

    // Add the source filter to the graph.
    hr = getVideoSourceByKeyword(keyname, &pSource);
    if (FAILED(hr))
    {
        goto done;
    }

    hr = m_pGraph->AddFilter(pSource, L"Source filter");
    if (FAILED(hr))
    {
        goto done;
    }

    setResolution(pSource, 1280, 720);

    // Try to render the streams.
    hr = RenderStreams(pSource);
    if (FAILED(hr))
    {
        goto done;
    }

    hr = m_pControl->Run();

done:
    if (FAILED(hr))
    {
        TearDownGraph();
    }
    SafeRelease(&pSource);
    return hr;
}
When the code runs RenderStreams, this is the code (from http://msdn.microsoft.com/en-us/library/windows/desktop/ff625878%28v=vs.85%29.aspx):
// Enumerate the pins on the source filter.
hr = pSource->EnumPins(&pEnum);
if (FAILED(hr))
{
    goto done;
}

// Loop through all the pins.
IPin *pPin;
while (S_OK == pEnum->Next(1, &pPin, NULL))
{
    PIN_INFO pInfo;
    pPin->QueryPinInfo(&pInfo);

    // Try to render this pin.
    // It's OK if some pins fail, as long as at least one pin renders.
    HRESULT hr2 = pGraph2->RenderEx(pPin, AM_RENDEREX_RENDERTOEXISTINGRENDERERS, NULL);
    pPin->Release();
    if (SUCCEEDED(hr2))
    {
        bRenderedAnyPin = TRUE;
    }
}
In Visual Studio I debugged at the pin to get the source pin name ("Capture", the pin name on the AVerMedia capture card). RenderEx reported that it attached to the renderer successfully; however, at
hr = m_pControl->Run();
it fails, and the error is that the device is not connected.
I also tried getting the EVR renderer directly and rendering the stream:
IBaseFilter* render;
m_pVideo->getRender(&render);
m_pGraph->AddFilter(render, L"EVR Filter");
hr = m_pCapture->RenderStream(&PIN_CATEGORY_CAPTURE, &MEDIATYPE_Video, pSource, NULL, render);
if (FAILED(hr))
{
    goto done;
}
But it fails with VFW_E_NOT_IN_GRAPH.
What I am asking: I am still pretty new to DirectShow, and I would like to be able to preview the capture card with the EVR. I have found no comprehensive tutorials or source code for this. If you need any more information, I can add more.
Thanks in advance.
EVR can be used programmatically in much the same way as VMR-7/9. The only difference is that the EVR needs "windowless" mode, while the earlier renderers also supported "windowed" mode, where only minimal initialization of the renderer is needed.
I suppose you can see video on the EVR in GraphEdit? You should be able to; just use the Preview pin, not Capture. Or connect Capture through a Smart Tee filter and its preview output.
The error codes suggest that you are not building the graph correctly. In particular, VFW_E_NOT_IN_GRAPH says your filter is not in the graph and is hence an invalid argument. You don't need getRender; just CoCreateInstance the EVR the usual, straightforward way. The moment you first get an error, put everything on hold and review the filter graph topology you have at that point.
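A minimal sketch of that "usual way", assuming an existing graph pGraph (IGraphBuilder*) and a preview window handle hwndPreview; both names are placeholders, not from the question:
// Sketch: create the EVR, add it to the graph, and attach it to a window.
CComPtr<IBaseFilter> pEvr;
HRESULT hr = pEvr.CoCreateInstance(CLSID_EnhancedVideoRenderer, NULL, CLSCTX_INPROC_SERVER);
if (SUCCEEDED(hr))
    hr = pGraph->AddFilter(pEvr, L"EVR");

// The EVR is configured through IMFVideoDisplayControl, obtained via IMFGetService.
CComPtr<IMFGetService> pGetService;
CComPtr<IMFVideoDisplayControl> pDisplay;
if (SUCCEEDED(hr))
    hr = pEvr->QueryInterface(IID_PPV_ARGS(&pGetService));
if (SUCCEEDED(hr))
    hr = pGetService->GetService(MR_VIDEO_RENDER_SERVICE, IID_PPV_ARGS(&pDisplay));
if (SUCCEEDED(hr))
    hr = pDisplay->SetVideoWindow(hwndPreview);
// Use IMFVideoDisplayControl::SetVideoPosition to set the destination
// rectangle, e.g. to stretch the 720p image to fill the window.

// Only after the EVR is in the graph, connect or render the capture pin to
// it, e.g. via ICaptureGraphBuilder2::RenderStream with pEvr as the sink.
With this in place, the RenderStream call from the question should no longer fail with VFW_E_NOT_IN_GRAPH, because the sink filter has already been added to the same graph.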
The Windows SDK samples contain \Samples\multimedia\directshow\vmr9\windowless, which shows VMR-9 in windowless mode; this is probably the closest starting point, requiring just a switch from VMR-9 to the EVR.

XAudio2 - playing only a part of the audio

I am trying to play mp3/wma using XAudio2. I managed to use the Media Foundation Source Reader object to do the decoding. My problem is that it does not play the full audio; only part of the audio is played.
What I am doing is: get the next sample from the IMFSourceReader and submit it as the next buffer of the source voice. This is repeated until all the data is read from the IMFSourceReader.
while (true)
{
    DWORD dwFlags = 0;

    // Read the next sample.
    hr = pReader->ReadSample(
        (DWORD)MF_SOURCE_READER_FIRST_AUDIO_STREAM,
        0, NULL, &dwFlags, NULL, &pSample );
    if (FAILED(hr)) { break; } // Was unchecked in the original version.

    if (dwFlags & MF_SOURCE_READERF_CURRENTMEDIATYPECHANGED)
    {
        printf("Type change - not supported by WAVE file format.\n");
        break;
    }
    if (dwFlags & MF_SOURCE_READERF_ENDOFSTREAM)
    {
        printf("End of input file.\n");
        break;
    }
    if (pSample == NULL)
    {
        printf("No sample\n");
        continue;
    }

    // Get a pointer to the audio data in the sample.
    hr = pSample->ConvertToContiguousBuffer(&pBuffer);
    if (FAILED(hr)) { break; }

    hr = pBuffer->Lock(&pAudioData, NULL, &cbBuffer);
    if (FAILED(hr)) { break; }

    // Make sure not to exceed the specified maximum size.
    if (cbMaxAudioData - cbAudioData < cbBuffer)
    {
        cbBuffer = cbMaxAudioData - cbAudioData;
    }

    // Write this data to the output file.
    hr = WriteToFile(hFile, pAudioData, cbBuffer);
    int audioBufferLength = cbBuffer;
    if (FAILED(hr)) { break; }

    SubmitBuffer(pAudioData, audioBufferLength);

    // Unlock the buffer.
    hr = pBuffer->Unlock();
    pAudioData = NULL;
    if (FAILED(hr)) { break; }

    // Update the running total of audio data.
    cbAudioData += cbBuffer;
    if (cbAudioData >= cbMaxAudioData)
    {
        break;
    }

    SafeRelease(&pSample);
    SafeRelease(&pBuffer);
}
void AudioDecoder::SubmitBuffer(byte *pAudioData, int audioBufferLength)
{
    // Copy the data, because the source buffer is unlocked after this call.
    byte *pAudioBuffer = new byte[audioBufferLength];
    CopyMemory(pAudioBuffer, pAudioData, audioBufferLength);

    // Create an XAUDIO2_BUFFER for submitting the audio data.
    XAUDIO2_BUFFER buffer = {0};
    buffer.AudioBytes = audioBufferLength;
    buffer.pAudioData = pAudioBuffer;
    buffer.pContext = pAudioBuffer; // So the copy can be freed in OnBufferEnd.
    HRESULT hresult = m_pSourceVoice->SubmitSourceBuffer(&buffer);
}
After this I call m_pSourceVoice->Start(). This starts the audio playing, but it does not play the full audio. Do I need to add anything else?
This loop doesn't look like it accounts for whether any buffers have completed before submitting more, so it could be running into the XAUDIO2_MAX_QUEUED_BUFFERS limit. Can you add a counter to your while loop to see how many buffers are submitted to the source voice?
If you've hit the limit, you could start playback before fully decoding the file and submit the additional buffers via source voice callbacks.
http://msdn.microsoft.com/en-us/library/windows/desktop/ee415769(v=vs.85).aspx
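A minimal sketch of such a callback, assuming buffers are allocated with new[] and passed through pContext as in the question's SubmitBuffer; the class name and event member are illustrative:
// Sketch: free each buffer when XAudio2 is done with it, and signal the
// decode loop that a queue slot has opened up.
class VoiceCallback : public IXAudio2VoiceCallback
{
public:
    HANDLE hBufferEndEvent;

    VoiceCallback() : hBufferEndEvent(CreateEvent(NULL, FALSE, FALSE, NULL)) {}
    ~VoiceCallback() { CloseHandle(hBufferEndEvent); }

    // Called when the voice finishes a buffer: free the copy made in
    // SubmitBuffer (passed via pContext) and wake the decode loop.
    void STDMETHODCALLTYPE OnBufferEnd(void *pBufferContext)
    {
        delete[] (byte*)pBufferContext;
        SetEvent(hBufferEndEvent);
    }

    // The remaining callbacks are not needed for this purpose.
    void STDMETHODCALLTYPE OnStreamEnd() {}
    void STDMETHODCALLTYPE OnVoiceProcessingPassEnd() {}
    void STDMETHODCALLTYPE OnVoiceProcessingPassStart(UINT32) {}
    void STDMETHODCALLTYPE OnBufferStart(void*) {}
    void STDMETHODCALLTYPE OnLoopEnd(void*) {}
    void STDMETHODCALLTYPE OnVoiceError(void*, HRESULT) {}
};
Pass the callback when creating the voice (the fifth parameter of IXAudio2::CreateSourceVoice), and have the decode loop wait on hBufferEndEvent whenever IXAudio2SourceVoice::GetState reports BuffersQueued at the limit.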

Windows Media Foundation recording audio

I'm using the Windows Media Foundation API to enumerate both my microphones and available cameras, which both work.
Here is my enumeration code:
class deviceInput {
public:
    deviceInput( REFGUID source );
    ~deviceInput();

    int listDevices(bool refresh = false);
    IMFActivate *getDevice(unsigned int deviceId);
    const WCHAR *getDeviceName(unsigned int deviceId);

private:
    void Clear();
    HRESULT EnumerateDevices();

    UINT32 m_count;
    IMFActivate **m_devices;
    REFGUID m_source;
};

deviceInput::deviceInput( REFGUID source )
    : m_devices( NULL )
    , m_count( 0 )
    , m_source( source )
{ }

deviceInput::~deviceInput()
{
    Clear();
}

int deviceInput::listDevices(bool refresh)
{
    if ( refresh || !m_devices ) {
        if ( FAILED(this->EnumerateDevices()) ) return -1;
    }
    return m_count;
}

IMFActivate *deviceInput::getDevice(unsigned int deviceId)
{
    if ( deviceId >= m_count ) return NULL;

    IMFActivate *device = m_devices[deviceId];
    device->AddRef();
    return device;
}

const WCHAR *deviceInput::getDeviceName(unsigned int deviceId)
{
    if ( deviceId >= m_count ) return NULL;

    HRESULT hr = S_OK;
    WCHAR *devName = NULL;
    UINT32 length;

    // Note: the returned string is allocated by Media Foundation;
    // the caller must free it with CoTaskMemFree.
    hr = m_devices[deviceId]->GetAllocatedString( MF_DEVSOURCE_ATTRIBUTE_FRIENDLY_NAME, &devName, &length );
    if ( FAILED(hr) ) return NULL;

    return devName;
}

void deviceInput::Clear()
{
    if ( m_devices ) {
        for (UINT32 i = 0; i < m_count; i++) SafeRelease( &m_devices[i] );
        CoTaskMemFree( m_devices );
    }
    m_devices = NULL;
    m_count = 0;
}

HRESULT deviceInput::EnumerateDevices()
{
    HRESULT hr = S_OK;
    IMFAttributes *pAttributes = NULL;

    Clear();

    hr = MFCreateAttributes(&pAttributes, 1);
    if ( SUCCEEDED(hr) ) hr = pAttributes->SetGUID( MF_DEVSOURCE_ATTRIBUTE_SOURCE_TYPE, m_source );
    if ( SUCCEEDED(hr) ) hr = MFEnumDeviceSources( pAttributes, &m_devices, &m_count );

    SafeRelease( &pAttributes );
    return hr;
}
To grab audio or camera capture devices, I specify either MF_DEVSOURCE_ATTRIBUTE_SOURCE_TYPE_AUDCAP_GUID or MF_DEVSOURCE_ATTRIBUTE_SOURCE_TYPE_VIDCAP_GUID, and that works no problem: I can grab the names of the devices as well as the IMFActivate. I have code to record the webcam to an output video file; however, I'm having a tough time figuring out how to record the audio to a file. I'm under the impression that I need to use an IMFSinkWriter, but I can't find any examples that use an audio-capture IMFActivate with an IMFSinkWriter.
I'm not much of a Windows API programmer, so I'm sure there's a fairly straightforward answer, but the COM stuff is just a bit over my head. As far as the audio format goes, I don't really care, as long as it gets into a file; it can be WAV, WMA, or whatever. Even though I'm recording video, I need the video and audio files to be separate, so I can't simply add the audio into my video encoding.
I apologize for the late response, and I hope you can still find this valuable. I recently completed a project similar to yours (recording webcam video along with a selected microphone to a single video file with audio). The key is creating an aggregate media source.
// http://msdn.microsoft.com/en-us/library/windows/desktop/dd388085(v=vs.85).aspx
HRESULT CreateAggregateMediaSource(IMFMediaSource *videoSource,
                                   IMFMediaSource *audioSource,
                                   IMFMediaSource **aggregateSource)
{
    *aggregateSource = nullptr;
    IMFCollection *pCollection = nullptr;

    HRESULT hr = ::MFCreateCollection(&pCollection);

    if (S_OK == hr)
        hr = pCollection->AddElement(videoSource);

    if (S_OK == hr)
        hr = pCollection->AddElement(audioSource);

    if (S_OK == hr)
        hr = ::MFCreateAggregateSource(pCollection, aggregateSource);

    SafeRelease(&pCollection);
    return hr;
}
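For instance, given two device activates from enumeration code like the question's (the pVideoActivate/pAudioActivate names below are placeholders), the sources might be obtained like this; a sketch with error handling trimmed:
// Sketch: turn the chosen device activates into media sources and aggregate them.
IMFMediaSource *pVideoSource = nullptr;
IMFMediaSource *pAudioSource = nullptr;
IMFMediaSource *pAggregateSource = nullptr;

HRESULT hr = pVideoActivate->ActivateObject(IID_PPV_ARGS(&pVideoSource));
if (S_OK == hr)
    hr = pAudioActivate->ActivateObject(IID_PPV_ARGS(&pAudioSource));
if (S_OK == hr)
    hr = CreateAggregateMediaSource(pVideoSource, pAudioSource, &pAggregateSource);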
When configuring the sink writer, you will add two streams (one for audio and one for video).
Of course, you will also configure the writer correctly for the input stream types.
HRESULT hr = S_OK;
IMFMediaType *videoInputType = nullptr;
IMFMediaType *videoOutputType = nullptr;
DWORD videoOutStreamIndex = 0u;
DWORD audioOutStreamIndex = 0u;
IMFSinkWriter *writer = nullptr;

// [other create and configure writer]

if (S_OK == hr)
    hr = writer->AddStream(videoOutputType, &videoOutStreamIndex);

// [more configuration code]

if (S_OK == hr)
    hr = writer->AddStream(audioOutputType, &audioOutStreamIndex);
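For the audio stream, the output media type might be configured along the following lines. This is a sketch: the AAC parameters shown (44.1 kHz, stereo, 16 bits, 16000 bytes/sec) are illustrative values accepted by the stock AAC encoder, not settings from the original answer.
// Sketch: an AAC output type for the sink writer's audio stream.
IMFMediaType *audioOutputType = nullptr;
hr = MFCreateMediaType(&audioOutputType);
if (S_OK == hr)
    hr = audioOutputType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Audio);
if (S_OK == hr)
    hr = audioOutputType->SetGUID(MF_MT_SUBTYPE, MFAudioFormat_AAC);
if (S_OK == hr)
    hr = audioOutputType->SetUINT32(MF_MT_AUDIO_SAMPLES_PER_SECOND, 44100);
if (S_OK == hr)
    hr = audioOutputType->SetUINT32(MF_MT_AUDIO_NUM_CHANNELS, 2);
if (S_OK == hr)
    hr = audioOutputType->SetUINT32(MF_MT_AUDIO_BITS_PER_SAMPLE, 16);
if (S_OK == hr)
    hr = audioOutputType->SetUINT32(MF_MT_AUDIO_AVG_BYTES_PER_SECOND, 16000);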
Then, when reading the samples, you will need to pay close attention to the reader's streamIndex and send each sample to the corresponding writer stream. You will also need to pay close attention to the format the codec expects (for instance, IEEE float vs. PCM, etc.). Good luck, and I hope it is not too late.
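A sketch of that routing, assuming a source reader created over the aggregate source; reader, writer, and the stream index variables are placeholders tied to the snippets above:
// Sketch: pull samples from the aggregate source's reader and route each one
// to the matching sink writer stream. videoStreamIndex is the reader-side
// index of the video stream; the *OutStreamIndex values come from AddStream.
for (;;)
{
    DWORD streamIndex = 0, flags = 0;
    LONGLONG timestamp = 0;
    IMFSample *pSample = nullptr;

    HRESULT hr = reader->ReadSample(MF_SOURCE_READER_ANY_STREAM,
                                    0, &streamIndex, &flags, &timestamp, &pSample);
    // For brevity this stops at the first end-of-stream; a complete loop
    // would track end-of-stream per stream.
    if (FAILED(hr) || (flags & MF_SOURCE_READERF_ENDOFSTREAM))
    {
        SafeRelease(&pSample);
        break;
    }

    if (pSample)
    {
        // Route the sample to the writer stream matching its source stream.
        DWORD writerStream = (streamIndex == videoStreamIndex)
                           ? videoOutStreamIndex : audioOutStreamIndex;
        hr = writer->WriteSample(writerStream, pSample);
        SafeRelease(&pSample);
        if (FAILED(hr)) break;
    }
}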
Did you have a hard time managing DirectShow audio capture in Record directshow audio device to file?
Capturing with Media Foundation is hardly any simpler, not to mention that, in general, there are far more resources on DirectShow out there...
MSDN offers a WavSink Sample that implements audio capture into a file:
Shows how to implement a custom media sink in Microsoft Media Foundation. The sample implements an archive sink that writes uncompressed PCM audio to a .wav file.
I am not sure why they decided not to make this a standard component. Given that Media Foundation is inferior to DirectShow in many ways, they could at least have turned this small thing into an advantage. Anyway, you have the sample, and it looks like a good starting point.