I'm trying to write an app that captures a video stream of the screen and sends it to a remote client. I've found that the best way to capture the screen on Windows is the DXGI Desktop Duplication API (available since Windows 8). Microsoft provides a neat sample that streams the duplicated frames to the screen. Now I'm wondering what the easiest, but still reasonably fast, way is to encode those frames and send them over the network.
Each frame comes from AcquireNextFrame as a surface containing the desktop bitmap, plus metadata listing the dirty and move regions that were updated. From here, I have a couple of options:
Extract a bitmap from the DirectX surface, then use an external library like ffmpeg to encode the series of bitmaps to H.264 and send it over RTSP. While straightforward, I fear this method will be too slow because it doesn't take advantage of any native Windows facilities. Converting a D3D texture to an ffmpeg-compatible bitmap seems like unnecessary work.
From this answer: convert the D3D texture to an IMFSample and use Media Foundation's SinkWriter to encode the frame. I found this tutorial on video encoding, but I haven't yet found a way to get each encoded frame immediately and send it, instead of dumping all of them to a video file.
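For the first option, the CPU-side step is usually: copy the acquired texture into a staging texture, call ID3D11DeviceContext::Map, and read the pixels through the returned D3D11_MAPPED_SUBRESOURCE. The catch is that its RowPitch can be larger than width * 4, so rows must be repacked before handing them to an encoder that expects tightly packed BGRA. The sketch below shows only that repacking step; MappedFrame is a hypothetical stand-in for the mapped subresource (pData/RowPitch), and the 4-bytes-per-pixel BGRA layout is an assumption matching the usual Desktop Duplication format.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for D3D11_MAPPED_SUBRESOURCE: in a real capture loop
// these would be pData and RowPitch after Map() on a staging texture.
struct MappedFrame {
    const std::uint8_t* data;
    std::size_t rowPitch;  // bytes per row in mapped memory, may include padding
};

// Copy a BGRA image (4 bytes per pixel) into a tightly packed buffer,
// dropping any per-row padding so an encoder can consume it directly.
std::vector<std::uint8_t> packBgra(const MappedFrame& src, std::size_t width, std::size_t height) {
    const std::size_t packedRow = width * 4;
    std::vector<std::uint8_t> out(packedRow * height);
    for (std::size_t y = 0; y < height; ++y) {
        std::memcpy(out.data() + y * packedRow, src.data + y * src.rowPitch, packedRow);
    }
    return out;
}
```

Copying row by row (rather than one memcpy of the whole buffer) is what makes this safe when RowPitch differs from width * 4.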
Since I haven't done anything like this before, I'm asking whether I'm moving in the right direction. In the end, I want a simple, preferably low-latency desktop capture video stream that I can view from a remote device.
Also, I'm wondering whether I can make use of the dirty and move regions provided by Desktop Duplication. Instead of encoding the frame, I could send them over the network and do the processing on the client side, but this means my client would need DirectX 11.1 or higher, which rules out streaming to a mobile platform.
You can use the IMFTransform interface (the H.264 encoder MFT) for H.264 encoding. Once you have an IMFSample wrapping your ID3D11Texture2D, just pass it to IMFTransform::ProcessInput and get the encoded IMFSample from IMFTransform::ProcessOutput.
Refer to this example for encoding details.
Once you have the encoded IMFSamples, you can send them over the network one by one.
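Sending samples "one by one" over TCP needs some framing, because TCP is a byte stream and the receiver has to know where each sample ends. A minimal sketch, assuming a hypothetical custom protocol (not RTSP/RTP): each sample's bytes (for example obtained via IMFSample::ConvertToContiguousBuffer and IMFMediaBuffer::Lock) are prefixed with a 4-byte big-endian length. The appendFrame and parseFrames names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wire format: [4-byte big-endian length][payload],
// where the payload is the bytes of one encoded sample.
void appendFrame(std::vector<std::uint8_t>& wire, const std::vector<std::uint8_t>& payload) {
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    wire.push_back(static_cast<std::uint8_t>(n >> 24));
    wire.push_back(static_cast<std::uint8_t>(n >> 16));
    wire.push_back(static_cast<std::uint8_t>(n >> 8));
    wire.push_back(static_cast<std::uint8_t>(n));
    wire.insert(wire.end(), payload.begin(), payload.end());
}

// Receiver side: split received bytes back into complete frames.
// Returns how many bytes were consumed; an incomplete trailing frame is
// left in place for the next call, once more data has arrived.
std::size_t parseFrames(const std::vector<std::uint8_t>& wire,
                        std::vector<std::vector<std::uint8_t>>& frames) {
    std::size_t pos = 0;
    while (wire.size() - pos >= 4) {
        const std::uint32_t n = (std::uint32_t(wire[pos]) << 24) | (std::uint32_t(wire[pos + 1]) << 16) |
                                (std::uint32_t(wire[pos + 2]) << 8) | std::uint32_t(wire[pos + 3]);
        if (wire.size() - pos - 4 < n) break;  // wait for the rest of this frame
        const auto first = wire.begin() + static_cast<std::ptrdiff_t>(pos + 4);
        frames.emplace_back(first, first + static_cast<std::ptrdiff_t>(n));
        pos += 4 + n;
    }
    return pos;
}
```

For low latency, the receiver can hand each complete frame to its decoder as soon as parseFrames returns it.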
What I am trying to do is record the screen on Windows XP and Windows 7. I get the bitmap using DirectX's CreateOffscreenPlainSurface and GetFrontBufferData. I need to encode the bitmap into an H.264 video. The problem is that the captured bitmap is in D3DFMT_A8R8G8B8 format, but the H.264 Video Encoder only supports MFVideoFormat_I420, MFVideoFormat_IYUV, MFVideoFormat_NV12, MFVideoFormat_YUY2 and MFVideoFormat_YV12 as input. Do I need to convert the format myself (I'd rather not)? Are there any better solutions?
The input format corresponds to MFVideoFormat_ARGB32.
The stock OS component that handles the conversion is the Video Processor MFT. I don't see availability information in the footer of the MSDN article; however, I am under the impression that this MFT comes with Windows Vista, just like the rest of the Media Foundation API.
On Windows XP there has been a similar component, the Color Converter DSP, which offers very similar services and exposes a similar DirectX Media Object (DMO) interface. It is also available on all more recent operating systems; however, it is software-only and never uses the GPU for the conversion.
Both of these can handle the requested format conversion for you.
Also, for reference, the H.264 Video Encoder was introduced only in Windows 7.
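To make clear what the Video Processor MFT or Color Converter DSP does for you, here is a CPU-only sketch of an ARGB32 to NV12 conversion. MFVideoFormat_ARGB32 is stored in memory as B, G, R, A bytes; NV12 is a full-resolution Y plane followed by an interleaved U/V plane at half resolution. The BT.601 limited-range integer coefficients, simple 2x2 chroma averaging, tightly packed rows and even width/height are all assumptions of this sketch; the OS components may use different matrices, rounding or filtering.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// BT.601 limited-range integer approximation (an assumption, see above).
static std::uint8_t toY(int r, int g, int b) {
    return static_cast<std::uint8_t>(((66 * r + 129 * g + 25 * b + 128) >> 8) + 16);
}
// 32896 = (128 << 8) + 128: adds the chroma offset and keeps the sum non-negative.
static std::uint8_t toU(int r, int g, int b) {
    return static_cast<std::uint8_t>((-38 * r - 74 * g + 112 * b + 32896) >> 8);
}
static std::uint8_t toV(int r, int g, int b) {
    return static_cast<std::uint8_t>((112 * r - 94 * g - 18 * b + 32896) >> 8);
}

// Convert tightly packed ARGB32 (memory order B, G, R, A) to NV12.
// Width and height must be even.
std::vector<std::uint8_t> argb32ToNv12(const std::uint8_t* src, std::size_t w, std::size_t h) {
    std::vector<std::uint8_t> out(w * h + w * h / 2);
    std::uint8_t* yPlane = out.data();
    std::uint8_t* uvPlane = out.data() + w * h;
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x) {
            const std::uint8_t* p = src + (y * w + x) * 4;
            yPlane[y * w + x] = toY(p[2], p[1], p[0]);
        }
    for (std::size_t y = 0; y < h; y += 2)
        for (std::size_t x = 0; x < w; x += 2) {
            int r = 0, g = 0, b = 0;
            for (std::size_t dy = 0; dy < 2; ++dy)
                for (std::size_t dx = 0; dx < 2; ++dx) {
                    const std::uint8_t* p = src + ((y + dy) * w + (x + dx)) * 4;
                    b += p[0]; g += p[1]; r += p[2];
                }
            r /= 4; g /= 4; b /= 4;  // average the 2x2 block
            std::uint8_t* uv = uvPlane + (y / 2) * w + x;  // one U,V pair per 2x2 block
            uv[0] = toU(r, g, b);
            uv[1] = toV(r, g, b);
        }
    return out;
}
```

Doing this by hand costs CPU time on every frame, which is why letting the Video Processor MFT (possibly GPU-accelerated) do it is usually preferable.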
I'd like to decode the contents of a video file to a Direct3D11 texture and avoid the copies back and forth to CPU memory. Ideally, the library will play the audio itself and call back into my code whenever a video frame has been decoded.
On the surface, Windows Media Foundation's IMFPMediaPlayer (i.e., MFPCreateMediaPlayer() and IMFPMediaPlayer::CreateMediaItemFromURL()) seems like a good match, except that the player decodes straight to the app's HWND. The documentation implies that I can add a custom video sink, but I have not been able to find documentation or sample code on how to do that. Please point me in the right direction.
Currently, I am using libVLC to do this, but it only provides the video surface in CPU memory, which can become a bottleneck for my use case.
Take a look at MFVideoEVR, a program in my 'Stackoverflow' project.
This program shows how to set up the EVR (Enhanced Video Renderer) and how to feed it video samples using a Source Reader.
The key point is that you provide the video samples yourself, so you can also use them for your own purposes.
This program provides samples through IMFVideoSampleAllocator, which works with Direct3D 9 surfaces. You need to change the source code to use IMFVideoSampleAllocatorEx instead.
About MFCreateVideoSampleAllocatorEx:
This function creates an allocator for DXGI video surfaces. The buffers created by this allocator expose the IMFDXGIBuffer interface.
So, to retrieve the texture, use IMFDXGIBuffer::GetResource:
You can use this method to get a pointer to the ID3D11Texture2D interface of the surface. If the buffer is locked, the method returns MF_E_INVALIDREQUEST.
You will also have to handle the audio yourself through IMFSourceReader.
With this approach, there is no copy back to system memory.
PS: You don't mention the video format (H.265, H.264, MPEG-2, others?). Media Foundation doesn't natively handle all video formats.
I'm developing a desktop application that streams from a USB camera using the Media Foundation Source Reader. The camera supports USB 3.0 and delivers 60 fps at 1080p in MJPG format.
I used the software MJPEG Decoder MFT to convert MJPG to YUY2 frames and then converted those to RGB32 frames to draw on the window. Instead of 60 fps, I can render only 30 fps on the window when using this software decoder. I posted a question on this site and got a suggestion to use the Intel Hardware MJPEG Decoder MFT to solve the frame drop issue.
I get error 0xC00D36B5 (MF_E_NOTACCEPTING) when calling the IMFTransform::ProcessInput() method. To solve this, MSDN suggests using the IMFTransform interface asynchronously, so I used the IMFMediaEventGenerator interface to call GetEvent for every input/output sample. I can successfully process only one input sample; after that, IMFMediaEventGenerator::GetEvent() keeps returning MF_E_NO_EVENTS_AVAILABLE (I call GetEvent() synchronously).
I also tried to configure an asynchronous callback for the Source Reader as well as for the IMFTransform, but the IMFAsyncCallback::Invoke method is never called, so I decided to use the GetEvent method instead.
Am I missing anything? If so, could someone guide me on using the Intel hardware decoder in my project?
The Intel Hardware MJPEG Decoder MFT is an asynchronous MFT, and if you are managing it directly, you are responsible for applying the asynchronous model. You seem to be doing this, but you don't provide information that would allow nailing the problem down. Yes, you are supposed to use the event model described in the ProcessInput and ProcessOutput sections of the article linked above. Since you already get the first frame, you should debug further to get smooth, continuous processing working.
When you use APIs like the Media Session or the Source Reader, Media Foundation itself deals with the MFTs and is capable of synchronous or asynchronous consumption as appropriate. In that case you don't make IMFTransform calls yourself. In your case, however, even from your vague description it looks like you are doing it the wrong way.
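The event model for an asynchronous MFT boils down to: call ProcessInput only in response to a METransformNeedInput event, call ProcessOutput only in response to METransformHaveOutput, and keep pulling events in a loop. In the real API, MF_E_NO_EVENTS_AVAILABLE is what GetEvent returns when called with MF_EVENT_FLAG_NO_WAIT on an empty queue, and the MFT must first be unlocked with MF_TRANSFORM_ASYNC_UNLOCK and sent MFT_MESSAGE_NOTIFY_START_OF_STREAM before it issues input requests (worth checking against the "Asynchronous MFTs" documentation). Since real MFT code only builds on Windows, the sketch below is a portable model: FakeAsyncMft and runClient are hypothetical, the event names only mirror Media Foundation's, and the "decoding" is a placeholder.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

enum class Event { NeedInput, HaveOutput };

// Hypothetical model of an asynchronous MFT; not the real IMFTransform API.
class FakeAsyncMft {
public:
    // Models MFT_MESSAGE_NOTIFY_START_OF_STREAM: no input request before this.
    void startOfStream() { requestInput(); }

    // Models GetEvent: returns false when no event is queued.
    bool getEvent(Event& e) {
        if (events_.empty()) return false;
        e = events_.front();
        events_.pop_front();
        return true;
    }

    // Models ProcessInput: rejected unless input was requested (cf. MF_E_NOTACCEPTING).
    bool processInput(int frame) {
        if (!wantsInput_) return false;
        wantsInput_ = false;
        ready_.push_back(frame * 2);  // placeholder for decoding
        events_.push_back(Event::HaveOutput);
        return true;
    }

    // Models ProcessOutput: hands back one result, then requests more input.
    bool processOutput(int& out) {
        if (ready_.empty()) return false;
        out = ready_.front();
        ready_.pop_front();
        requestInput();
        return true;
    }

private:
    void requestInput() {
        wantsInput_ = true;
        events_.push_back(Event::NeedInput);
    }
    std::deque<Event> events_;
    std::deque<int> ready_;
    bool wantsInput_ = false;
};

// Client loop: react to each event instead of calling ProcessInput blindly.
std::vector<int> runClient(FakeAsyncMft& mft, const std::vector<int>& inputs) {
    std::vector<int> outputs;
    std::size_t next = 0;
    mft.startOfStream();
    Event e = Event::NeedInput;
    while (mft.getEvent(e)) {
        if (e == Event::NeedInput) {
            if (next == inputs.size()) break;  // a real client would drain here
            mft.processInput(inputs[next++]);
        } else {
            int out = 0;
            if (mft.processOutput(out)) outputs.push_back(out);
        }
    }
    return outputs;
}
```

The point of the model is the control flow: input is pushed only when the transform asks for it, which is what avoids MF_E_NOTACCEPTING.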
I want to grab the video output of my Raspberry Pi and pass it to an Adalight-style ambient lighting system.
XBMC's player for the Pi, omxplayer, uses the OpenMAX API for decoding and other functions.
Looking into the code, I found the following:
m_omx_tunnel_sched.Initialize(&m_omx_sched, m_omx_sched.GetOutputPort(), &m_omx_render, m_omx_render.GetInputPort());
As far as I understand, this sets up a pipeline between the video scheduler and the renderer: [S]-->[R].
My idea is to write a grabber component and plug it into the pipeline: [S]-->[G]-->[R]. The grabber will extract the pixels from the framebuffer and pass them to a daemon that drives the LEDs.
Now I am about to dig into the OpenMAX API, which seems pretty unusual. Where should I start? Is this a feasible approach?
If you want the decoded data, just don't send it to the renderer. Instead of rendering, take the data and do whatever you want with it. The decoded data should be taken from the output port of the video_decode OpenMAX IL component. You will probably also need to set the correct output pixel format: configure the component's output port for the format you need, so that the conversion is done by the GPU (YUV or RGB565 are available).
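If the output port is set to RGB565, the daemon still has to turn those pixels into LED colors. A minimal sketch, assuming native-endian 16-bit pixels with red in the high bits and a stride given in pixels: unpack565 expands a pixel to 8 bits per channel, and averageRegion (a hypothetical helper) computes the average color of one rectangle, such as the strip of screen behind one LED.

```cpp
#include <cstddef>
#include <cstdint>

struct Rgb { std::uint8_t r, g, b; };

// Expand one RGB565 pixel to 8 bits per channel by bit replication,
// so a full-scale 5-bit value (0x1F) maps to 0xFF rather than 0xF8.
Rgb unpack565(std::uint16_t p) {
    const unsigned r5 = (p >> 11) & 0x1F;
    const unsigned g6 = (p >> 5) & 0x3F;
    const unsigned b5 = p & 0x1F;
    return { static_cast<std::uint8_t>((r5 << 3) | (r5 >> 2)),
             static_cast<std::uint8_t>((g6 << 2) | (g6 >> 4)),
             static_cast<std::uint8_t>((b5 << 3) | (b5 >> 2)) };
}

// Average colour of a w x h rectangle at (x0, y0) in an RGB565 frame.
Rgb averageRegion(const std::uint16_t* frame, std::size_t stride,
                  std::size_t x0, std::size_t y0, std::size_t w, std::size_t h) {
    unsigned long r = 0, g = 0, b = 0;
    for (std::size_t y = y0; y < y0 + h; ++y)
        for (std::size_t x = x0; x < x0 + w; ++x) {
            const Rgb c = unpack565(frame[y * stride + x]);
            r += c.r; g += c.g; b += c.b;
        }
    const unsigned long n = static_cast<unsigned long>(w * h);
    return { static_cast<std::uint8_t>(r / n), static_cast<std::uint8_t>(g / n),
             static_cast<std::uint8_t>(b / n) };
}
```

Averaging a region rather than sampling single pixels keeps the LEDs from flickering on fine detail.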
First, I think you should attach a buffer to the output of the camera component, do whatever you want with that frame on the CPU, and then send the frame through a buffer attached to the input port of the renderer. It's not going to be a trivial task, since there is little documentation about OpenMAX on the Raspberry Pi.
Best place to start:
https://jan.newmarch.name/RPi/
Best place to have on hands:
http://home.nouwen.name/RaspberryPi/documentation/ilcomponents/index.html
Next best place: source code scattered across the internet.
Good luck.
I'm trying to write an application in C++ that records and saves the screen on the Windows platform. I'm not sure where to start with this. I assume I need some sort of API (FFmpeg, maybe OpenGL?). Could someone point me in the right direction?
You could start by looking at the Windows Remote Desktop Protocol; maybe some programming libraries are available for it.
I know of a product that intercepts calls into the Windows GDI DLL and uses them to record the screen drawing activity.
A far simpler approach would be to take screenshots as often as possible and somehow minimize redundant data (parts of the screen that didn't change between frames).
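One way to find the parts that didn't change is to compare consecutive screenshots in fixed-size tiles and keep only the tiles that differ. A sketch, assuming both frames are the same size, tightly packed at 32 bits per pixel; Tile and changedTiles are hypothetical names, and the tile size is a tuning parameter.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Tile { std::size_t x, y, w, h; };

// Compare two frames tile by tile and return the tiles whose pixels changed.
// Edge tiles are clipped when width/height are not multiples of the tile size.
std::vector<Tile> changedTiles(const std::uint32_t* prev, const std::uint32_t* cur,
                               std::size_t width, std::size_t height, std::size_t tile) {
    std::vector<Tile> dirty;
    for (std::size_t ty = 0; ty < height; ty += tile)
        for (std::size_t tx = 0; tx < width; tx += tile) {
            const std::size_t w = std::min(tile, width - tx);
            const std::size_t h = std::min(tile, height - ty);
            bool changed = false;
            // Compare one tile row at a time; stop at the first difference.
            for (std::size_t y = ty; y < ty + h && !changed; ++y)
                changed = std::memcmp(prev + y * width + tx, cur + y * width + tx,
                                      w * sizeof(std::uint32_t)) != 0;
            if (changed) dirty.push_back({tx, ty, w, h});
        }
    return dirty;
}
```

Only the returned tiles (position plus pixels) would then need to be stored or sent.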
If the desired output of your app is a video file (like MPEG), you are probably better off just grabbing frames and feeding them into a video encoder. I don't know how fast the encoders are these days. FFmpeg would be a good place to start.
If the encoder turns out not to be fast enough, you can try storing the frames and encoding the video file afterwards. Consecutive frames should have many matching pixels, so you could use that to reduce the amount of data stored.
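Storing each frame relative to the previous one can exploit those matching pixels. The sketch below uses a hypothetical storage format: repeated groups of [unchanged-pixel count, changed-pixel count, changed pixels...], which is a simple run-length scheme over unchanged pixels, not any standard codec. It assumes both frames have the same number of 32-bit pixels, and decodeDelta rebuilds the frame exactly so it can be fed to the encoder later.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode `cur` relative to `prev` as groups of
// [unchanged count, changed count, changed pixels...].
std::vector<std::uint32_t> encodeDelta(const std::vector<std::uint32_t>& prev,
                                       const std::vector<std::uint32_t>& cur) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < cur.size()) {
        std::size_t same = 0;
        while (i + same < cur.size() && cur[i + same] == prev[i + same]) ++same;
        std::size_t diff = 0;
        while (i + same + diff < cur.size() && cur[i + same + diff] != prev[i + same + diff]) ++diff;
        out.push_back(static_cast<std::uint32_t>(same));
        out.push_back(static_cast<std::uint32_t>(diff));
        for (std::size_t j = 0; j < diff; ++j) out.push_back(cur[i + same + j]);
        i += same + diff;
    }
    return out;
}

// Rebuild the current frame from the previous frame and the encoded delta.
std::vector<std::uint32_t> decodeDelta(const std::vector<std::uint32_t>& prev,
                                       const std::vector<std::uint32_t>& delta) {
    std::vector<std::uint32_t> cur = prev;
    std::size_t pos = 0, k = 0;
    while (k < delta.size()) {
        pos += delta[k++];                       // skip unchanged pixels
        const std::uint32_t n = delta[k++];      // number of changed pixels
        for (std::uint32_t j = 0; j < n; ++j) cur[pos++] = delta[k++];
    }
    return cur;
}
```

A mostly static desktop then costs only a couple of integers per frame plus the pixels that actually changed.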