GOP size for realtime video stream - c++

I'm working on a kind of rich remote desktop system, with a video stream of the desktop encoded using avcodec/x264. I have to set manually the GOP size for the stream, and so far I was using a size of fps/2.
But I've just read the following on Wikipedia:
This structure [Group Of Picture# suggests a problem because the fourth frame (a P-frame) is needed in order to predict the second and the third (B-frames). So we need to transmit the P-frame before the B-frames and it will delay the transmission (it will be necessary to keep the P-frame).
It means I'm creating a lot of latency since the client needs to receive at least half of the GOP to output the first frame following the I frame. What is the best strategy for the GOP size if I want the smallest latency possible ? A gop of 1 picture ?

If you want to minimize latency with h264, you should generally avoid b-frames. This way the decoder has at least a chance to emit decoded frames early. This prevents decoder-induced latency.
You may also want to tune the encoder for latency, by reducing/disabling look-ahead. x264 has a "zero-latency" setting which should be a good starting point for finding you optimal settings.
The "GOP" size (which afaik is not really defined for h264; I'll just assume you mean the I(DR)-frame interval) does not necessarily affect the latency. This parameter only affects how long a client will have to wait until it can "sync" on the stream (time-to-first-picture).

Related

Seeking within MP3 file

I am working on the development of driving software for the hardware implementation by these people. The decoder works properly in overall, but I am struggling making it starting playing the sound at the middle. I suspect that it is common feature of the MP3 decoders as they must have some history of data in order to properly construct current sound (I am not that skilled in MPEG, however have an idea of some basics).
The problem is that this decoder is a black box, and any deepening in its code is an enormous time and effort.
I empirically found out that the sound garbage, when starting somewhere in the middle, happens in no more that 1 (one) seconds after start with file # 320 kbps and 44100 sampling rate. I am actually ok to mute decoder for a second (while it gathers/decodes proper required data for further playback), and then unmute it to continue playback.
I did search on the internet for the matter, did not find anything useful. Tried to invalidate first frames by corrupting frame headers (the easiest that could be done without going into the MP3 headers/data), made things even worse.
Questions:
is there any body of knowledge of how players perform seek in MP3 files and keep non-corrupt sound?
Is my action plan seem valid - mute for 1 second while decoder plays garbage? Is there any way to (easily) calculate the time I must mute output for?
Update: just tried on another file # 128 kbps/48k and maximal garbage time to be about 2 seconds... I can not believe that decoder with so limited resources - input buffer used is 2 kB with some intermediate working buffers, in total must be not more than 36 kB - can keep the history for 2 seconds, or decoder is having problems finding the sync word in the stream... and thus my driver needs to figure out the frame start (by finding out sync word, reading frame header, calculating frame size, and looking after the frame to contain another sync word).
I've found workarounds. The difficulty was that there are actually two problems overlaying each other, but was easy to cope with having structured approach.
The decoder is having issues getting the first sync word of the stream, and works very well when the first bytes supplied to it are FF FB or FF FA. All other bytes - in the middle of the frame - with very high probability, cause major sound corruption, until decoder catches correct sync. Thus I designed the code seeking to the next frame start after the seek point, checking that this is actual start of the frame by calculating frame size and looking at the next frame to contain FFFB/FA.
Having fixed the problem 1 I have had minor corruption left from the decoder starting decoding the frame without historical data. I have solved it by muting the decoder for the first 4 buffering transactions.
Major corruption still happens, but is rare, and it seems that nature of corruption depends on what was in the decoder buffers (not only Huffman input buffer, but other intermediate buffers) before the decoder is instructed to start. My hardware performs clear of the input buffers to 0 when decoder is in reset state, but it seems to be not enough (or just incorrect)...
The decoder itself is a kind of PoC (proof of concept) work, a student term with the aim to prove that they were able to make it; the package is having test bench code, but lacks low level documentation/comments in the code, and is not ready for field implementation and production. In general the fact that it worked for me at all (almost) out of the box makes the honor to the developers and is a mark of high quality of their work. I have reviewed and tried several published projects for MP3 decoders for silicon implementation (FPGA) and concluded that this one is the best available. In addition, the license they provide their work on is generous one.
Update: my research have shown that the most problem lies not in the input buffer (however it is possible to improve the situation by uploading 528 bytes of historical data to the decoder's buffer so that it would be able to grab main data from previous frame), but in the internal state of the decoder. Its documentation says:
To reduce resource usage, part of the RAM for buffering the intermediate data is also shared with Huffman decoding as bit reservoir ...
thus it is a contents of the reservoir and intermediate computed data affecting the decoding. I have confirmed it by starting various set of frames in different sequence, and if set of frames are played in different sequence, nature of garbage changes, or garbage may simply not appear.
Thus, unfortunately, my conclusion: it is not possible to properly seek using this decoder as is. I even do not think it is possible to "fake" playback (to quickly "play" the file till the needed point in buffers) as all three clocks are tied to each other.
I will keep my "best tested" implementation, with the notes on the quality.
Update 2: I was wrong, it is possible to seek softly, but to mitigate the sound corruption (yes, I am still unsure if I fixed it completely) I had to find another deficiency in the decoder: it is related to timing, decoder assumes that further data is always available in the buffer, while it may not be there yet. (It is actually clear from the test bench code supplied within the IP - the way data was replenished during QA and testing). In the cases I caught the corruption, first frames in the first part of the input buffer RAM were not decoded properly, skipped, and decoder quickly skips to second part of the RAM, assuming new data is there, however driving hardware is not ready yet fetching required data and putting this data into the second part of decoder's buffer RAM, thus corruption persisted for quite a long time with decoder looping skipping "invalid" frames until it catches correct image of the frame and normalizes its pace through the buffer.
Now the solution:
play (almost) 5 frames of silence through decoder before unmuting it. This will ensure all decoder's internal buffers are purged. It will not take much time, however requires some coding;
introduce a possibility to set huffman's decoder starting pointer readptr (in huffctl.v) after reset into the value other than 0. It will give the flexibility to have some history data uploaded into the decoder's buffer and start huffman decoder from the middle of the buffer rather than from its very start;
calculate the position to seek to, it calculates relatively easily for MPEG-1 Layer-3: duration=(filesize-ID3size)/(bitrate/8*1000), newPosition=ID3size+seekTime*(bitrate/8*1000). Duration is needed to check that position to seek to fits into the play time, alternatively newPosition can be used to check against file size. These forumlas do not take into account older tag versions appearing at the end of the file, but they are usually not more than 128 bytes, thus a kind of negligible for timing calculation relative to average MP3 sound file size; it also assumes CBR (VBR will require completely different way, requiring more power and data I/O for accurate seeking). Funny enough I found web pages with incorrect duration calculation formula, thus beware posts by ignorant people with cool job titles;
Seek to the calculated position, find next frame from this position on, calculate frame size, and ensure that there's next valid frame at that distance. New pointer will point to this next frame found at the distance;
find out the main_data_begin lookback pointer of the frame now being pointed to at step 4. Decrease the new pointer by this value so that pointer points within previous frame to the start of the main data for the current frame - it will be a pointer for the decoder data start. Note that it will fail if main data begins in more than one frame back (removal of headers of previous frame(s) will be required for proper operation);
fill decoder's buffer starting pointer identified in step 5, and set decoder's decoding start pointer to the one identified in step 4. While the implementation assumes you fill buffer in halves, do it different from the start: fill the whole buffer instead of just a first half. For this, after reset, set idle bit, check for data request, reset idle bit, perform two 1024 byte transfers to the decoder's buffer (effectively filling it completely), and then set idle bit, then reset it, and then set it again;
after performing step 7 continue normally replenishing 1024 bytes per decoder's request.
Employing this plan I had zero sound corruption cases. As you see it requires some changes to Verilog, but it must be easy if you know basics or hardware, know Verilog amd can perform reverse engineering.

How to detect camera frame loss using Windows media API like Media Foundation or DirectShow?

I am writing an application for Windows that runs a CUDA accelerated HDR algorithm. I've set up an external image signal processor device that presents as a UVC device, and delivers 60 frames per second to the Windows machine over USB 3.0.
Every "even" frame is a more underexposed frame, and every "odd" frame is a more overexposed frame, which allows my CUDA code perform a modified Mertens exposure fusion algorithm to generate a high quality, high dynamic range image.
Very abstract example of Mertens exposure fusion algorithm here
My only problem is that I don't know how to know when I'm missing frames, since the only camera API I have interfaced with on Windows (Media Foundation) doesn't make it obvious that a frame I grab with IMFSourceReader::ReadSample isn't the frame that was received after the last one I grabbed.
Is there any way that I can guarantee that I am not missing frames, or at least easily and reliably detect when I have, using a Windows available API like Media Foundation or DirectShow?
It wouldn't be such a big deal to miss a frame and then have to purposefully "skip" the next frame in order to grab the next oversampled or undersampled frame to pair with the last frame we grabbed, but I would need to know how many frames were actually missed since a frame was last grabbed.
Thanks!
There is IAMDroppedFrames::GetNumDropped method in DirectShow and chances are that it can be retrieved through Media Foundation as well (never tried - they are possibly obtainable with a method similar to this).
The GetNumDropped method retrieves the total number of frames that the filter has dropped since it started streaming.
However I would question its reliability. The reason is that with these both APIs, the attribute which is more or less reliable is a time stamp of a frame. Capture devices can flexibly reduce frame rate for a few reasons, including both external like low light conditions and internal like slow blocking processing downstream in the pipeline. This makes it hard to distinguish between odd and even frames, but time stamp remains accurate and you can apply frame rate math to convert to frame indices.
In your scenario I would however rather detect large gaps in frame times to identify possible gap and continuity loss, and from there run algorithm that compares exposure on next a few consecutive frames to get back to sync with under-/overexposition. Sounds like a more reliable way out.
After all this exposure problem is highly likely to be pretty much specific to the hardware you are using.
Normally MFSampleExtension_Discontinuity is here for this. When you use IMFSourceReader::ReadSample, check this.

Are there any constraints to encode a audio signal?

I capture a pcm sound at some sampling rate, e.g. 24 kHz. I need to encode it using some codec (I use Opus for that) to send over network. I noticed that at some sampling rate I use for encoding with Opus, I often hear some extra "cracking" noise at the receiving end. At other rates, it sounds ok. That might be an implementation bug, but I though there might be some constraints also that I don't know.
I also noticed that if I use another sampling rate while decoding Opus-encoded audio stream, I get a lower or higher pitch of sound, which seems logical to me. So I've read, that I need to resample on the other end, if receiving side doesn't support the original PCM sampling rate.
So I have 2 questions regarding all this:
Are there any constraints on sampling rate (or other parameters) of audio encoding? (Like I have a 24kHz pcm sound - maybe there are certain sample rates to use with it?)
Are there any common techniques to provide the same sound quality at both sides when sending audio stream over network?
The crackling noises are most likely a bug, since there is no limitations to the samplerate that would result in this kind of noise (there are other kinds of signal changes that come with sample rate conversion, especially when downsampling to a lower samplerate; but definitely not crackling).
A wild guess would be, that there is something wrong with the input buffer. Crackling often occurs if samples are omitted or duplicated, oftentimes the result of the boundaries of subsequent buffers not being correct.
Sending audio data over network in realtime will require compression, no matter what. The required data rate is simply too high. There are codecs which provide lossless audio compression (e.g. FLAC), but their compression ratio is comparatively low compared to e.g. Opus.
The problem was solved by buffering packets at receiving end and writing them to the soundcard buffer as soon as some amount has been reached. The 'crackling' noise was then most likely due to the gaps between subsequent frames that were sent to the soundcard buffer

How to implement a TCP traffic limitation for the sender side?

I'm about to implement a webcam video chat system for multiple users in C++ (windows/linux). As the 'normal' user is usually connected via DSL/cable, there is a strong bandwidth limitation for my (prefered) TCP/IP connections.
The basic idea is to transmit the highest possible framerate given a bandwidth limitation for the sender side. (Other applications may still require internet bandwidth in the background.) In a second step, the camera-capture-rate shall be automatically adjusted to the network limitations to avoid unncessary CPU overhead.
What I have is a constant stream of compressed images (with strongly variing buffer sizes) that have to be transmitted to the remote side. Given a limitation of let's say 20kb/s, how do I best implement that limitation? (Note that the user shall define this limit!)
Thx in advance,
Mayday
Edit: Question clearifications (sry!)
It's about how to traffic-shape an arbitrary TCP/IP connection.
It's not how to implement image rate/quality reduction as my use-case suggests. (Altough I didn't consider to automatically adjust image compression, yet. (Thx Jon))
There are two things you can do to reduce your bandwidth:
Send smaller images (more compression)
Send less images
When implementing an algorithm that picks image size and quantity to honor the user-selected limit, you have to balance between a simple/robust algorithm and a performant algorithm (one that makes maximum use out of the limit).
The first approach I would try is to use a rolling average of the bandwidth you are using at any point in time to "seed" your algorithm. Every once in a while, check the average. If it becomes more than your limit, instruct the algorithm to use less (in proportion to how much you overstepped the limit). If it becomes significantly lower than your limit, say less than 90%, instruct the algorithm to use more.
The less/more instruction might be a variable (maybe int or float, really there is much scope for inventiveness here) used by your algorithm to decide:
How often to capture an image and send it
How hard to compress that image
You need a buffer / queue of at least 3 frames:
One frame currently being sent to the network;
One complete frame to be sent next;
One frame currently being copied from the camera.
When the network sender finishes sending a frame, it copies the "to be sent next" frame to the "currently sending" slot. When the camera reader finishes copying a frame from the camera, it replaces the "to be sent next" frame with the copied frame. (Obviously, synchronisation is required around the "to be sent next" frame).
The sender can then modulate its sending rate as it sees fit. If it's running slower than the camera, it will simply drop frames.

C/C++ library for seekable movie format

I'm doing some processing on some very large video files (often up to 16MP), and I need a way to store these videos in a format that allows seeking to specific frames (rather than to times, like ffmpeg). I was planning on just rolling my own format that concatenates all of the individually zlib compressed frames together, and then appends an index on the end that links frame numbers to file byte indices. Before I go about this though, I just wanted to check to make sure I'm not duplicating the functionality of another format/library. Has anyone heard of a format/library that allows lossless compression and random access of videos?
The reason it is hard to seek to a specific frame in most video codecs is that most frames depend on another frame or frames, so frames must be decoded as a group. For this reason, most libraries will only let you seek to the closest I-frame (Intra-frame - independently decodable frame). To actually produce an image from a non-I-frame, data from other frames is required, so you have to decode a number of frames worth of data.
The only ways I have seen this problem solved involve creating an index of some kind on the file. In other words, make a pass through the file and create an index of what frame corresponds to a certain time or section of the file. Since the seeking functions of most libraries are only able to seek to an I frame so you may have to seek to the closest I-frame and then decode from there to the exact frame you want.
If space is not of high importance, I would suggest doing it like you say, but use JPEG compression instead of zlib as it will give you a lot higher compression ratio since it exploits the fact you are dealing with image data.
If space is an issue, P frames (depend on previous frame/frames) can greatly reduce the size of the file. I would not mess with B frames (depend on previous and future frame/frames) since they make it much harder to get things right.
I have solved the problem of seeking to a specific frame in the presence of B and P frames in the past using ffmpeg (libavformat) to demux the video into packets (1 frame's worth of data per packet) and concatenate these into a single file. The important thing is to keep and index into that file so you can find packet bounds for a given frame. If the frame is an I-frame, you can just feed that frame's data into an ffmpeg decoder and it can be decoded. If the frame is a B or P frame, you have to go back to the last I-frame and decode forward from there. This can be quite tricky to get right, especially for B-frames since they are often sent in a different order than how they are displayed.
Some formats allow you to change the number of key frames per second.
For example, I've used ffmpeg to encode to flv at 25 frames per second with 25 key frames per second, and then used a player that was fine in moving to key frames. Basically this allowed me to do frame by frame seeking.
Also the last time I checked quicktime can do frame by frame seek without having to have each frame being a key frame.
May not be applicable to you but that's my thoughts.