How to properly validate ffmpeg pts/dts after demuxing/decoding? - c++

How should I validate pts/dts after demuxing and then after decoding?
It is important for me to have valid pts at all times, for days and
possibly weeks of continuous streaming.
After demuxing I check:
dts <= pts
prev_packet_dts < next_packet_pts
I also discard packets with AV_NOPTS_VALUE and wait for packets with a
proper pts, because in that case I do not know the duration.
The pts of packets may not be monotonically increasing because of I-P-B frame reordering.
Is all of that correct?
What about decoded AVFrames?
Should 'pts' be increasing all the time?
Why could 'pts' lag behind 'dts' at some point?
Why is pict_type a field of AVFrame? Shouldn't it be in AVPacket, since
AVPacket is the compressed frame and AVFrame is not?
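
For reference, here is roughly how I implement those checks, as a minimal sketch only (the validate_packet helper and the per-stream last_dts bookkeeping are my own illustration):

extern "C" {
#include <libavcodec/avcodec.h>
}

// Sketch: reject packets with missing timestamps, pts < dts, or a pts that
// does not exceed the previous packet's dts. Call per stream, with last_dts
// initialized to AV_NOPTS_VALUE.
bool validate_packet(const AVPacket* pkt, int64_t& last_dts)
{
    if (pkt->pts == AV_NOPTS_VALUE || pkt->dts == AV_NOPTS_VALUE)
        return false;                 // wait for packets with proper timestamps
    if (pkt->dts > pkt->pts)
        return false;                 // dts must not exceed pts
    if (last_dts != AV_NOPTS_VALUE && last_dts >= pkt->pts)
        return false;                 // prev_packet_dts must be < next_packet_pts
    last_dts = pkt->dts;
    return true;                      // pts itself may still jump around (B-frames)
}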

Ideally, yes. Unless your format allows discontinuities, or wraps timestamps around due to overflow, like MPEG-TS.
That would be a writing (muxing) error.
It is an informational field, indicating the provenance of the frame. It can be used by filters or encoders, e.g. for keyframe alignment during a re-encode.

From libav support I was advised not to rely on the decoder output. It is more robust to produce pts/dts for encoding/muxing manually, and to look at the ffmpeg tools sources for a proper implementation. I will pursue this approach.
For now I only discard AVFrames with AV_NOPTS_VALUE, and the rest of the encoding/muxing works fine.
Validation of AVPackets after demuxing remains the same as described above.
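
Here is roughly the kind of manual timestamping I mean, as a sketch only (frame_index, enc_ctx, out_stream, ofmt_ctx and pkt are placeholders, and I assume an encoder time base of 1/fps):

// Sketch: stamp frames ourselves instead of trusting decoder timestamps,
// then rescale to the muxer's time base before writing.
frame->pts = frame_index++;                       // strictly increasing, in enc_ctx->time_base units

avcodec_send_frame(enc_ctx, frame);
while (avcodec_receive_packet(enc_ctx, pkt) == 0) {
    av_packet_rescale_ts(pkt, enc_ctx->time_base, out_stream->time_base);
    pkt->stream_index = out_stream->index;
    av_interleaved_write_frame(ofmt_ctx, pkt);
}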

Related

IMediaSample Time and MediaTime

What is the primary difference between SetTime and SetMediaTime?
Right now, in my DirectShow live source, I calculate the time like this:
REFERENCE_TIME rtStart = m_rtLastSampleTime;
m_rtLastSampleTime += pVih->AvgTimePerFrame;
pms->SetTime(&rtStart, &m_rtLastSampleTime);
pms->SetSyncPoint(TRUE);
pms->SetDiscontinuity(rtStart <= 1);
This doesn't work with some encoders.
I've noticed that sources that do work with these encoders set the media time, and those values seem to jump up.
Media Times:
Optionally, the filter can also specify a media time for the sample. In a video stream, media time represents the frame number. In an audio stream, media time represents the sample number in the packet. For example, if each packet contains one second of 44.1 kilohertz (kHz) audio, the first packet has a media start time of zero and a media stop time of 44100. In a seekable stream, the media time is always relative to the start time of the stream. For example, suppose you seek to 2 seconds from the start of a 15-fps video stream. The first media sample after the seek has a time stamp of zero but a media time of 30.
Renderer and mux filters can use the media time to determine whether frames or samples have been dropped, by checking for gaps. However, filters are not required to set the media time. To set the media time on a sample, call the IMediaSample::SetMediaTime method.
I don't think it is actually used anywhere. SetTime, on the contrary, is important.
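If you still want to set it, the quoted documentation says the media time for video is just the frame number, so a minimal sketch would be (m_llFrameNumber is a hypothetical member counter, not part of your original code):

// Alongside SetTime, stamp each sample with its frame number so that
// downstream filters can detect dropped frames by looking for gaps.
LONGLONG llMediaStart = m_llFrameNumber;
LONGLONG llMediaStop  = m_llFrameNumber + 1;
pms->SetMediaTime(&llMediaStart, &llMediaStop);
m_llFrameNumber++;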

RTP timestamp not linear?

I was trying to reconstruct an audio conversation (an A-B call using G.711 audio) using the RTP timestamps. I filled silence using the difference of two RTP timestamps and the sampling rate. The conversation went out of sync, and then I saw that the RTP timestamps are not linear. I was not able to derive the exact clock time from the RTP timestamps, which resulted in sync issues. How do I calculate the exact time?
I have the same problem with a stream provided by GStreamer, which doesn't provide monotonic timestamps.
For example: the difference between the stamps should be exactly 1920, but it is between ~120 and ~3500, averaging 1920.
The problem here is that there is no way to detect missing samples, because you never know whether a large difference comes from encoder delay or from a missing sample.
If you only have audio to decode, I would try to assign "valid" PTS values to each sample (in my case basetime+1920, basetime+3840 and so on), as sketched below.
The big problem comes when video AND audio are combined. Then this trick doesn't work well when samples are missing, and there is no way to find out when that is the case :(
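A tiny sketch of that audio-only workaround (basetime, the fixed 1920 increment and the AudioPacket container are illustrative, not from any real API):

#include <cstdint>
#include <vector>

struct AudioPacket { int64_t pts; /* ... payload ... */ };   // illustrative

// Ignore the jittery incoming stamps and assign synthetic, strictly
// increasing PTS values at a fixed per-packet increment.
void restamp(std::vector<AudioPacket>& packets, int64_t basetime, int64_t increment = 1920)
{
    int64_t pts = basetime;
    for (AudioPacket& p : packets) {
        p.pts = pts;
        pts += increment;
    }
}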
When you want to send RTP you should be aware of two things:
The timestamp increment depends on the amount of audio data sent per packet.
E.g. for PT=10, you may have this pattern:
1160 bytes, timestamp increment 1154, and wait 26 ms.
Let's see how this calculation happens:
number of packets to be sent in one second: 1/(26 ms) ≈ 38
timestamp increment: clock rate / 38 ≈ 1154
According to RFC 3550 (https://www.ietf.org/rfc/rfc3550.txt):
The sampling instant MUST be derived from a clock that increments
monotonically
It is neither a choice nor an option. By the way, please read the full description of the timestamp field of the RTP packet; there I also found this:
As an example, for fixed-rate audio
the timestamp clock would likely increment by one for each
sampling period. If an audio application reads blocks covering
160 sampling periods from the input device, the timestamp would be
increased by 160 for each such block, regardless of whether the
block is transmitted in a packet or dropped as silent.
If you want to check linearity, use the RTP and NTP timestamp fields of the RTCP SR (sender report). In the SR report, the RTP timestamp corresponds to the NTP timestamp.
So take the difference of two consecutive RTP timestamps (let's call them dRTP_1, dRTP_2, ...) and the difference of two consecutive NTP timestamps (let's call them dNTP_1, dNTP_2, ...), then divide each dRTP_* by the clock rate and check whether you get the corresponding dNTP_*, as sketched below.
But first please read the RFC.
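A rough sketch of that linearity check (the SenderReport struct and clock_rate are placeholders for whatever your RTCP parser provides):

#include <cmath>
#include <cstdint>

struct SenderReport { uint32_t rtp_ts; double ntp_seconds; };   // filled from an RTCP SR

// The RTP delta divided by the clock rate should match the NTP delta
// (in seconds) if the timestamps are linear.
bool roughly_linear(const SenderReport& a, const SenderReport& b,
                    double clock_rate, double tolerance = 1e-3)
{
    double dRTP = static_cast<double>(b.rtp_ts - a.rtp_ts) / clock_rate;  // unsigned diff handles wrap
    double dNTP = b.ntp_seconds - a.ntp_seconds;
    return std::fabs(dRTP - dNTP) < tolerance;
}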

Compressing PCM data

I'm using WinAPI - Wave functions to create a recording program that records the microphone for X seconds. I've searched a bit over the net, and found out PCM data is too large, and it'll be a problem to send it through sockets...
How can I compress it to something smaller? Any simple / "cheap" way ?
I've also noticed that when declaring the format using the Wave API functions, I'm using this code:
WAVEFORMATEX pFormat;
pFormat.wFormatTag= WAVE_FORMAT_PCM; // simple, uncompressed format
pFormat.nChannels=1; // 1=mono, 2=stereo
pFormat.nSamplesPerSec=sampleRate; // 44100
pFormat.nAvgBytesPerSec=sampleRate*2; // = nSamplesPerSec * n.Channels * wBitsPerSample/8
pFormat.nBlockAlign=2; // = n.Channels * wBitsPerSample/8
pFormat.wBitsPerSample=16; // 16 for high quality, 8 for telephone-grade
pFormat.cbSize=0;
As you can see, pFormat.wFormatTag = WAVE_FORMAT_PCM;
maybe I can use something other than WAVE_FORMAT_PCM there, so the data is compressed right away?
I've checked MSDN for other values, though none of them works for me in my Visual Studio...
So what can I do?
Thanks!
The simplest way is to simply reduce your sample rate from 44100 to something more manageable like 22050, 16000, 11025, or even 8000. Most voice codecs don't go higher than 16000 Hz anyway, and the older ones are optimized for 8 kHz.
The next step is to find a codec. There's a handful of codecs to use with the Windows Audio Compression Manager, but almost all of them date back to Windows 95 and sound terrible by modern standards after being decompressed.
You can always convert to WMA in real time using the Format SDK or with Media Foundation APIs. Or just go get an open source MP3 library like LAME.
For telephone quality speech you can change to 8 bits per sample and a sample rate of 8000. This will greatly reduce the amount of data.
GSM has good compression. You can convert a block of PCM data to GSM (or any other codec you have installed) using acmStreamConvert(). Refer to MSDN for more details:
Converting Data from One Format to Another
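A condensed sketch of that conversion for telephone-grade input (8 kHz, 16-bit mono PCM to GSM 6.10; error handling is omitted and the GSM block parameters are the usual ones, so verify them with acmFormatSuggest on your system):

#include <windows.h>
#include <mmreg.h>
#include <msacm.h>
#pragma comment(lib, "msacm32.lib")

// Convert a buffer of 16-bit mono 8 kHz PCM to GSM 6.10 via the ACM.
// Returns the number of compressed bytes written to 'gsm'.
DWORD PcmToGsm(const BYTE* pcm, DWORD pcmBytes, BYTE* gsm, DWORD gsmBytes)
{
    WAVEFORMATEX src = {};                 // source: uncompressed PCM
    src.wFormatTag = WAVE_FORMAT_PCM;
    src.nChannels = 1;
    src.nSamplesPerSec = 8000;
    src.wBitsPerSample = 16;
    src.nBlockAlign = src.nChannels * src.wBitsPerSample / 8;
    src.nAvgBytesPerSec = src.nSamplesPerSec * src.nBlockAlign;

    GSM610WAVEFORMAT dst = {};             // destination: GSM 6.10
    dst.wfx.wFormatTag = WAVE_FORMAT_GSM610;
    dst.wfx.nChannels = 1;
    dst.wfx.nSamplesPerSec = 8000;
    dst.wfx.nBlockAlign = 65;              // one GSM block is 65 bytes...
    dst.wfx.nAvgBytesPerSec = 1625;        // ...covering 320 samples: 8000/320*65
    dst.wfx.cbSize = sizeof(GSM610WAVEFORMAT) - sizeof(WAVEFORMATEX);
    dst.wSamplesPerBlock = 320;

    HACMSTREAM hStream = NULL;
    acmStreamOpen(&hStream, NULL, &src, &dst.wfx, NULL, 0, 0, 0);

    ACMSTREAMHEADER hdr = {};
    hdr.cbStruct = sizeof(hdr);
    hdr.pbSrc = const_cast<BYTE*>(pcm);
    hdr.cbSrcLength = pcmBytes;
    hdr.pbDst = gsm;
    hdr.cbDstLength = gsmBytes;
    acmStreamPrepareHeader(hStream, &hdr, 0);
    acmStreamConvert(hStream, &hdr, 0);    // the actual PCM -> GSM conversion
    acmStreamUnprepareHeader(hStream, &hdr, 0);
    acmStreamClose(hStream, 0);
    return hdr.cbDstLengthUsed;
}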

Proper implementation of libspotify get_audio_buffer_stats callback

Can anyone help decipher the correct implementation of the libspotify get_audio_buffer_stats callback? Specifically, we are supposed to populate an sp_audio_buffer_stats struct, consisting of samples and stutter.
According to the Docs:
int samples - Samples in buffer.
int stutter - Number of stutters (audio dropouts) since last query.
I'm wondering about "samples." What exactly is this referring to?
The music playback (music_delivery) callback has a num_frames variable, but then you have the issue of the audio format (channels and/or sample_rate).
Is it correct to set "samples" to the total number of "num_frames" currently in my buffer? Or do I need to run some math based on the total "num_samples", "channels", and "sample_rate"?
It should be the number of frames in your output buffer. I.e. int samples is slightly misnamed and should probably be called int frames instead.
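Under that interpretation, a minimal sketch of the callback could look like this (g_buffered_frames and g_stutter_count are hypothetical counters maintained by your own audio output code):

#include <libspotify/api.h>

// Hypothetical counters updated by the playback/output thread.
static int g_buffered_frames = 0;   // frames currently queued for output
static int g_stutter_count = 0;     // dropouts since the last query

static void get_audio_buffer_stats(sp_session *session, sp_audio_buffer_stats *stats)
{
    stats->samples = g_buffered_frames;   // really frames, despite the field name
    stats->stutter = g_stutter_count;     // dropouts since the last call
    g_stutter_count = 0;                  // reset, because the count is "since last query"
}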

FFMPEG reading keyframes

I am trying to write a C++ program that reads key frames from a video file using ffmpeg.
So far I have managed to get all the frames using av_read_frame, which reads sequentially, frame by frame.
But I am having some problems with av_seek_frame, which (if I am correct) is supposed to do the trick for keyframes.
int av_seek_frame(AVFormatContext *s, int stream_index, int64_t timestamp, int flags);
I have the AVFormatContext, but what are the other correct arguments to sequentially get only the keyframes?
Is there other function that I can use instead?
Thanks
EDIT: With av_read_frame I get an AVPacket, which I can use to get frame data, but how can I get a packet by using av_seek_frame?
SOLUTION: There is a simple boolean field, AVFrame->key_frame. It is true if the frame is a keyframe.
av_seek_frame has the ability to seek to a certain timestamp in a video file. It takes 4 parameters: a pointer to the AVFormatContext, a stream index, the timestamp to seek to and flags to select the direction and seeking mode.
The function will then seek to the first key frame before the given timestamp.
Check the documentation of that function for more information.
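To tie it together, here is a sketch of seeking to the key frame at or before a given time and then reading from there (fmt_ctx, dec_ctx, stream_index and target_sec are placeholders; most error checks are omitted):

extern "C" {
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>
}

// Seek to the last key frame at or before target_sec, then read packets.
AVStream* st = fmt_ctx->streams[stream_index];
int64_t target = av_rescale_q((int64_t)(target_sec * AV_TIME_BASE),
                              AV_TIME_BASE_Q, st->time_base);

// AVSEEK_FLAG_BACKWARD makes the demuxer land on the key frame at or before 'target'.
av_seek_frame(fmt_ctx, stream_index, target, AVSEEK_FLAG_BACKWARD);
avcodec_flush_buffers(dec_ctx);            // drop any stale decoder state

AVPacket* pkt = av_packet_alloc();
while (av_read_frame(fmt_ctx, pkt) >= 0) { // reading now resumes from the key frame
    if (pkt->stream_index == stream_index && (pkt->flags & AV_PKT_FLAG_KEY)) {
        // this packet contains a key frame; decode it, or note pkt->pts and
        // seek again past it to walk through the keyframes one by one
        break;
    }
    av_packet_unref(pkt);
}
av_packet_free(&pkt);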