I want to know, once and for all, how time base calculation and rescaling work in FFmpeg.
Before getting to this question I did some research and found many contradictory answers, which only make it more confusing.
So, based on the official FFmpeg examples, one has to
rescale output packet timestamp values from codec to stream timebase
with something like this:
pkt->pts = av_rescale_q_rnd(pkt->pts, *time_base, st->time_base, AV_ROUND_NEAR_INF|AV_ROUND_PASS_MINMAX);
pkt->dts = av_rescale_q_rnd(pkt->dts, *time_base, st->time_base, AV_ROUND_NEAR_INF|AV_ROUND_PASS_MINMAX);
pkt->duration = av_rescale_q(pkt->duration, *time_base, st->time_base);
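Libavcodec also has a helper that, as far as I can tell, does those three rescales in one call (sketch, assuming *time_base is the encoder time base):
av_packet_rescale_ts(pkt, *time_base, st->time_base); // rescales pts, dts and duration together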
But in this question someone asked a question similar to mine and gave more examples, each of them doing it differently. Contrary to the answer, which says that all of those ways are fine, for me only the following approach works:
frame->pts += av_rescale_q(1, video_st->codec->time_base, video_st->time_base);
In my application I am generating video packets (H.264) at 60 fps outside the FFmpeg API and then writing them into an MP4 container.
I set explicitly:
video_st->time_base = {1,60};
video_st->r_frame_rate = {60,1};
video_st->codec->time_base = {1, 60};
The first weird thing I see happens right after I have written the header for the output format context:
AVDictionary *opts = nullptr;
int ret = avformat_write_header(mOutputFormatContext, &opts);
av_dict_free(&opts);
After that, video_st->time_base is populated with:
num = 1;
den = 15360
And I fail to understand why.
I would like someone to explain that to me, please.
Next, before writing a frame, I calculate the PTS for the packet. In my case PTS = DTS, as I don't use B-frames at all.
And I have to do this:
const int64_t duration = av_rescale_q(1, video_st->codec->time_base, video_st->time_base);
totalPTS += duration; // totalPTS is a global variable
packet->pts = totalPTS;
packet->dts = totalPTS;
av_write_frame(mOutputFormatContext, mpacket);
I don't get why the codec and the stream have different time_base values even though I explicitly set them to be the same. And because I see across all the examples that av_rescale_q is always used to calculate the duration, I really want someone to explain this point.
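For concreteness, with my numbers this is what the rescale works out to (just the arithmetic behind the av_rescale_q call above, using the 1/15360 value the muxer picked):
// codec time base = 1/60, stream time base = 1/15360 (what the muxer picked)
AVRational codec_tb  = {1, 60};
AVRational stream_tb = {1, 15360};
const int64_t duration = av_rescale_q(1, codec_tb, stream_tb); // = 1 * 15360 / 60 = 256
// so frame N ends up with pts = dts = N * 256 in the stream time base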
Additionally, as a comparison and for the sake of experiment, I decided to try writing a stream for a WebM container. In that case I don't use a libav output stream at all; I just grab the same packet I use for the MP4 encoding and write it manually into an EBML stream. In this case I calculate the duration like this:
const int64_t duration =
( video_st->codec->time_base.num / video_st->codec->time_base.den) * 1000;
Multiplication by 1000 is required for WebM, as timestamps are expressed in milliseconds in that container. And this works. So why, in the case of MP4 stream encoding, is there a difference in time_base that has to be rescaled?
This behavior from FFmpeg confuses me too. It was discussed a little by users here: http://ffmpeg.org/pipermail/libav-user/2018-January/010843.html . But the resolution there was to just deal with the 15360 time_base rather than exert control over it.
From the source pointed out by the poster in that forum topic (https://github.com/FFmpeg/FFmpeg/blob/master/libavformat/movenc.c, search for "*= 2"), it doesn't look easily avoidable as far as I can tell. It appears your choice is to let the time_base get changed, or to pick something >= 10000 and then it will not be changed.
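If you would rather keep control of the value, here is a sketch of that second option (untested; 60000 is just an arbitrary choice that is >= 10000 and a multiple of 60, so each frame is a whole number of ticks):
// Sketch: choose a stream time base that is already >= 10000 so that,
// per the movenc logic above, avformat_write_header leaves it alone.
AVRational stream_tb = {1, 60000};   // 1000 ticks per frame at 60 fps
AVRational codec_tb  = {1, 60};
video_st->time_base = stream_tb;
// ... encode ...
// then rescale packet timestamps from the codec/frame time base into it:
av_packet_rescale_ts(packet, codec_tb, video_st->time_base);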
Related
Good day, everyone.
I am working with FFmpeg and have some decoding issues that I cannot find answers to in the docs or forums.
The code is not simple, so I will first try to explain in words; maybe someone really skilled in FFmpeg will understand the issue from the description alone. If the code really helps, I will try to post it.
So, first, in general, what I want to do: I want to capture voice, encode it to MP3, get the packet, and send it over the network; on the other side: accept the packet, decode it and play it. Why not use FFmpeg streaming? Because the packet will be modified a little, and maybe encoded/encrypted, so FFmpeg has no functions for this and I have to do it manually.
What I have managed to do so far: I can encode and decode via a file. That code works fine, without any issues. So generally I can encode, decode and play MP3, and I assume my encoding/decoding code works fine.
Then I just change the code so that nothing is saved to a file, and send the packets over the network instead.
This is the standard method I use to send a packet to the decoder on the receiving side:
result = avcodec_send_packet( codecContext, networkPacket );
if( result < 0 ) {
    if( result != AVERROR(EAGAIN) ) {
        qDebug() << "Some decoding error occurred: " << result;
        return;
    }
}
networkPacket is the AVPacket restored from the network:
AVPacket* networkPacket = NULL;
networkPacket = av_packet_alloc();
...
And this is the way I restore it:
void FFmpegPlay::processNetworkPacket( MediaPacket* mediaPacket ) {
    qDebug() << "FFmpegPlay::processNetworkPacket start";

    int result;

    AVPacket* networkPacket = NULL;
    networkPacket = av_packet_alloc();
    networkPacket->size = mediaPacket->data.size();
    networkPacket->data = (uint8_t*) malloc( mediaPacket->data.size() + AV_INPUT_BUFFER_PADDING_SIZE );
    memcpy( networkPacket->data, mediaPacket->data.data(), mediaPacket->data.size() );
    networkPacket->pts = mediaPacket->pts;
    networkPacket->dts = mediaPacket->dts;
    networkPacket->flags = mediaPacket->flags;
    networkPacket->duration = mediaPacket->duration;
    networkPacket->pos = mediaPacket->pos;
    ...
And there I get -22, EINVAL, invalid argument.
The docs tell me:
AVERROR(EINVAL): codec not opened, it is an encoder, or requires flush
Well, my codec really is opened, it is a decoder, and this is the first call, so I think a flush is not required. So I assume the issue is in the packet and codec setup. I also tried different flags and always get this error. The codec just doesn't want to accept the packet.
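As a sanity check against the first two causes the docs mention, I think one can verify them explicitly right before avcodec_send_packet, for example:
// Sanity check before avcodec_send_packet: make sure the context has actually
// been opened with avcodec_open2() and really is a decoder.
if( !avcodec_is_open( codecContext ) || !av_codec_is_decoder( codecContext->codec ) ) {
    qDebug() << "Decoder context is not opened / not a decoder";
    return;
}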
So, now I have explained the situation.
And the question is: are there any special options or flags for the FFmpeg MP3 decoder to implement what is explained above? Which of them should I change?
Update:
After some testing, I decided to make a cleaner test and check whether I can decode immediately after encoding, without the network, and it looks like I can.
So it looks like in the network case the decoder should be initialized in some special way, or needs some options.
I handle initialization by copying the AVCodecParameters from the original and sending them over the network. Maybe I should change them in some special way?
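For reference, the kind of receiving-side setup I mean looks roughly like this (a sketch, not my exact code; receivedParams stands for the AVCodecParameters rebuilt from the network):
// Sketch of the receiving-side decoder setup; receivedParams is a placeholder
// for the AVCodecParameters reconstructed from the network.
const AVCodec* codec = avcodec_find_decoder( AV_CODEC_ID_MP3 );
AVCodecContext* codecContext = avcodec_alloc_context3( codec );
avcodec_parameters_to_context( codecContext, receivedParams );
if( avcodec_open2( codecContext, codec, NULL ) < 0 ) {
    qDebug() << "Failed to open MP3 decoder";
}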
I'm stuck on this one and have no idea how to deal with it, so any help is appreciated.
Currently, I am parsing WAV files and storing the samples in std::vector<int16_t> sample. Now I want to apply VAD (Voice Activity Detection) to this data to find the "regions" of voice, and more specifically the start and end of words.
The parsed WAV files are 16 kHz, 16-bit PCM, mono. My code is in C++.
I have searched a lot about it but could not find proper documentation regarding WebRTC's VAD functions.
From what I have found, the function that I need to use is WebRtcVad_Process(). Its prototype is written below:
int WebRtcVad_Process(VadInst* handle, int fs, const int16_t* audio_frame,
size_t frame_length)
From what I found here: https://stackoverflow.com/a/36826564/6487831
Each frame of audio that you send to the VAD must be 10, 20 or 30 milliseconds long.
Here's an outline of an example that assumes audio_frame is 10 ms (320 bytes) of audio at 16000 Hz:
int is_voiced = WebRtcVad_Process(vad, 16000, audio_frame, 160);
It makes sense:
1 sample = 2 bytes = 16 bits
Sample rate = 16000 samples/sec = 16 samples/ms
For 10 ms, number of samples = 160
So, based on that I have implemented this:
const int16_t * temp = sample.data();
for(int i = 0, ms = 0; i < sample.size(); i += 160, ms++)
{
    int isActive = WebRtcVad_Process(vad, 16000, temp, 160); // 10 ms window
    std::cout << ms << " ms : " << isActive << std::endl;
    temp = temp + 160; // processed 160 samples
}
Now, I am not really sure if this is correct. I am also unsure whether this gives me correct output or not.
So,
Is it possible to use the samples parsed directly from the wav files, or does it need some processing?
Am I looking at the correct function to do the job?
How to use the function to properly perform VAD on the audio stream?
Is it possible to distinguish between the spoken words?
What is the best way to check if the output I am getting is correct?
If not, what is the best way to do this task?
I'll start by saying that no, I don't think you will be able to segment an utterance into individual words using VAD. From the article on speech segmentation in Wikipedia:
One might expect that the inter-word spaces used by many written
languages like English or Spanish would correspond to pauses in their
spoken version, but that is true only in very slow speech, when the
speaker deliberately inserts those pauses. In normal speech, one
typically finds many consecutive words being said with no pauses
between them, and often the final sounds of one word blend smoothly or
fuse with the initial sounds of the next word.
That said, I'll try to answer your other questions.
You need to decode the WAV file, which could be compressed, into raw PCM audio data before running VAD. See e.g. Reading and processing WAV file data in C/C++. Alternatively, you could use something like sox to convert the WAV file to raw audio before running your code. This command will convert a WAV file of any format to 16 kHz, 16-bit PCM in the format that WebRTC VAD expects:
sox my_file.wav -r 16000 -b 16 -c 1 -e signed-integer -B my_file.raw
It looks like you are using the right function. To be more specific, you should be doing this:
#include "webrtc/common_audio/vad/include/webrtc_vad.h"
// ...
VadInst *vad;
WebRtcVad_Create(&vad);
WebRtcVad_Init(vad);
const int16_t *temp = sample.data();
// Process complete 10 ms frames only (160 samples = 320 bytes at 16 kHz);
// a trailing partial frame is skipped so we never read past the buffer.
for (size_t i = 0, ms = 0; i + 160 <= sample.size(); i += 160, ms += 10)
{
    int isActive = WebRtcVad_Process(vad, 16000, temp, 160); // 10 ms window
    std::cout << ms << " ms : " << isActive << std::endl;
    temp += 160; // advance past the 160 samples just processed
}
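Two small additions that are easy to miss: you can tune how aggressive the detector is, and the handle should be freed when you are done (a sketch, based on the functions declared in webrtc_vad.h):
WebRtcVad_set_mode(vad, 2); // aggressiveness 0..3; higher filters out more non-speech
// ... run the processing loop above ...
WebRtcVad_Free(vad);        // release the handle when finished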
To see if it's working, you can run known files and see if you get the results you expect. For example, you could start by processing silence and confirm that you never (or rarely--this algorithm is not perfect) see a voiced result come back from WebRtcVad_Process. Then you could try a file that is all silence except for one short utterance in the middle, etc. If you want to compare to an existing test, the py-webrtcvad module has a unit test that does this; see the test_process_file function.
To do word-level segmentation, you will probably need to find a speech recognition library that does it or gives you access to the information that you need to do it. E.g. this thread on the Kaldi mailing list seems to talk about how to segment by words.
I'm working on a remote desktop application and I would like to send an encoded H.264 packet over TCP, using FFmpeg for the encoding. However, I couldn't find useful info for the particular case of encoding just one frame (already in YUV444) and getting the packet.
I have several issues, the first was that:
avcodec_encode_video2
was not giving me a packet right away; I found that most of the time you get the "delayed" frames at the end. However, since this is real-time streaming, the solution was:
av_opt_set(mCodecContext->priv_data, "tune", "zerolatency", 0);
Now I get the frame, but there are several issues: it takes a while and, even worse, I get a gray video with trash pixels as the result. My configuration for the codec context:
m_pCodecCtx->bit_rate=8000000;
m_pCodecCtx->codec_id=AV_CODEC_ID_H264;
m_pCodecCtx->codec_type = AVMEDIA_TYPE_VIDEO;
m_pCodecCtx->width=1920;
m_pCodecCtx->height=1080;
m_pCodecCtx->pix_fmt=AV_PIX_FMT_YUV444P;
m_pCodecCtx->time_base.num = 1;
m_pCodecCtx->time_base.den = 25;
m_pCodecCtx->gop_size = 1;
m_pCodecCtx->keyint_min = 1;
m_pCodecCtx->i_quant_factor = float(0.71);
m_pCodecCtx->b_frame_strategy = 20;
m_pCodecCtx->qcompress = (float)0.6;
m_pCodecCtx->qmax = 51;
m_pCodecCtx->qmin = 20;
m_pCodecCtx->max_qdiff = 4;
m_pCodecCtx->refs = 4;
m_pCodecCtx->max_b_frames = 1;
m_pCodecCtx->thread_count = 1;
I would like to know how this could be done: how do I set the "I frames", and what would be optimal for "one at a time" encoding? Also, I'm not concerned with quality right now, I just need it to be fast enough (under 16 ms).
For the encoding part:
nres = avcodec_encode_video2(m_pCodecCtx, &packet, m_pFrame, &framefinished);
if(nres < 0){
    qDebug() << "error encoding: " << nres << endl;
}
if(framefinished){
    m_pFrame->pts++;
    ofstream vidout("video.h264", ios::app);
    if(vidout.good()){
        vidout.write((const char*)&packet.data[0], packet.size);
    }
    vidout.close();
    av_packet_unref(&packet);
}
I'm not using a container, just a raw file; ffplay can play raw files if the packets are right, and that's my main issue. I'm planning to send the packets over TCP and decode on the client. Any help would be greatly appreciated.
You could take a look at the source code of WebRTC.
It uses openh264 and FFmpeg to accomplish what you want.
I studied it for a while, but I can't get the latest source code at the moment.
I found this:
source code.
Hope it helps.
Turns out I had it working from the beginning; I made a very simple but important mistake: I was writing a binary file as text, so...
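In other words, the only change needed was to open the output stream in binary mode, something like:
ofstream vidout("video.h264", ios::binary | ios::app); // binary mode, not text mode
if(vidout.good()){
    vidout.write((const char*)packet.data, packet.size);
}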
Thanks for the feedback and your help
I use the ISampleGrabber SampleCB callback to get audio samples. I can get the buffer and buffer length from the IMediaSample, and I use avcodec_fill_audio_frame(frame, ost->enc->channels, ost->enc->sample_fmt, (uint8_t *)buffer, length, 0) to make an AVFrame, but this frame does not produce any audio in my muxed file! I think the length is much smaller than frame_size.
Can anyone help me, please? Or give me an example if possible.
Thank you.
This is my SampleCB code:
HRESULT AudioSampleGrabberCallBack::SampleCB(double Time, IMediaSample *pSample){
    BYTE *pBuffer;
    pSample->GetPointer(&pBuffer);
    long BufferLen = pSample->GetActualDataLength();
    muxer->PutAudioFrame(pBuffer, BufferLen);
    return S_OK; // SampleCB must return an HRESULT
}
And this is the Sample Grabber pin media type:
AM_MEDIA_TYPE pmt2;
ZeroMemory(&pmt2, sizeof(AM_MEDIA_TYPE));
pmt2.majortype = MEDIATYPE_Audio;
pmt2.subtype = FOURCCMap(0x1602);
pmt2.formattype = FORMAT_WaveFormatEx;
hr = pSampleGrabber_audio->SetMediaType(&pmt2);
After that I use the FFmpeg muxing example to process the frames, and I think I only need to change the signal-generating part of the code:
AVFrame *Muxing::get_audio_frame(OutputStream *ost, BYTE *buffer, long length)
{
    AVFrame *frame = ost->tmp_frame;
    int j, i, v;
    uint16_t *q = (uint16_t*)frame->data[0];
    int buffer_size = av_samples_get_buffer_size(NULL, ost->enc->channels,
                                                 ost->enc->frame_size,
                                                 ost->enc->sample_fmt, 0);
    // uint8_t *sample = (uint8_t *) av_malloc(buffer_size);
    av_samples_alloc(&frame->data[0], frame->linesize, ost->enc->channels, ost->enc->frame_size, ost->enc->sample_fmt, 1);
    avcodec_fill_audio_frame(frame, ost->enc->channels, ost->enc->sample_fmt, frame->data[0], buffer_size, 1);
    frame->pts = ost->next_pts;
    ost->next_pts += frame->nb_samples;
    return frame;
}
The code snippets suggest you are getting AAC data using the Sample Grabber and trying to write it into a file using FFmpeg's libavformat. This can work out.
You initialize your Sample Grabber to get audio data in WAVE_FORMAT_AAC_LATM format. This format is not so widespread, and you should review your filter graph to make sure the upstream connection on the Sample Grabber is what you expect. There is a chance that there is a weird chain of filters that pretends to produce AAC-LATM while in reality the data is invalid (or not even reaching the grabber callback). So you need to review the filter graph (see Loading a Graph From an External Process and Understanding Your DirectShow Filter Graph), then step through your callback with a debugger to make sure you get the data and that it makes sense.
Next, you are expected to initialize the AVFormatContext and AVStream to indicate that you will be writing data in AAC LATM format. The provided code does not show you doing that; the sample you are referring to uses default codecs.
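For illustration, the kind of stream setup I mean looks roughly like this (a sketch only; oc stands for your output AVFormatContext, and the values are placeholders that must match what the Sample Grabber actually delivers):
// Sketch: declare the audio stream as AAC instead of the sample's default codec.
AVStream *audio_st = avformat_new_stream(oc, NULL);
audio_st->codecpar->codec_type  = AVMEDIA_TYPE_AUDIO;
audio_st->codecpar->codec_id    = AV_CODEC_ID_AAC;   // or AV_CODEC_ID_AAC_LATM, see the related reading below
audio_st->codecpar->sample_rate = 44100;             // placeholder: must match the input
audio_st->codecpar->channels    = 2;                 // placeholder: must match the input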
Related reading: Support LATM AAC in MP4 container
Then, you need to make sure that both the incoming data and your FFmpeg output setup agree on whether the data has ADTS headers or not; the provided code does not shed any light on this.
Furthermore, I am afraid you might be preparing your audio data incorrectly. The sample in question generates raw audio data and applies an encoder to produce compressed content using avcodec_encode_audio2. Then a packet with compressed audio is sent for writing using av_interleaved_write_frame. The way you attached your code snippets to the question makes me think you are doing it wrong. For starters, you still don't show the relevant code, which makes me think you have trouble identifying which code is relevant exactly. Then, in the get_audio_frame snippet, you are dealing with your AAC data as if it were raw PCM audio, whereas you should be reading the FFmpeg sample code with the thought in mind that you already have compressed AAC data, and the sample only gets to this point after returning from the avcodec_encode_audio2 call. This is where you are supposed to merge your code and the sample.
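To make that last point concrete: since the buffer that arrives in SampleCB is already compressed AAC, it should be wrapped into a packet and handed straight to the muxer rather than pushed through get_audio_frame and the encoder again; roughly like this (a sketch with timestamp handling simplified; audio_st, oc and next_pts are assumed to come from your muxer setup):
// Sketch: write the already-compressed AAC buffer from SampleCB directly.
AVPacket *pkt = av_packet_alloc();
pkt->data = pBuffer;                  // compressed AAC from the Sample Grabber
pkt->size = BufferLen;
pkt->stream_index = audio_st->index;
pkt->pts = pkt->dts = next_pts;       // must be expressed in audio_st->time_base
av_interleaved_write_frame(oc, pkt);  // data is not refcounted, so libavformat copies it
av_packet_free(&pkt);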
I would like to produce a zerolatency live video stream and play it in VLC player with as little latency as possible.
These are the settings I currently use:
x264_param_default_preset( &m_Params, "veryfast", "zerolatency" );
m_Params.i_threads = 2;
m_Params.b_sliced_threads = true;
m_Params.i_width = m_SourceWidth;
m_Params.i_height = m_SourceHeight;
m_Params.b_intra_refresh = 1;
m_Params.b_vfr_input = true;
m_Params.i_timebase_num = 1;
m_Params.i_timebase_den = 1000;
m_Params.i_fps_num = 1;
m_Params.i_fps_den = 60;
m_Params.rc.i_vbv_max_bitrate = 512;
m_Params.rc.i_vbv_buffer_size = 256;
m_Params.rc.f_vbv_buffer_init = 1.1f;
m_Params.rc.i_rc_method = X264_RC_CRF;
m_Params.rc.f_rf_constant = 24;
m_Params.rc.f_rf_constant_max = 35;
m_Params.b_annexb = 0;
m_Params.b_repeat_headers = 0;
m_Params.b_aud = 0;
x264_param_apply_profile( &m_Params, "high" );
Using those settings, I have the following issues:
VLC shows lots of lost frames (see screenshot, "verloren", i.e. "lost"). I am not sure if this is an issue.
If I set a value < 200 ms for the network stream delay in VLC, VLC renders a few frames and then stops decoding/rendering frames.
If I set a value >= 200 ms for the network stream delay in VLC, everything looks good so far, but the latency is, obviously, 200 ms, which is too high.
Question:
Which settings (x264lib and VLC) should I use in order to encode and stream with as little latency as possible?
On your x264 settings: many are redundant, i.e. already contained in "zerolatency". However, as best I can tell, your encoding latency is nevertheless zero frames, i.e. you put one frame in and you immediately (as soon as your CPU has finished encoding it, anyway) get one frame out. It never waits for a newer frame in order to give an encoded older frame (the way it would with lookahead, for example).
On why VLC pauses unless you give it a large network delay: the problem is that your combination of rate control and VBV settings when encoding is not ideal. What you want to do for a low-latency encode is to use CBR and set the VBV buffer to the size of exactly one frame. This enables a special VBV calculation, if you look in the x264 source.
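In x264 parameter terms that looks roughly like this (numbers are only an example: 512 kbit/s at 60 fps, so the VBV buffer holds about one frame's worth of bits):
// CBR in x264 = ABR rate control with vbv_max_bitrate equal to the target bitrate.
m_Params.rc.i_rc_method       = X264_RC_ABR;
m_Params.rc.i_bitrate         = 512;        // target bitrate, kbit/s
m_Params.rc.i_vbv_max_bitrate = 512;        // same as i_bitrate -> CBR
m_Params.rc.i_vbv_buffer_size = 512 / 60;   // roughly one frame's worth of bits (kbit)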
You may also try not setting anything timing related whatsoever (no fps, no vbv) and use CRF with zerolatency. The results would depend on what container the video is packaged in for streaming.
Read this for more info: http://x264dev.multimedia.cx/archives/249
If you want to have the fastest possible encoding, then delete everything after
x264_param_default_preset( &m_Params, "veryfast", "zerolatency" );
and change veryfast to ultrafast. The rest is because of network delay + decoding.