Why do the x264/x265 codecs ignore the pts and dts of the input frame? - c++

I'm trying to encode images from a webcam with libx265 (I tried libx264 earlier)... The webcam cannot shoot at a stable FPS, because the amount of light reaching the sensor varies and so do the resulting delays. Therefore, I calculate the fps and dts of each incoming frame, set these values on the corresponding fields of the x265_picture object, and initialize the encoder with fpsNum = 1000 and fpsDenom = 1 (for a millisecond timebase).
The problem is that the encoder ignores the pts and dts of the input image and encodes at 1000 fps! The same timebase trick produces a smooth recording with libvpx. Why does it not work with the x264/x265 codecs?
Here is the parameter initialization:
...
error = (x265_param_default_preset(param, "fast", "zerolatency") != 0);
if(!error){
param->sourceWidth = width;
param->sourceHeight = height;
param->frameNumThreads = 1;
param->fpsNum = 1000;
param->fpsDenom = 1;
// Intra refresh:
param->keyframeMax = 15;
param->intraRefine = 1;
// Rate control:
param->rc.rateControlMode = X265_RC_CQP;
param->rc.rfConstant = 12;
param->rc.rfConstantMax = 48;
// For streaming:
param->bRepeatHeaders = 1;
param->bAnnexB = 1;
encoder = x265_encoder_open(param);
...
}
...
Here is the function that adds a frame:
bool hevc::Push(unsigned char *data){
if(!error){
std::lock_guard<std::mutex> lock(m_framestack);
if( timer > 0){
framestack.back()->dts = clock() - timer;
timer+= framestack.back()->dts;
}
else{timer = clock();}
x265_picture *picture = x265_picture_alloc();
if( picture){
x265_picture_init(param, picture);
picture->height = param->sourceHeight;
picture->stride[0] = param->sourceWidth;
picture->stride[1] = picture->stride[2] = picture->stride[0] / 2;
picture->planes[0] = new char[ luma_size];
picture->planes[1] = new char[chroma_size];
picture->planes[2] = new char[chroma_size];
colorspaces::BGRtoI420(param->sourceWidth, param->sourceHeight, data, (byte*)picture->planes[0], (byte*)picture->planes[1], (byte*)picture->planes[2]);
picture->pts = picture->dts = 0;
framestack.emplace_back(picture);
}
else{error = true;}
}
return !error;
}
The global PTS is incremented right after the x265_encoder_encode call:
pts+= pic_in->dts; and it is set as the pts of the next image from the framestack queue when that image reaches the encoder.
Can the x265/x264 codecs encode at a variable fps? If so, how do I configure it?

I don't know about x265, but in x264, to encode variable frame rate (VFR) video you should enable the x264_param_t.b_vfr_input option, which was disabled by your zerolatency tuning (VFR encoding needs 1 frame of latency). Also, at least in x264, the timebase should go in i_timebase_num/i_timebase_den, and i_fps_num/i_fps_den should be the average fps (or keep the default 25/1 if you don't know the fps); otherwise you will break rate control.
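To illustrate the timebase advice, here is a minimal sketch of deriving millisecond PTS values from capture times. PtsClock and pts_for are hypothetical helper names, not part of x264; the x264 fields named in the trailing comment (b_vfr_input, i_timebase_num, i_timebase_den, i_fps_num, i_fps_den, i_pts) come from x264.h, and the 20 fps average is only an assumed example value.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Converts capture times into presentation timestamps expressed in
// timebase ticks, where one tick lasts timebase_num / timebase_den seconds.
class PtsClock {
public:
    using Clock = std::chrono::steady_clock;

    PtsClock(std::int64_t timebase_num, std::int64_t timebase_den)
        : num_(timebase_num), den_(timebase_den) {
        assert(num_ > 0 && den_ > 0);
    }

    // Returns the PTS of a frame captured at `now`.
    // The first frame passed in defines time zero.
    std::int64_t pts_for(Clock::time_point now) {
        if (!started_) {
            start_ = now;
            started_ = true;
        }
        const std::int64_t elapsed_us =
            std::chrono::duration_cast<std::chrono::microseconds>(now - start_).count();
        // ticks = seconds * den / num, with seconds = elapsed_us / 1e6
        return elapsed_us * den_ / (num_ * 1000000);
    }

private:
    std::int64_t num_;
    std::int64_t den_;
    Clock::time_point start_{};
    bool started_ = false;
};

// Intended use with x264 (not compiled here). Set these after
// x264_param_default_preset, because the zerolatency tune clears b_vfr_input:
//   param.b_vfr_input    = 1;
//   param.i_timebase_num = 1;
//   param.i_timebase_den = 1000;  // millisecond ticks
//   param.i_fps_num      = 20;    // assumed average rate
//   param.i_fps_den      = 1;
//   PtsClock clock(1, 1000);
//   pic_in.i_pts = clock.pts_for(PtsClock::Clock::now());
```

With a 1/1000 timebase, a frame captured 47 ms after the first one gets a PTS of 47, so irregular capture intervals are carried into the stream instead of being flattened to a fixed rate.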

Related

stereo ping pong delay c++

I have to create a stereo ping pong delay with these parameters.
• Delay Time (0 – 3000 milliseconds)
• Feedback (0 – 0.99)
• Wet / Dry Mix (0 – 1.0)
I have managed to implement the stereo in/out and the 3 parameters, but I'm struggling with how to implement the ping pong. I have this code in the process block, but it only replays the left and right in the opposite channels once. Is there a simple way to loop this so that it repeats over and over rather than just once, or is this not the best way to implement ping pong? Any help would be great!
//ping pong implementation
for (int i = 0; i < buffer.getNumSamples(); i++)
{
// Reduce the amplitude of each sample in the block for the
// left and right channels
//channelDataLeft[i] = channelDataLeft[i] * 0.5;
// channelDataRight[i] = channelDataRight[i] * 0.25;
if (i % 2 == 1) //if i is odd this will play
{
// Calculate the next output sample (current input sample + delayed version)
float outputSampleLeft = (channelDataLeft[i] + (mix * delayDataLeft[readIndex]));
float outputSampleRight = (channelDataRight[i] + (mix * delayDataRight[readIndex]));
// Write the current input into the delay buffer along with the delayed sample
delayDataLeft[writeIndex] = channelDataLeft[i] + (delayDataLeft[readIndex] * feedback);
delayDataRight[writeIndex] = channelDataRight[i] + (delayDataRight[readIndex] * feedback);
// Increment read and write index, check to see if it's greater than buffer length
// if yes, wrap back around to zero
if (++readIndex >= delayBufferLength)
readIndex = 0;
if (++writeIndex >= delayBufferLength)
writeIndex = 0;
// Assign output sample computed above to the output buffer
channelDataLeft[i] = outputSampleLeft;
channelDataRight[i] = outputSampleRight;
}
else //if i is even then this will play
{
// Calculate the next output sample (current input sample + delayed version swapped around from if)
float outputSampleLeft = (channelDataLeft[i] + (mix * delayDataRight[readIndex]));
float outputSampleRight = (channelDataRight[i] + (mix * delayDataLeft[readIndex]));
// Write the current input into the delay buffer along with the delayed sample
delayDataLeft[writeIndex] = channelDataLeft[i] + (delayDataLeft[readIndex] * feedback);
delayDataRight[writeIndex] = channelDataRight[i] + (delayDataRight[readIndex] * feedback);
// Increment read and write index, check to see if it's greater than buffer length
// if yes, wrap back around to zero
if (++readIndex >= delayBufferLength)
readIndex = 0;
if (++writeIndex >= delayBufferLength)
writeIndex = 0;
// Assign output sample computed above to the output buffer
channelDataLeft[i] = outputSampleLeft;
channelDataRight[i] = outputSampleRight;
}
}
I'm not really sure why you have the modulo and different behavior based on the sample index. A ping-pong delay should have two delay buffers, one for each channel. The input of one stereo channel plus the feedback from the opposite channel's delay buffer should be fed into each delay.
Here is a good image of the audio signal graph of it:
Here is some pseudo-code of the logic:
float wetDryMix = 0.5f;
float wetFactor = wetDryMix;
float dryFactor = 1.0f - wetDryMix;
float feedback = 0.6f;
int sampleRate = 44100;
int sampleCount = sampleRate * 10;
float[] leftInSamples = new float[sampleCount];
float[] rightInSamples = new float[sampleCount];
float[] leftOutSamples = new float[sampleCount];
float[] rightOutSamples = new float[sampleCount];
int delayBufferSize = sampleRate * 3;
float[] delayBufferLeft = new float[delayBufferSize];
float[] delayBufferRight = new float[delayBufferSize];
int delaySamples = sampleRate / 2;
int delayReadIndex = 0;
int delayWriteIndex = delaySamples;
for(int sampleIndex = 0; sampleIndex < sampleCount; sampleIndex++) {
//Read samples in from input
float leftChannel = leftInSamples[sampleIndex];
float rightChannel = rightInSamples[sampleIndex];
//Make sure delay ring buffer indices are within range
delayReadIndex = delayReadIndex % delayBufferSize;
delayWriteIndex = delayWriteIndex % delayBufferSize;
//Get the current output of delay ring buffer
float delayOutLeft = delayBufferLeft[delayReadIndex];
float delayOutRight = delayBufferRight[delayReadIndex];
//Calculate what is put into delay buffer. It is the current input signal plus the delay output attenuated by the feedback factor
//Notice that the right delay output is fed into the left delay and vice versa
//In this version sound from each stereo channel will ping pong back and forth
float delayInputLeft = leftChannel + delayOutRight * feedback;
float delayInputRight = rightChannel + delayOutLeft * feedback;
//Alternatively you could use a mono signal that is pushed to one delay channel along with the current feedback delay
//This will ping-pong a mixed mono signal between channels
//float delayInputLeft = leftChannel + rightChannel + delayOutRight * feedback;
//float delayInputRight = delayOutLeft * feedback;
//Push the calculated delay value into the delay ring buffers
delayBufferLeft[delayWriteIndex] = delayInputLeft;
delayBufferRight[delayWriteIndex] = delayInputRight;
//Calculate resulting output by mixing the dry input signal with the current delayed output
float outputLeft = leftChannel * dryFactor + delayOutLeft * wetFactor;
float outputRight = rightChannel * dryFactor + delayOutRight * wetFactor;
leftOutSamples[sampleIndex] = outputLeft;
rightOutSamples[sampleIndex] = outputRight;
//Increment ring buffer indices
delayReadIndex++;
delayWriteIndex++;
}
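The pseudo-code above can be repackaged as a small C++ class that processes audio block by block, which is how a plugin's process block would call it. This is only a sketch: PingPongDelay and its members are hypothetical names, and it uses a single ring-buffer index whose buffer length equals the delay, rather than the separate read and write indices shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stereo ping-pong delay: each channel's delay line is fed with that
// channel's input plus the attenuated output of the opposite delay line.
class PingPongDelay {
public:
    PingPongDelay(std::size_t delaySamples, float feedback, float mix)
        : left_(delaySamples, 0.0f), right_(delaySamples, 0.0f),
          feedback_(feedback), mix_(mix) {
        assert(delaySamples > 0);
    }

    // Processes one block in place; state is kept between calls.
    void process(float* leftIO, float* rightIO, std::size_t numSamples) {
        for (std::size_t i = 0; i < numSamples; ++i) {
            const float inL = leftIO[i];
            const float inR = rightIO[i];

            // The slot about to be overwritten holds the sample written
            // delaySamples ago, i.e. the delayed signal.
            const float outL = left_[pos_];
            const float outR = right_[pos_];

            // Cross feedback: right output feeds the left line and vice versa.
            left_[pos_] = inL + outR * feedback_;
            right_[pos_] = inR + outL * feedback_;

            // Mix the dry input with the delayed (wet) signal.
            leftIO[i] = inL * (1.0f - mix_) + outL * mix_;
            rightIO[i] = inR * (1.0f - mix_) + outR * mix_;

            pos_ = (pos_ + 1) % left_.size();
        }
    }

private:
    std::vector<float> left_;
    std::vector<float> right_;
    float feedback_;
    float mix_;
    std::size_t pos_ = 0;
};
```

With a 4-sample delay, feedback 0.5 and a fully wet mix, a single impulse on the left input comes back on the left after 4 samples, on the right after 8 at half amplitude, and on the left again after 12 at a quarter, so the echo keeps bouncing until the feedback has attenuated it.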

How to set variable FPS in libx264 and what encoder parameters to use?

I'm trying to encode webcam frames with libx264 in real time, and I've run into one problem: the resulting video length is exactly what I set, but the camera sometimes lags, so the real capture time is longer than the video length. As a result, the picture in the video changes too fast. I think this is due to the constant FPS in the x264 settings, so I need to make it dynamic somehow. Is that possible? If I'm wrong about the FPS, what do I need to do to synchronize capturing and writing?
I would also like to know the optimal encoder parameters for streaming over the internet and for recording to disk (the client streams from a camera or screen, and the server records).
Here are a screenshot of the console logs and my code:
#include <stdint.h>
#include "stringf.h"
#include "Capture.h"
#include "x264.h"
int main( int argc, char **argv ){
Camera instance;
if(!instance.Enable(0)){printf("Camera not available\n");return 1;}
// Initializing metrics and buffer of frame
unsigned int width, height, size = instance.GetMetrics(width, height);
unsigned char *data = (unsigned char *)malloc(size);
// Setting encoder (I'm not sure about all parameters)
x264_param_t param;
x264_param_default_preset(&param, "ultrafast", "zerolatency");
param.i_threads = 1;
param.i_width = width;
param.i_height = height;
param.i_fps_num = 20;
param.i_fps_den = 1;
// Intra refresh:
param.i_keyint_max = 8;
param.b_intra_refresh = 1;
// Rate control:
param.rc.i_rc_method = X264_RC_CRF;
param.rc.f_rf_constant = 25;
param.rc.f_rf_constant_max = 35;
// For streaming:
param.b_repeat_headers = 1;
param.b_annexb = 1;
x264_param_apply_profile(&param, "baseline");
x264_t* encoder = x264_encoder_open(&param);
int seconds, expected_time, operation_start, i_nals, frame_size, frames_count;
expected_time = 1000/param.i_fps_num;
operation_start = 0;
seconds = 1;
frames_count = param.i_fps_num * seconds;
int *Timings = new int[frames_count];
x264_picture_t pic_in, pic_out;
x264_nal_t* nals;
x264_picture_alloc(&pic_in, X264_CSP_I420, param.i_width, param.i_height);
// Capture-Encode-Write loop
for(int i = 0; i < frames_count; i++){
operation_start = GetTickCount();
size = instance.GrabBGR(&data);
instance.BGRtoI420(data, &pic_in.img.plane[0], &pic_in.img.plane[1], &pic_in.img.plane[2], param.i_width, param.i_height);
frame_size = x264_encoder_encode(encoder, &nals, &i_nals, &pic_in, &pic_out);
if( frame_size > 0){
stringf::WriteBufferToFile("test.h264",std::string(reinterpret_cast<char*>(nals->p_payload), frame_size),1);
}
Timings[i] = GetTickCount() - operation_start;
}
while( x264_encoder_delayed_frames( encoder ) ){ // Flush delayed frames
frame_size = x264_encoder_encode(encoder, &nals, &i_nals, NULL, &pic_out);
if( frame_size > 0 ){stringf::WriteBufferToFile("test.h264",std::string(reinterpret_cast<char*>(nals->p_payload), frame_size),1);}
}
unsigned int total_time = 0;
printf("Expected operation time was %d ms per frame at %u FPS\n",expected_time, param.i_fps_num);
for(unsigned int i = 0; i < frames_count; i++){
total_time += Timings[i];
printf("Frame %u takes %d ms\n",(i+1), Timings[i]);
}
printf("Record takes %u ms\n",total_time);
free(data);
x264_encoder_close( encoder );
x264_picture_clean( &pic_in );
return 0;
}
The capture takes 1453 ms, but the output file plays for exactly 1 second.
So, in general, the video length must match the capture time, not whatever the encoder "wants". How can I do that?

Non-audible videos with libwebm (VP8/Opus) -- Syncing audio --

I am trying to create a very simple webm(vp8/opus) encoder, however I can not get the audio to work.
ffprobe does detect the file format and duration
Stream #1:0(eng): Audio: opus, 48000 Hz, mono, fltp (default)
The video plays just fine in VLC and Chrome, but with no audio; for some reason the audio input bitrate is always 0.
Most of the audio encoding code was copied from
https://github.com/fnordware/AdobeWebM/blob/master/src/premiere/WebM_Premiere_Export.cpp
Here is the relevant code:
static const long long kTimeScale = 1000000000LL;
MkvWriter writer;
writer.Open("video.webm");
Segment mux_seg;
mux_seg.Init(&writer);
// VPX encoding...
int16_t pcm[SAMPLES];
uint64_t audio_track_id = mux_seg.AddAudioTrack(SAMPLE_RATE, 1, 0);
mkvmuxer::AudioTrack *audioTrack = (mkvmuxer::AudioTrack*)mux_seg.GetTrackByNumber(audio_track_id);
audioTrack->set_codec_id(mkvmuxer::Tracks::kOpusCodecId);
audioTrack->set_seek_pre_roll(80000000);
OpusEncoder *encoder = opus_encoder_create(SAMPLE_RATE, 1, OPUS_APPLICATION_AUDIO, NULL);
opus_encoder_ctl(encoder, OPUS_SET_BITRATE(64000));
opus_int32 skip = 0;
opus_encoder_ctl(encoder, OPUS_GET_LOOKAHEAD(&skip));
audioTrack->set_codec_delay(skip * kTimeScale / SAMPLE_RATE);
mux_seg.CuesTrack(audio_track_id);
uint64_t currentAudioSample = 0;
uint64_t opus_ts = 0;
while(has_frame) {
int bytes = opus_encode(encoder, pcm, SAMPLES, out, SAMPLES * 8);
opus_ts = currentAudioSample * kTimeScale / SAMPLE_RATE;
mux_seg.AddFrame(out, bytes, audio_track_id, opus_ts, true);
currentAudioSample += SAMPLES;
}
opus_encoder_destroy(encoder);
mux_seg.Finalize();
writer.Close();
Update #1:
It seems the problem is that WebM requires the audio and video tracks to be interleaved.
However, I cannot figure out how to sync the audio.
Should I calculate the frame duration, then encode the equivalent number of audio samples?
The problem was that I was missing the Ogg header data, and the audio frame timestamps were not accurate.
To complete the answer, here is the pseudo code for the encoder.
const int64_t kTicksPerSecond = 1000000000; // WebM timestamps are in nanoseconds
const int64_t kTimeScale = kTicksPerSecond / FPS; // duration of one video frame
init_opus_encoder();
audioTrack->set_seek_pre_roll(80000000); // 80 ms, in nanoseconds
audioTrack->set_codec_delay(opus_preskip); // converted to nanoseconds, as in the question
audioTrack->SetCodecPrivate(ogg_header_data, ogg_header_size);
int64_t audio_pts = 0;
int64_t curr_audio_samples = 0;
while(has_video_frame) {
encode_vpx_frame();
video_pts = frame_index * kTimeScale;
muxer_segment.addFrame(frame_packet_data, packet_length, video_track_id, video_pts, packet_flags);
// fill the video frame's gap with Opus audio packets
while(audio_pts < video_pts + kTimeScale) {
encode_opus_frame();
muxer_segment.addFrame(opus_frame_data, opus_frame_data_length, audio_track_id, audio_pts, true /* keyframe */);
// advance the sample counter first, so the next packet gets the next timestamp
curr_audio_samples += 960; // 20 ms of audio at 48 kHz
audio_pts = curr_audio_samples * kTicksPerSecond / 48000;
}
}
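The interleaving loop in the pseudo code above can be checked without any codec or muxer by computing only the timestamps. The following is a sketch under that assumption: Packet, audio_pts_ns and interleave are hypothetical names, and no libwebm or Opus calls are made. It uses 64-bit arithmetic because a sample count multiplied by 1,000,000,000 does not fit in a 32-bit int.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::int64_t kNanosPerSecond = 1000000000LL;

// Timestamp, in nanoseconds, of an audio packet whose first sample
// has index `first_sample` at the given sample rate.
std::int64_t audio_pts_ns(std::int64_t first_sample, std::int64_t sample_rate) {
    assert(sample_rate > 0);
    return first_sample * kNanosPerSecond / sample_rate;
}

struct Packet {
    bool is_video;
    std::int64_t pts_ns;
};

// Emits one video packet per frame, followed by every audio packet that
// starts before the next video frame, so both tracks advance together.
std::vector<Packet> interleave(int video_frames, int fps,
                               std::int64_t sample_rate,
                               std::int64_t samples_per_packet) {
    assert(fps > 0 && samples_per_packet > 0);
    std::vector<Packet> out;
    std::int64_t audio_samples = 0;
    for (int frame = 0; frame < video_frames; ++frame) {
        const std::int64_t video_pts = frame * kNanosPerSecond / fps;
        const std::int64_t next_video_pts = (frame + 1) * kNanosPerSecond / fps;
        out.push_back({true, video_pts});
        while (audio_pts_ns(audio_samples, sample_rate) < next_video_pts) {
            out.push_back({false, audio_pts_ns(audio_samples, sample_rate)});
            audio_samples += samples_per_packet;
        }
    }
    return out;
}
```

At 25 fps with 960-sample packets at 48 kHz, each 40 ms video frame is followed by two 20 ms audio packets, and the timestamps in the resulting sequence never go backwards.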

How to record the microphone until there is no sound?

I've created 2 functions :
- One that records the microphone
- One that plays the sound of the microphone
It records the microphone for 3 seconds
#include <iostream>
#include <Windows.h>
#include <vector>
using namespace std;
#pragma comment(lib, "winmm.lib")
short int waveIn[44100 * 3];
void PlayRecord();
void StartRecord()
{
const int NUMPTS = 44100 * 3; // 3 seconds
int sampleRate = 44100;
// 'short int' is a 16-bit type; I request 16-bit samples below
// for 8-bit capture, you'd use 'unsigned char' or 'BYTE' 8-bit types
HWAVEIN hWaveIn;
MMRESULT result;
WAVEFORMATEX pFormat;
pFormat.wFormatTag=WAVE_FORMAT_PCM; // simple, uncompressed format
pFormat.nChannels=1; // 1=mono, 2=stereo
pFormat.nSamplesPerSec=sampleRate; // 44100
pFormat.nAvgBytesPerSec=sampleRate*2; // = nSamplesPerSec * n.Channels * wBitsPerSample/8
pFormat.nBlockAlign=2; // = n.Channels * wBitsPerSample/8
pFormat.wBitsPerSample=16; // 16 for high quality, 8 for telephone-grade
pFormat.cbSize=0;
// Specify recording parameters
result = waveInOpen(&hWaveIn, WAVE_MAPPER,&pFormat,
0L, 0L, WAVE_FORMAT_DIRECT);
WAVEHDR WaveInHdr;
// Set up and prepare header for input
WaveInHdr.lpData = (LPSTR)waveIn;
WaveInHdr.dwBufferLength = NUMPTS*2;
WaveInHdr.dwBytesRecorded=0;
WaveInHdr.dwUser = 0L;
WaveInHdr.dwFlags = 0L;
WaveInHdr.dwLoops = 0L;
waveInPrepareHeader(hWaveIn, &WaveInHdr, sizeof(WAVEHDR));
// Insert a wave input buffer
result = waveInAddBuffer(hWaveIn, &WaveInHdr, sizeof(WAVEHDR));
// Commence sampling input
result = waveInStart(hWaveIn);
cout << "recording..." << endl;
Sleep(3 * 1000);
// Wait until finished recording
waveInClose(hWaveIn);
PlayRecord();
}
void PlayRecord()
{
const int NUMPTS = 44100 * 3; // 3 seconds
int sampleRate = 44100;
// 'short int' is a 16-bit type; I request 16-bit samples below
// for 8-bit capture, you'd use 'unsigned char' or 'BYTE' 8-bit types
HWAVEIN hWaveIn;
WAVEFORMATEX pFormat;
pFormat.wFormatTag=WAVE_FORMAT_PCM; // simple, uncompressed format
pFormat.nChannels=1; // 1=mono, 2=stereo
pFormat.nSamplesPerSec=sampleRate; // 44100
pFormat.nAvgBytesPerSec=sampleRate*2; // = nSamplesPerSec * n.Channels * wBitsPerSample/8
pFormat.nBlockAlign=2; // = n.Channels * wBitsPerSample/8
pFormat.wBitsPerSample=16; // 16 for high quality, 8 for telephone-grade
pFormat.cbSize=0;
// Specify recording parameters
waveInOpen(&hWaveIn, WAVE_MAPPER,&pFormat, 0L, 0L, WAVE_FORMAT_DIRECT);
WAVEHDR WaveInHdr;
// Set up and prepare header for input
WaveInHdr.lpData = (LPSTR)waveIn;
WaveInHdr.dwBufferLength = NUMPTS*2;
WaveInHdr.dwBytesRecorded=0;
WaveInHdr.dwUser = 0L;
WaveInHdr.dwFlags = 0L;
WaveInHdr.dwLoops = 0L;
waveInPrepareHeader(hWaveIn, &WaveInHdr, sizeof(WAVEHDR));
HWAVEOUT hWaveOut;
cout << "playing..." << endl;
waveOutOpen(&hWaveOut, WAVE_MAPPER, &pFormat, 0, 0, WAVE_FORMAT_DIRECT);
waveOutWrite(hWaveOut, &WaveInHdr, sizeof(WaveInHdr)); // Playing the data
Sleep(3 * 1000); //Sleep for as long as there was recorded
waveInClose(hWaveIn);
waveOutClose(hWaveOut);
}
int main()
{
StartRecord();
return 0;
}
How can I change my StartRecord function (and I guess my PlayRecord function as well) to make it record until there's no input from the microphone?
(So far, these 2 functions work perfectly: they record the microphone for 3 seconds, then play the recording.)
Thanks!
Edit: by no sound, I mean the sound level is too low (meaning the person probably isn't speaking).
Because sound is a wave, it oscillates between high and low pressures. This waveform is usually recorded as positive and negative numbers, with zero being the neutral pressure. If you take the absolute value of the signal and keep a running average it should be sufficient.
The average should be taken over a long enough period that you account for the appropriate amount of silence. A very cheap way to keep an estimate of the running average is like this:
const double threshold = 50; // Whatever threshold you need
const int max_samples = 10000; // The representative running average size
double average = 0; // The running average
int sample_count = 0; // When we are building the average
while( sample_count < max_samples || average > threshold ) {
// New sample arrives, stored in 'sample'
// Adjust the running absolute average
if( sample_count < max_samples ) sample_count++;
average *= double(sample_count-1) / sample_count;
// Convert to double first: with an integer 'sample' the division would truncate
average += std::abs(double(sample)) / sample_count;
}
The larger max_samples, the slower average will respond to a signal. After the sound stops, it will slowly trail off. However, it will be slow to rise again too. This would be fine for reasonably continuous sound.
With something like speech, which can have short or long pauses, you may want to use an impulse-based approach. You can just define the number of samples of 'silence' that you expect, and reset it whenever you receive an impulse that exceeds the threshold. Using the running average above with a much shorter window size will give you a simple way of detecting an impulse. Then you just need to count...
const int max_samples = 100; // Smaller window size for impulse
const int max_silence_samples = 10000; // Maximum samples below threshold
int silence = 0; // Number of samples below threshold
while( silence < max_silence_samples ) {
// Compute running average as before
//...
// Check for silence. If there's a signal, reset the counter.
if( average > threshold ) silence = 0;
else ++silence;
}
Adjusting threshold and max_samples will control the sensitivity to pops and clicks, while max_silence_samples gives you control over how much silence is allowed before you stop recording.
There are undoubtedly more technical ways to achieve your goals, but it's always good to try the simple one first. See how you go with this.
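Putting the short-window average and the silence counter together, a self-contained sketch could look like the class below. SilenceDetector and push are hypothetical names, and the 16-bit sample type is assumed from the question's `short int` buffer; the threshold and window sizes still have to be tuned by ear.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Tracks a running average of the absolute sample value and counts how many
// consecutive samples have arrived while that average is below the threshold.
class SilenceDetector {
public:
    SilenceDetector(double threshold, std::size_t window, std::size_t maxSilence)
        : threshold_(threshold), window_(window), maxSilence_(maxSilence) {
        assert(window > 0 && maxSilence > 0);
    }

    // Feeds one 16-bit sample. Returns true once maxSilence consecutive
    // samples have been seen with the average at or below the threshold.
    bool push(short sample) {
        if (count_ < window_) ++count_;
        average_ *= double(count_ - 1) / double(count_);
        average_ += std::abs(double(sample)) / double(count_);

        if (average_ > threshold_) {
            silence_ = 0;  // signal present: restart the silence count
        } else {
            ++silence_;
        }
        return silence_ >= maxSilence_;
    }

private:
    double threshold_;
    std::size_t window_;
    std::size_t maxSilence_;
    std::size_t count_ = 0;
    std::size_t silence_ = 0;
    double average_ = 0.0;
};
```

A recording loop would call push for every captured sample and stop when it returns true. Note that the average needs a few samples to decay after the sound stops, so the effective silence time is slightly longer than maxSilence samples.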
I suggest doing it via DirectShow. You should create instances of the microphone, a SampleGrabber, an audio encoder, and a file writer. Your graph should look like this:
Microphone -> SampleGrabber -> Audio Encoder -> File Writer
Every sample passes through the SampleGrabber, so you can read all the raw samples and check whether you should continue recording or not. This way you can both record the audio and check its contents.

Resample audio using libsamplerate in windows phone

I'm trying to resample captured 2-channel/48 kHz/32-bit audio to 1-channel/8 kHz/32-bit using libsamplerate in a Windows Phone project that uses WASAPI.
I need to get 160 frames from the 960 original frames by resampling. After capturing audio with the GetBuffer method, I send the captured 7680-byte BYTE array to the method below:
void BackEndAudio::ChangeSampleRate(BYTE* buf)
{
int er2;
st=src_new(2,1,&er2);
//SRC_DATA sd defined before
sd=new SRC_DATA;
BYTE *onechbuf = new BYTE[3840];
int outputIndex = 0;
//convert Stereo to Mono
for (int n = 0; n < 7680; n+=8)
{
onechbuf[outputIndex++] = buf[n];
onechbuf[outputIndex++] = buf[n+1];
onechbuf[outputIndex++] = buf[n+2];
onechbuf[outputIndex++] = buf[n+3];
}
float *res1=new float[960];
res1=(float *)onechbuf;
float *res2=new float[160];
//change samplerate
sd->data_in=res1;
sd->data_out=res2;
sd->input_frames=960;
sd->output_frames=160;
sd->src_ratio=(double)1/6;
sd->end_of_input=1;
int er=src_process(st,sd);
transportController->WriteAudio((BYTE *)res2,640);
delete[] onechbuf;
src_delete(st);
delete sd;
}
src_process returns no error, sd->input_frames_used is set to 960 and sd->output_frames_gen is set to 159, but the rendered output is only noise.
I use the code in a real-time VoIP app.
What could be the source of the problem?
I found the problem. I shouldn't create a new SRC_STATE object and delete it on every call of my function with st=src_new(2,1,&er2); and src_delete(st);. Calling them once is enough for the whole resampling session. There is also no need to use a pointer for the SRC_DATA. I modified my code as below and it works fine now.
void BackEndAudio::ChangeSampleRate(BYTE* buf)
{
BYTE *onechbuf = new BYTE[3840];
int outputIndex = 0;
//convert Stereo to Mono
for (int n = 0; n < 7680; n+=8)
{
onechbuf[outputIndex++] = buf[n];
onechbuf[outputIndex++] = buf[n+1];
onechbuf[outputIndex++] = buf[n+2];
onechbuf[outputIndex++] = buf[n+3];
}
float *out=new float[160];
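//change samplerate; st (SRC_STATE*) and sd (SRC_DATA) are now created once, outside this function
sd.data_in=(float *)onechbuf;
sd.data_out=out;
sd.input_frames=960;
sd.output_frames=160;
sd.src_ratio=(double)1/6;
sd.end_of_input=0; // more audio follows, so the converter keeps its state between calls
int er=src_process(st,&sd);
}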