How to reduce latency when streaming x264 - c++

I would like to produce a zero-latency live video stream and play it in VLC player with as little latency as possible.
These are the settings I currently use:
x264_param_default_preset( &m_Params, "veryfast", "zerolatency" );
m_Params.i_threads = 2;
m_Params.b_sliced_threads = true;
m_Params.i_width = m_SourceWidth;
m_Params.i_height = m_SourceHeight;
m_Params.b_intra_refresh = 1;
m_Params.b_vfr_input = true;
m_Params.i_timebase_num = 1;
m_Params.i_timebase_den = 1000;
m_Params.i_fps_num = 1;
m_Params.i_fps_den = 60;
m_Params.rc.i_vbv_max_bitrate = 512;
m_Params.rc.i_vbv_buffer_size = 256;
m_Params.rc.f_vbv_buffer_init = 1.1f;
m_Params.rc.i_rc_method = X264_RC_CRF;
m_Params.rc.f_rf_constant = 24;
m_Params.rc.f_rf_constant_max = 35;
m_Params.b_annexb = 0;
m_Params.b_repeat_headers = 0;
m_Params.b_aud = 0;
x264_param_apply_profile( &m_Params, "high" );
Using those settings, I have the following issues:
VLC shows lots of missing frames (see the screenshot, "verloren", German for "lost"). I am not sure whether this is an issue.
If I set a value < 200 ms for the network stream delay in VLC, VLC renders a few frames and then stops decoding/rendering frames.
If I set a value >= 200 ms for the network stream delay in VLC, everything looks good so far, but the latency is, obviously, 200 ms, which is too high.
Question:
Which settings (x264lib and VLC) should I use in order to encode and stream with as little latency as possible?

On your x264 settings: many are redundant, i.e. already contained in "zerolatency". However, as best as I can tell, your encoding latency is nevertheless zero frames, i.e. you put one frame in and you immediately get one frame out (as soon as your CPU has finished encoding it, anyway). It never waits for a newer frame in order to hand back an encoded older frame (the way it would with lookahead, for example).
On why VLC pauses unless you give it a large network delay: the problem is that your combination of rate control and VBV settings when encoding is not ideal. What you want for a low-latency encode is to use CBR and set the VBV buffer to the size of exactly one frame. This enables a special VBV calculation, if you look in the x264 source.
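As a sketch (keeping the 512 kbit/s cap from your settings and assuming 60 fps; the exact numbers are yours to choose), that rate-control setup would look roughly like this:
m_Params.rc.i_rc_method       = X264_RC_ABR;   // ABR with vbv_max_bitrate == bitrate behaves as CBR
m_Params.rc.i_bitrate         = 512;           // target bitrate in kbit/s
m_Params.rc.i_vbv_max_bitrate = 512;           // equal to i_bitrate -> constant bitrate
m_Params.rc.i_vbv_buffer_size = 512 / 60;      // roughly one frame's worth of data at 60 fps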
You may also try not setting anything timing related whatsoever (no fps, no vbv) and use CRF with zerolatency. The results would depend on what container the video is packaged in for streaming.
Read this for more info: http://x264dev.multimedia.cx/archives/249

If you want the fastest possible encoding, then delete everything after
x264_param_default_preset( &m_Params, "veryfast", "zerolatency" );
and change "veryfast" to "ultrafast". The rest of the delay comes from the network plus decoding.

WASAPI: Identify non-active channels on loopback recording

I have DSP software which captures the audio that is playing, using the WASAPI API in shared loopback mode.
hr = _pAudioClient->Initialize(AUDCLNT_SHAREMODE_SHARED, AUDCLNT_STREAMFLAGS_LOOPBACK, 0, 0, _pFormat, 0);
This part works fine, but now I want to be able to detect the number of channels actually playing. In other words, how can I detect whether the audio playing is stereo, 5.1, or 7.1?
The problem is:
* Since loopback has to use shared mode, there could be multiple sources playing.
* This analysis has to be done in real time; I can't wait until playback is done.
* I need to detect the difference between a channel not used at all by any playback source and a channel that is temporarily silent.
The best solution in my mind would be if I could retrieve a list of all playback sources/sub-mixes and query each of them for its number of channels. That way I wouldn't have to analyse the audio data stream itself.
Loopback recording takes place in the mix format defined on the endpoint, so regardless of what the original audio format was, you get the data in the mix format, mixed from possibly multiple played sources and converted to that shared format.
Device Formats
Loopback Recording
WASAPI loopback contains the mix of all audio being played...
The GetMixFormat method retrieves the stream format that the audio engine uses for its internal processing of shared-mode streams...
After an application has used GetMixFormat or IsFormatSupported to find an appropriate format for a shared-mode or exclusive-mode stream, the application can call the Initialize method to initialize a stream with that format. An application that attempts to initialize a shared-mode stream with a format that is not identical to the mix format obtained from the GetMixFormat method, but that has the same number of channels and the same sample rate as the mix format, is likely to succeed. Before calling Initialize, the application can call IsFormatSupported to verify that Initialize will accept the format.
That is, even though WASAPI offers some flexibility in audio format, the channel configuration and sample rate are defined by the shared format when it comes to loopback capture.
Since you are getting the mix, you cannot really identify "non-active" channels: this information is lost when mixing down to the shared format.
Also, the actual shared format can be configured interactively via the Control Panel.
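As a small sketch (reusing the _pAudioClient from the question), the mix format tells you how many channels the loopback capture will deliver, independent of what the individual sources play:
// Query the shared-mode mix format; its channel count is what loopback capture delivers.
WAVEFORMATEX *pMixFormat = nullptr;
UINT32 nMixChannels = 0;
if (SUCCEEDED(_pAudioClient->GetMixFormat(&pMixFormat)) && pMixFormat != nullptr) {
    nMixChannels = pMixFormat->nChannels;  // e.g. 2 for stereo, 6 for 5.1, 8 for 7.1
    CoTaskMemFree(pMixFormat);             // the returned format must be freed by the caller
}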
OK, I now have a solution to my problem. As far as I know you cannot detect sub-mixes in the shared mix, so the only option was to analyze the audio stream/capture buffer.
First, during my main capture loop, I set the current timestamp for all channels that are playing.
const time_t now = Date::getCurrentTimeMillis();
// Iterate over all captured frames (interleaved samples: frame i, channel j)
for (i = 0; i < numFramesAvailable; ++i) {
    for (j = 0; j < _nChannelsIn; ++j) {
        // Identify which channels are playing: any non-zero sample marks the channel as active.
        if (pCaptureBuffer[i * _nChannelsIn + j] != 0) {
            _pUsedChannels[j] = now;
        }
    }
}
Then, every second, I call this function, which evaluates whether a channel has played during the last second. Based on which channels are playing I can do conditional routing.
void checkUsedChannels() {
    const time_t now = Date::getCurrentTimeMillis();
    // Compare now against the last-used timestamp and determine the active channels
    for (size_t i = 0; i < _nChannelsIn; ++i) {
        if (now - _pUsedChannels[i] > 1000) {
            _pUsedChannels[i] = 0;
        }
    }
    // Update conditional routing
    for (const Input *pInput : _inputs) {
        pInput->evalConditions();
    }
}
It's a very simple solution, but it appears to be working.

FFMPEG:av_rescale_q - time_base difference

I want to know, once and for all, how time base calculation and rescaling work in FFmpeg.
Before getting to this question I did some research and found many contradictory answers, which made it even more confusing.
So based on official FFMPEG examples one has to
rescale output packet timestamp values from codec to stream timebase
with something like this:
pkt->pts = av_rescale_q_rnd(pkt->pts, *time_base, st->time_base, AV_ROUND_NEAR_INF|AV_ROUND_PASS_MINMAX);
pkt->dts = av_rescale_q_rnd(pkt->dts, *time_base, st->time_base, AV_ROUND_NEAR_INF|AV_ROUND_PASS_MINMAX);
pkt->duration = av_rescale_q(pkt->duration, *time_base, st->time_base);
But in this question someone was asking a similar question to mine and gave more examples, each of them doing it differently. And contrary to the answer, which says that all of those ways are fine, for me only the following approach works:
frame->pts += av_rescale_q(1, video_st->codec->time_base, video_st->time_base);
In my application I am generating video packets (H.264) at 60 fps outside the FFmpeg API and then writing them into an MP4 container.
I set explicitly:
video_st->time_base = {1,60};
video_st->r_frame_rate = {60,1};
video_st->codec->time_base = {1 ,60};
The first weird thing I see happens right after I have written the header for the output format context:
AVDictionary *opts = nullptr;
int ret = avformat_write_header(mOutputFormatContext, &opts);
av_dict_free(&opts);
After that, video_st->time_base is populated with:
num = 1;
den = 15360
And I fail to understand why; I would like someone to explain that to me.
Next, before writing a frame I calculate the PTS for the packet. In my case PTS = DTS, as I don't use B-frames at all.
And I have to do this:
const int64_t duration = av_rescale_q(1, video_st->codec->time_base, video_st->time_base);
totalPTS += duration; //totalPTS is global variable
packet->pts = totalPTS ;
packet->dts = totalPTS ;
av_write_frame(mOutputFormatContext, mpacket);
I don't get why the codec and the stream have different time_base values even though I explicitly set them to be the same. And because I see across all the examples that av_rescale_q is always used to calculate the duration, I really want someone to explain this point.
Additionally, as a comparison, and for the sake of experiment, I decided to try writing a stream for the WebM container, so I don't use the libav output stream at all.
I just grab the same packet I use for MP4 and write it manually into an EBML stream. In this case I calculate the duration like this:
const int64_t duration =
( video_st->codec->time_base.num / video_st->codec->time_base.den) * 1000;
Multiplication by 1000 is required for WebM, as the timestamps are expressed in milliseconds in that container. And this works. So why, in the case of MP4 stream encoding, is there a difference in time_base which has to be rescaled?
This behavior of ffmpeg confuses me too. It was discussed a little by users here: http://ffmpeg.org/pipermail/libav-user/2018-January/010843.html. But the resolution there was to just deal with the 15360 time_base rather than exert control over it.
From the source pointed out by the poster in that forum topic (https://github.com/FFmpeg/FFmpeg/blob/master/libavformat/movenc.c, search for "*= 2"), it doesn't look easily avoidable as far as I can tell. It appears your choices are to let the time_base get changed, or to pick something >= 10000, in which case it will not be changed.
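In practice, a robust way to live with it (a sketch reusing the names from your question; frameIndex stands in for whatever frame counter you keep) is to always rescale from the time base you generate timestamps in to whatever stream time base the muxer actually ended up with:
// Rescale from the 1/60 base the frames are generated in to the stream's actual
// time base (1/15360 for this MP4 muxer), instead of assuming the two are equal.
const AVRational encTimeBase = {1, 60};
packet->pts      = av_rescale_q(frameIndex, encTimeBase, video_st->time_base);
packet->dts      = packet->pts;                                        // no B-frames, so DTS == PTS
packet->duration = av_rescale_q(1, encTimeBase, video_st->time_base);  // one frame = 256 ticks at 1/15360
av_write_frame(mOutputFormatContext, packet);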

ffmpeg H264 Encode Frame at a time for network streaming

I'm working on a remote desktop application, and I would like to send encoded H.264 packets over TCP, using ffmpeg for the encoding. However, I couldn't find useful info for the particular case of encoding just one frame (already in YUV444) and getting the packet back.
I have several issues. The first was that
avcodec_encode_video2
was not blocking: I found that most of the time you get the "delayed" frames at the end. However, since this is real-time streaming, the solution was:
av_opt_set(mCodecContext->priv_data, "tune", "zerolatency", 0);
Now I get the frame, but there are several issues: it takes a while and, even worse, the resulting video is gray with garbage pixels. My configuration for the codec context:
m_pCodecCtx->bit_rate=8000000;
m_pCodecCtx->codec_id=AV_CODEC_ID_H264;
m_pCodecCtx->codec_type = AVMEDIA_TYPE_VIDEO;
m_pCodecCtx->width=1920;
m_pCodecCtx->height=1080;
m_pCodecCtx->pix_fmt=AV_PIX_FMT_YUV444P;
m_pCodecCtx->time_base.num = 1;
m_pCodecCtx->time_base.den = 25;
m_pCodecCtx->gop_size = 1;
m_pCodecCtx->keyint_min = 1;
m_pCodecCtx->i_quant_factor = float(0.71);
m_pCodecCtx->b_frame_strategy = 20;
m_pCodecCtx->qcompress = (float)0.6;
m_pCodecCtx->qmax = 51;
m_pCodecCtx->qmin = 20;
m_pCodecCtx->max_qdiff = 4;
m_pCodecCtx->refs = 4;
m_pCodecCtx->max_b_frames = 1;
m_pCodecCtx->thread_count = 1;
I would like to know how this could be done: how do I force I-frames, and what would be optimal for "one frame at a time" encoding? Also, I'm not concerned about quality right now; it just needs to be fast enough (under 16 ms).
For the encoding part:
nres = avcodec_encode_video2(m_pCodecCtx, &packet, m_pFrame, &framefinished);
if (nres < 0) {
    qDebug() << "error encoding: " << nres << endl;
}
if (framefinished) {
    m_pFrame->pts++;
    ofstream vidout("video.h264", ios::app);
    if (vidout.good()) {
        vidout.write((const char*)&packet.data[0], packet.size);
    }
    vidout.close();
    av_packet_unref(&packet);
}
I'm not using a container, just a raw file; ffplay plays raw files if the packets are right, and that's my principal issue. I'm planning to send the packets over TCP and decode them on the client. Any help would be greatly appreciated.
You could take a look at the source code of WebRTC.
It uses OpenH264 and ffmpeg to accomplish what you are doing.
I studied it for a while, but I can't get the latest source code at the moment.
I found this: source code.
Hope it helps.
Turns out I had it working from the beginning. I made a very simple but important mistake: I was writing a binary file as text, so...
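For reference, a minimal sketch of the fix (same variable names as the code above): open the output in binary mode so the encoded packet bytes are written unmodified.
ofstream vidout("video.h264", ios::binary | ios::app);  // binary mode: no text-mode translation
if (vidout.good()) {
    vidout.write((const char*)packet.data, packet.size);
}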
Thanks for the feedback and your help

Can a high-performance libjpeg-turbo implementation decompress/compress in <100 ms?

I'm currently implementing a JPEG resizer in C++ using the libjpeg-turbo library.
I've been given a target of 100 milliseconds for JPEG decompression and recompression using the library. The best I can achieve using the recommended optimisation settings (documented in the libjpeg-turbo usage.txt) is around 320 ms, so I'm wondering: is 100 ms even possible/realistic? This is for decompressing and recompressing an image of 3000x4000 px, from around 6 MB in size down to 130 KB.
The code that I'm using for fast decompression is:
dinfo.dct_method = JDCT_IFAST;
dinfo.do_fancy_upsampling = FALSE;
dinfo.two_pass_quantize = FALSE;
dinfo.dither_mode = JDITHER_ORDERED;
dinfo.scale_num = 1/8;
Thanks for the answers.
It is actually possible to decompress and re-compress in around 100 ms. After contacting the author of libjpeg-turbo, I learned that the dinfo.scale_num property I was using was wrong. This property is only the scale numerator; I also needed to set the scale_denom (denominator) property.
So the correct code is:
dinfo.dct_method = JDCT_IFAST;
dinfo.do_fancy_upsampling = FALSE;
dinfo.two_pass_quantize = FALSE;
dinfo.dither_mode = JDITHER_ORDERED;
dinfo.scale_num = 1;
dinfo.scale_denom = 8;
I want the code to be this fast because the image scaling should be imperceptible to the user; it's in a client application where speed and user experience are the most important things.
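For context, a minimal sketch (the function and file handling are illustrative, not from the original post) of where these settings sit in the usual libjpeg-turbo decompress sequence: they must be set after jpeg_read_header() and before jpeg_start_decompress() so the 1/8 scaling takes effect during decoding.
#include <cstdio>
#include <vector>
#include <jpeglib.h>

void decodeEighthScale(const char *path) {
    jpeg_decompress_struct dinfo;
    jpeg_error_mgr jerr;
    dinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&dinfo);

    FILE *fp = std::fopen(path, "rb");
    if (!fp) return;
    jpeg_stdio_src(&dinfo, fp);
    jpeg_read_header(&dinfo, TRUE);

    // Fast-decode settings from the answer above, applied before start_decompress:
    dinfo.dct_method          = JDCT_IFAST;
    dinfo.do_fancy_upsampling = FALSE;
    dinfo.two_pass_quantize   = FALSE;
    dinfo.dither_mode         = JDITHER_ORDERED;
    dinfo.scale_num           = 1;   // decode at scale_num/scale_denom = 1/8 of full size
    dinfo.scale_denom         = 8;

    jpeg_start_decompress(&dinfo);
    std::vector<JSAMPLE> row(dinfo.output_width * dinfo.output_components);
    JSAMPROW rowPtr = row.data();
    while (dinfo.output_scanline < dinfo.output_height) {
        jpeg_read_scanlines(&dinfo, &rowPtr, 1);
        // ...hand the row to the resizer/recompressor...
    }
    jpeg_finish_decompress(&dinfo);
    jpeg_destroy_decompress(&dinfo);
    std::fclose(fp);
}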

How to use ALSA's snd_pcm_writei()?

Can someone explain how snd_pcm_writei
snd_pcm_sframes_t snd_pcm_writei(snd_pcm_t *pcm, const void *buffer,
snd_pcm_uframes_t size)
works?
I have used it like so:
for (int i = 0; i < 1; i++) {
    f = snd_pcm_writei(handle, buffer, frames);
    ...
}
Full source code at http://pastebin.com/m2f28b578
Does this mean that I shouldn't give snd_pcm_writei() the number of all the frames in the buffer, but only
sample_rate * latency = frames
frames?
So if I e.g. have:
sample_rate = 44100
latency = 0.5 [s]
all_frames = 100000
then the number of frames that I should give to snd_pcm_writei() would be
44100 * 0.5 = 22050
and the number of iterations the for-loop should run would be
(int) 100000 / 22050 = 4, with frames = 22050,
plus one extra iteration with only
100000 mod 22050 = 11800
frames?
Is that how it works?
Louise
http://www.alsa-project.org/alsa-doc/alsa-lib/group___p_c_m.html#gf13067c0ebde29118ca05af76e5b17a9
frames should be the number of frames (samples) you want to write from the buffer. Your system's sound driver will start transferring those samples to the sound card right away, and they will be played at a constant rate.
The latency is introduced in several places. There's latency from the data buffered by the driver while waiting to be transferred to the card. There's at least one buffer full of data that's being transferred to the card at any given moment, and there's buffering on the application side, which is what you seem to be concerned about.
To reduce latency on the application side you need to write the smallest buffer that will work for you. If your application performs a DSP task, that's typically one window's worth of data.
There's no advantage in writing small buffers in a loop - just go ahead and write everything in one go - but there's an important point to understand: to minimize latency, your application should write to the driver no faster than the driver is writing data to the sound card, or you'll end up piling up more data and accumulating more and more latency.
For a design that makes producing data in lockstep with the sound driver relatively easy, look at jack (http://jackaudio.org/) which is based on registering a callback function with the sound playback engine. In fact, you're probably just better off using jack instead of trying to do it yourself if you're really concerned about latency.
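As a rough sketch of that "write everything in one go" advice (reusing the handle/buffer/frames names from the question; the underrun handling via snd_pcm_recover() is an addition, not something the original code does):
snd_pcm_sframes_t written = snd_pcm_writei(handle, buffer, frames);
if (written < 0) {
    // -EPIPE means an underrun; snd_pcm_recover() tries to recover the stream so the next write can proceed.
    written = snd_pcm_recover(handle, (int)written, 0);
}
if (written < 0) {
    fprintf(stderr, "snd_pcm_writei failed: %s\n", snd_strerror((int)written));
}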
I think the reason for the "premature" device closure is that you need to call snd_pcm_drain(handle); prior to snd_pcm_close(handle); to ensure that all data is played before the device is closed.
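In code, that is simply (a minimal sketch with the question's handle):
snd_pcm_drain(handle);  // blocks until every queued frame has been played
snd_pcm_close(handle);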
I did some testing to determine why snd_pcm_writei() didn't seem to work for me, using several examples I found in the ALSA tutorials, and what I concluded was that the simple examples were calling snd_pcm_close() before the sound device could play the complete stream sent to it.
I set the rate to 11025, used a 128-byte random buffer, and looped snd_pcm_writei() 11025/128 (about 86) times for each second of sound; two seconds required 86*2 calls to snd_pcm_writei().
In order to give the device sufficient time to convert the data to audio, I used a for loop after the snd_pcm_writei() loop to delay execution of the snd_pcm_close() call.
After testing, I had to conclude that the sample code didn't supply enough samples to overcome the device latency before snd_pcm_close() was called, which implies that the close function has less latency than snd_pcm_writei().
If the ALSA driver's start threshold is not set properly (in your case it seems to be about 2 s), then you will need to call snd_pcm_start() to start rendering the data immediately after snd_pcm_writei().
Alternatively, you may set an appropriate threshold in the software parameters of the ALSA device.
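A minimal sketch of that second option (periodSize is a placeholder for whatever period size the device was configured with):
snd_pcm_sw_params_t *swParams = nullptr;
snd_pcm_sw_params_alloca(&swParams);                                  // stack-allocated, no free needed
snd_pcm_sw_params_current(handle, swParams);                          // start from the current settings
snd_pcm_sw_params_set_start_threshold(handle, swParams, periodSize);  // start playback once this many frames are queued
snd_pcm_sw_params(handle, swParams);                                  // apply the software parameters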
ref:
http://www.alsa-project.org/alsa-doc/alsa-lib/group___p_c_m.html
http://www.alsa-project.org/alsa-doc/alsa-lib/group___p_c_m___s_w___params.html