openAL - choppy sound when playing the buffers - c++

I have a voice chat which receives rtp packets (each packet contains 20ms of voice afaik), adds them to a buffer and plays it out.
If I call alSourcePlay() directly after buffering a packet (I have 5 buffers and each buffer gets one packet; the buffers are re-used once their packets have been played), the sound will be 'choppy', since it will play out the buffer before another packet arrives.
My question is: how do you deal with this so that the audio doesn't sound choppy?

If you are, on average, getting fewer than 50 20ms packets per second then there have to be pauses somewhere. If you store the packets for a while before playing them, then you can look for natural pauses (silence) and merge the gaps into those pauses so things sound more natural. The more you store the better playback will sound, but store too much and the delay will become unpleasant.
The amount of buffering you need is a matter of taste: which is uglier, choppy sound or a delayed response? I guess you will have to make it a variable and then experiment to find the 'happy medium'.
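A minimal sketch of the "store packets for a while" idea, kept independent of any audio API. The class name JitterBuffer, the prebuffer parameter and the Packet type are assumptions made for illustration; they do not come from the question. With OpenAL, a packet returned by pop() would typically be copied into a buffer with alBufferData and queued with alSourceQueueBuffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// One decoded RTP payload of 16-bit PCM (20 ms per the question; the
// container type is an assumption of this sketch).
using Packet = std::vector<std::int16_t>;

// Hypothetical helper: holds packets back until `prebuffer` of them are
// queued, then releases one per playback tick. If the queue runs dry it
// goes back to holding, so the next burst is smoothed again.
class JitterBuffer {
public:
    explicit JitterBuffer(std::size_t prebuffer) : prebuffer_(prebuffer) {
        assert(prebuffer > 0);
    }

    // Call when a packet arrives from the network.
    void push(Packet p) {
        queue_.push_back(std::move(p));
        if (queue_.size() >= prebuffer_) playing_ = true;
    }

    // Call once per playback tick. Returns true and fills `out` when a
    // packet is ready; returns false while the buffer is (re)filling.
    bool pop(Packet& out) {
        if (!playing_) return false;
        if (queue_.empty()) {      // underrun: stop and wait for a refill
            playing_ = false;
            return false;
        }
        out = std::move(queue_.front());
        queue_.pop_front();
        return true;
    }

    bool playing() const { return playing_; }
    std::size_t queued() const { return queue_.size(); }

private:
    std::size_t prebuffer_;
    std::deque<Packet> queue_;
    bool playing_ = false;
};
```

The prebuffer count is the variable to experiment with: each extra packet held back adds 20 ms of delay but gives the network 20 ms more slack.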
If you are short by at most 10 packets per second, then a simpler scheme suggests itself: place a delay of 4ms between each packet, which should be undetectable. Run for 1 second. See how many packets have accumulated (if you only got 40 packets, this would be zero). Adjust the inter-packet delay to compensate. Continue.
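One possible way to pick the compensating delay is to spread the packets received in the last second evenly over that second. This is only a sketch: the function name interPacketDelayMs is hypothetical, and the 20 ms packet length is taken from the question. For 40 packets it yields 5 ms, close to the 4 ms starting value suggested above.

```cpp
#include <cassert>

// Duration of one packet in milliseconds (20 ms, per the question).
const double kPacketMs = 20.0;

// Hypothetical helper: given how many packets arrived during the last
// second, return the pause (in ms) to insert after each packet so that
// the received audio is spread evenly across the whole second.
double interPacketDelayMs(int packetsLastSecond) {
    assert(packetsLastSecond > 0);
    const double slotMs = 1000.0 / packetsLastSecond;  // time available per packet
    const double delay = slotMs - kPacketMs;           // what is left after playing it
    return delay > 0.0 ? delay : 0.0;                  // never a negative pause
}
```

Re-evaluating this once per second gives the "adjust and continue" loop: when the full 50 packets arrive the delay drops to zero.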

Related

How to play audio stream over UDP?

I am writing a Windows application that receives audio data from an Android app. I use UDP to transfer the data over the LAN, and RtAudio to play the audio stream.
Every UDP packet payload is an array of audio samples in 32 kHz, 16-bit PCM format.
When the data size is 576 bytes (288 samples), everything is OK and we hear a clear voice.
But when the data size is 192 bytes (96 samples), the sound is not clear.
Has anyone else had this problem?
It is a balancing act to determine optimum size of each buffer packet ... too large and you progressively move away from real time response yet too small and the code spends proportionately too much time negotiating the boilerplate plumbing of simply transferring the data. Looks like you have hit this lower boundary when as you say 192 bytes starts acting up.
This is true independent of the transport mechanism. Also keep in mind that the wall-clock duration of a few hundred bytes of audio is tiny (CD-quality audio runs at 44,100 samples per second per channel), so you will not lose much real-time responsiveness by giving yourself more headroom above the lower bound you have hit.
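To put numbers on that trade-off, here is a small sketch that converts a payload size into playback time and packet rate. The function names are made up for illustration, and mono audio is assumed because the question equates 576 bytes with 288 16-bit samples.

```cpp
#include <cassert>
#include <cstddef>

// Playback time, in milliseconds, of one payload of mono PCM.
// bytesPerSample is 2 for the 16-bit format mentioned in the question.
double payloadMs(std::size_t payloadBytes, int sampleRateHz, int bytesPerSample = 2) {
    assert(sampleRateHz > 0 && bytesPerSample > 0);
    const double samples = static_cast<double>(payloadBytes) / bytesPerSample;
    return 1000.0 * samples / sampleRateHz;
}

// How many packets per second sender and receiver must handle to keep
// playback continuous at that payload size.
double packetsPerSecond(std::size_t payloadBytes, int sampleRateHz, int bytesPerSample = 2) {
    return 1000.0 / payloadMs(payloadBytes, sampleRateHz, bytesPerSample);
}
```

At 32 kHz, payloadMs(576, 32000) is 9 ms per packet, while payloadMs(192, 32000) is only 3 ms, so the smaller payload triples the number of packets that must arrive and be processed on time.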

GOP size for realtime video stream

I'm working on a kind of rich remote desktop system, with a video stream of the desktop encoded using avcodec/x264. I have to set manually the GOP size for the stream, and so far I was using a size of fps/2.
But I've just read the following on Wikipedia:
This structure [Group Of Pictures] suggests a problem because the fourth frame (a P-frame) is needed in order to predict the second and the third (B-frames). So we need to transmit the P-frame before the B-frames and it will delay the transmission (it will be necessary to keep the P-frame).
It means I'm creating a lot of latency, since the client needs to receive at least half of the GOP to output the first frame following the I-frame. What is the best strategy for the GOP size if I want the smallest latency possible? A GOP of 1 picture?
If you want to minimize latency with h264, you should generally avoid b-frames. This way the decoder has at least a chance to emit decoded frames early. This prevents decoder-induced latency.
You may also want to tune the encoder for latency by reducing or disabling look-ahead. x264 has a "zerolatency" tune, which should be a good starting point for finding your optimal settings.
The "GOP" size (which afaik is not really defined for h264; I'll just assume you mean the I(DR)-frame interval) does not necessarily affect the latency. This parameter only affects how long a client will have to wait until it can "sync" on the stream (time-to-first-picture).
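The reordering described in the Wikipedia quote can be illustrated with a toy model. This sketch is not how x264 or avcodec work internally; the function names and the representation of frame types as a string of 'I', 'P' and 'B' characters are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Given frame types in display order (e.g. "IBBP"), return the frame
// indices in the order they must be transmitted: every reference frame
// (I or P) goes out before the B-frames that precede it on screen.
std::vector<std::size_t> transmissionOrder(const std::string& display) {
    std::vector<std::size_t> order;
    std::vector<std::size_t> pendingB;  // B-frames waiting for their next reference
    for (std::size_t i = 0; i < display.size(); ++i) {
        const char type = display[i];
        assert(type == 'I' || type == 'P' || type == 'B');
        if (type == 'B') {
            pendingB.push_back(i);
        } else {
            order.push_back(i);
            order.insert(order.end(), pendingB.begin(), pendingB.end());
            pendingB.clear();
        }
    }
    // Trailing B-frames with no later reference in this string go last.
    order.insert(order.end(), pendingB.begin(), pendingB.end());
    return order;
}

// Worst case, in frame intervals, that any frame waits because a later
// frame had to be sent first. Zero when the stream has no B-frames.
std::size_t reorderDelayFrames(const std::string& display) {
    const std::vector<std::size_t> order = transmissionOrder(display);
    std::size_t latestSent = 0;
    std::size_t worst = 0;
    for (std::size_t idx : order) {
        if (idx > latestSent) latestSent = idx;
        if (latestSent - idx > worst) worst = latestSent - idx;
    }
    return worst;
}
```

For "IBBP" the transmission order is 0, 3, 1, 2 and the second frame waits two frame intervals, whereas for "IPPP" of any length the delay is zero. In this model the delay comes from the B-frames, not from the distance between I-frames.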

Are there any constraints to encode a audio signal?

I capture PCM sound at some sampling rate, e.g. 24 kHz. I need to encode it using some codec (I use Opus for that) to send over the network. I noticed that at some of the sampling rates I use for encoding with Opus, I often hear some extra "cracking" noise at the receiving end. At other rates, it sounds OK. That might be an implementation bug, but I thought there might also be some constraints that I don't know about.
I also noticed that if I use another sampling rate while decoding Opus-encoded audio stream, I get a lower or higher pitch of sound, which seems logical to me. So I've read, that I need to resample on the other end, if receiving side doesn't support the original PCM sampling rate.
So I have 2 questions regarding all this:
Are there any constraints on sampling rate (or other parameters) of audio encoding? (Like I have a 24kHz pcm sound - maybe there are certain sample rates to use with it?)
Are there any common techniques to provide the same sound quality at both sides when sending audio stream over network?
The crackling noises are most likely a bug, since there are no limitations on the sample rate that would result in this kind of noise (there are other kinds of signal changes that come with sample rate conversion, especially when downsampling to a lower sample rate, but definitely not crackling).
A wild guess would be, that there is something wrong with the input buffer. Crackling often occurs if samples are omitted or duplicated, oftentimes the result of the boundaries of subsequent buffers not being correct.
Sending audio data over a network in real time will require compression, no matter what. The required data rate is simply too high. There are codecs which provide lossless audio compression (e.g. FLAC), but their compression ratio is low compared to e.g. Opus.
The problem was solved by buffering packets at the receiving end and writing them to the soundcard buffer as soon as a certain amount had accumulated. The 'crackling' noise was then most likely due to the gaps between subsequent frames that were sent to the soundcard buffer.
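On the question of constraints: as far as I know, the Opus encoder and decoder are created for one of five sample rates (8, 12, 16, 24 or 48 kHz) and work on frames of 2.5, 5, 10, 20, 40 or 60 ms. The sketch below checks those values before any audio is handed to the codec. The function names are hypothetical and are not part of libopus.

```cpp
#include <cassert>

// Sample rates that opus_encoder_create() / opus_decoder_create() accept.
bool isOpusSampleRate(int hz) {
    return hz == 8000 || hz == 12000 || hz == 16000 || hz == 24000 || hz == 48000;
}

// Samples per channel in one frame of the given duration. Durations are
// in tenths of a millisecond so that 2.5 ms can be written as 25.
// Returns 0 if the rate or the duration is not one Opus accepts.
int opusFrameSamples(int hz, int durationTenthsMs) {
    assert(durationTenthsMs > 0);
    if (!isOpusSampleRate(hz)) return 0;
    switch (durationTenthsMs) {
        case 25: case 50: case 100: case 200: case 400: case 600:
            return hz * durationTenthsMs / 10000;
        default:
            return 0;
    }
}
```

For the 24 kHz capture mentioned in the question, a 20 ms frame is opusFrameSamples(24000, 200), i.e. 480 samples per channel. Audio captured at any other rate, such as 44.1 kHz, would need to be resampled first.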

Pre-loading audio buffers - what is reasonable and reliable?

I am converting an audio signal processing application from Win XP to Win 7 (at least). You can imagine it is a sonar application - a signal is generated and sent out, and a related/modified signal is read back in. The application wants exclusive use of the audio hardware, and cannot afford glitches - we don't want to read headlines like "Windows beep causes missile launch".
Looking at the Windows SDK audio samples, the most relevant one to my case is the RenderExclusiveEventDriven example. Outside the audio engine, it prepares 10 seconds of audio to play, which it provides in 10ms chunks to the rendering engine via an IAudioRenderClient object's GetBuffer() and ReleaseBuffer(). It first uses these functions to pre-load a single 10ms chunk of audio, then relies on regular 10ms events to load subsequent chunks.
Hopefully this means there will always be 10-20ms of audio data buffered. How reliable (i.e. glitch-free) should we expect this to be on reasonably modern hardware (less than 18months old)?
Previously, one could readily pre-load at least half a second's worth of audio via the waveXXX() API, so that if Windows got busy elsewhere, audio continuity was less likely to be affected. 500ms seems like a much safer margin than 10-20ms... but if you want both event-driven and exclusive-mode, the IAudioRenderClient documentation doesn't exactly make it clear whether it is possible to pre-load more than a single IAudioRenderClient buffer's worth.
Can anyone confirm if more extensive pre-loading is still possible? Is it recommended, discouraged or neither?
If you are worried about launching missiles, I don't think you should be using Windows or any other non Real-Time operating system.
That said, we are working on another application that consumes a much higher bandwidth of data (400 MB/s continuously for hours or more). We have seen glitches where the operating system becomes unresponsive for up to 5 seconds, so we have large buffers on the data acquisition hardware.
As with everything else in computing, the wider you go, the more you increase both throughput and latency.
I'd say a 512 sample buffer is the minimum typically used for applications that are not demanding latency-wise. I've seen buffers of up to 4k samples. Memory-wise that's still pretty much nothing for contemporary devices: a mere 8 kilobytes of memory per channel for 16 bit audio. You get better playback stability and waste fewer CPU cycles. For audio applications that means you can process more tracks with more DSP before audio begins skipping.
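The figures above follow from two one-line formulas, sketched here. The function names are made up for illustration, and the sample rate is a parameter because the answer does not state one.

```cpp
#include <cassert>
#include <cstddef>

// Time one buffer of `frames` samples per channel takes to play, in ms.
double bufferMs(std::size_t frames, int sampleRateHz) {
    assert(sampleRateHz > 0);
    return 1000.0 * static_cast<double>(frames) / sampleRateHz;
}

// Memory needed for one channel of the buffer, in bytes.
std::size_t bufferBytesPerChannel(std::size_t frames, int bitsPerSample) {
    assert(bitsPerSample > 0 && bitsPerSample % 8 == 0);
    return frames * static_cast<std::size_t>(bitsPerSample / 8);
}
```

bufferBytesPerChannel(4096, 16) gives the 8 kilobytes mentioned above, and bufferMs() shows how much playback time a given buffer size represents at a chosen sample rate.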
On the other end, I've seen only a few professional audio interfaces which could handle 32 sample buffers. Most are able to achieve 128 samples. Naturally, you are still limited to a lower channel and effect count: even with professional hardware you increase buffering as your project gets larger, then lower it again and disable tracks or effects when you need "real time" to capture a performance. In terms of lowest possible latency, the same box is actually capable of achieving lower latency with Linux and a custom real-time kernel than on Windows, where you are not allowed to do such things. Keep in mind a 64 sample buffer might sound like about 1.5 msec of latency in theory (at 44.1 kHz), but in reality it is more like double, because you have both input and output latency plus the processing latency.
For a music player, where latency is not an issue, you are perfectly fine with a larger buffer. For stuff like games you need to keep it lower for the sake of keeping some synchronization between what's going on visually and the sound: you simply cannot have your sound lag half a second behind the action. For capturing a music performance together with already recorded material you need to keep latency low. You should never use a smaller buffer than your use case requires, because a small buffer will needlessly add to CPU use and to the odds of getting audio drop-outs. 4k buffering for an audio player is just fine if you can live with roughly a tenth of a second of latency (at 44.1 kHz) between the moment you hit play and the moment you hear the song starting.
I've done a "kind of a hybrid" solution in my DAW project - since I wanted to employ GPGPU for its tremendous performance relative to the CPU I've split the work internally with two processing paths - 64 samples buffer for real time audio which is processed on the CPU, and another considerably wider buffer size for the data which is processed by the GPU. Naturally, they both come out through the "CPU buffer" for the sake of being synchronized perfectly, but the GPU path is "processed in advance" thus allowing higher throughput for already recorded data, and keeping CPU use lower so the real time audio is more reliable. I am honestly surprised professional DAW software hasn't taken this path yet, but not too much, knowing how much money the big fishes of the industry make on hardware that is much less powerful than a modern midrange GPU. They've been claiming that "latency is too much with GPUs" ever since Cuda and OpenCL came out, but with pre-buffering and pre-processing that is really not an issue for data which is already recorded, and increases the size of a project which the DAW can handle tremendously.
The short answer is yes, you can preload a larger amount of data.
This example uses a call to GetDevicePeriod to return the minimum service interval for the device (in 100-nanosecond units) and then passes that value along to Initialize. You can pass a larger value if you wish.
The down side to increasing the period is that you're increasing the latency. If you are just playing a waveform back and aren't planning on making changes on the fly then this is not a problem. But if you had a sine generator for example, then the increased latency means that it would take longer for you to hear a change in frequency or amplitude.
Whether or not you have drop-outs depends on a number of things. Are you setting thread priorities appropriately? How small is the buffer? How much CPU are you using preparing your samples? In general though, a modern CPU can handle pretty low latency. For comparison, ASIO audio devices run perfectly fine at 96kHz with a 2048 sample buffer (about 21 milliseconds) with multiple channels, no problem. ASIO uses a similar double buffering scheme.
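When experimenting with a larger period, it helps to convert between milliseconds, the 100-nanosecond units that WASAPI uses for REFERENCE_TIME values, and frames. The sketch below does only the arithmetic; the type alias and function names are stand-ins defined here so that it builds without the Windows headers, and they are not part of the WASAPI API.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the Windows REFERENCE_TIME type: a 64-bit count of
// 100-nanosecond units.
using ReferenceTime100ns = std::int64_t;

const std::int64_t kUnitsPerSecond = 10000000;  // 100 ns units in one second
const std::int64_t kUnitsPerMs = 10000;         // 100 ns units in one millisecond

// Convert a duration in milliseconds to 100 ns units, rounding to nearest.
ReferenceTime100ns msToReferenceTime(double ms) {
    assert(ms >= 0.0);
    return static_cast<ReferenceTime100ns>(ms * kUnitsPerMs + 0.5);
}

// Number of audio frames that fit in one period, rounded to nearest.
std::int64_t framesPerPeriod(ReferenceTime100ns period, int sampleRateHz) {
    assert(period > 0 && sampleRateHz > 0);
    return (period * sampleRateHz + kUnitsPerSecond / 2) / kUnitsPerSecond;
}
```

A 10 ms period is msToReferenceTime(10.0), i.e. 100000 units, which corresponds to 480 frames at 48 kHz. Doubling the period doubles both the safety margin and the added latency.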
This is too long to be a comment, so it may as well be an answer (with qualifications).
Although it was edited out of final form of the question I submitted, what I had intended by "more extensive pre-loading" was not about the size of buffers used, so much as the number of buffers used. The (somewhat unexpected) answers that resulted all helped widen my understanding.
But I was curious. In the old waveXXX() world, it was possible to "pre-load" multiple buffers via waveOutPrepareHeader() and waveOutWrite() calls, the first waveOutWrite() of which would start playback. My old app "pre-loaded" 60 buffers out of a set of 64 in one burst, each with 512 samples played at 48kHz, creating over 600ms of buffering in a system with a cycle of 10.66ms.
Using multiple IAudioRenderClient::GetBuffer() and IAudioRenderClient::ReleaseBuffer() calls prior to IAudioClient::Start() in the WASAPI world, it appears that the same is still possible... at least on my (very ordinary) hardware, and without extensive testing (yet). This is despite the documentation strongly suggesting that exclusive, event-driven audio is strictly a double-buffering system.
I don't know that anyone should set out to exploit this by design, but I thought I'd point out that it may be supported.

XAudio2 delay with small buffer size

I'm writing a video player. For the audio part I'm using XAudio2. I have a separate thread that waits for the BufferEnd event, then fills the buffer with new data and calls SubmitSourceBuffer.
The problem is that XAudio2 (or the driver, or the sound card) has huge delays before playing the next buffer if the buffer size is small (1024 bytes). I made measurements, and XAudio2 takes up to twice as long as expected to play such a chunk. (A 1024 byte chunk of 48 kHz raw 2-channel PCM should be played in about 5ms, but on my computer it takes up to 10ms.) There are nearly no delays if I make the buffer 4 kbytes or more.
I need such a small buffer to be able to synchronize with the video clock or an external clock (like ffplay does). If I make my buffer too big, the end user will hear a lot of noise in the output due to the synchronization.
I have also measured all my functions that decode and synchronize audio, and anything else that could block or introduce delays. They take 0 or 1 ms to execute, so I am sure they are not the problem.
Does anybody know what this can be and why it's happening? Can anyone check whether they see the same delay with a small buffer?
I've not experienced any delay or pause using .wav files. If you are using mp3 format, it may add silence at the beginning and end of the sound during the compress operation thus causing a delay in your sound playing. See this post for more information.