How to play audio stream over UDP? - c++

I writing a Windows application, It receives audio data from an Android app, I use UDP to transfer data over LAN, and use RtAudio to play audio-stream.
Every UDP package payload is a audio sample array, in 32k/16bit/pcm format.
When data size is 576 bytes, 288 samples in other words, every thing is OK, we can hear a clear voice.
But when data size in 192 bytes, 96 samples in other words, the sound is not clear.
Does anyone have the problem?

It is a balancing act to determine optimum size of each buffer packet ... too large and you progressively move away from real time response yet too small and the code spends proportionately too much time negotiating the boilerplate plumbing of simply transferring the data. Looks like you have hit this lower boundary when as you say 192 bytes starts acting up.
This is true independent of transport mechanism. Also keep in mind the wall clock duration consumed when listening to a few hundred bytes is tiny (typically 44,100 samples per second for CD quality mono audio) so you will not loose much in the real time aspect to give yourself more than that lower bound you have hit.

Related

Is there a standardized method to send JPG images over network with TCP/IP?

I am looking for a standardized approach to stream JPG images over the network. Also desirable would be a C++ programming interface, which can be easily integrated into existing software.
I am working on a GPGPU program which processes digitized signals and compresses them to JPG images. The size of the images can be defined by a user, typically the images are 1024 x 2048 or 2048 x 4096 pixel. I have written my "own" protocol, which first sends a header (image size in bytes, width, height and corresponding channel) and then the JPG data itself via TCP. After that the receiver sends a confirmation that all data are received and displayed correctly, so that the next image can be sent. So far so good, unfortunately my approach reaches just 12 fps, which does not satisfy the project requirements.
I am sure that there are better approaches that have higher frame rates. Which approach do streaming services like Netflix and Amzon take for UHD videos? Of course I googled a lot, but I couldn't find any satisfactory results.
Is there a standardized method to send JPG images over network with TCP/IP?
There are several internet protocols that are commonly used to transfer files over TCP. Perhaps the most commonly used protocol is HTTP. Another, older one is FTP.
Which approach do streaming services like Netflix and Amzon take for UHD videos?
Firstly, they don't use JPEG at all. They use some video compression codec (such as MPEG), that does not only compress the data spatially, but also temporally (successive frames tend to hold similar data). An example of the protocol that they might use to stream the data is DASH, which is operates over HTTP.
I don't have a specific library in mind that already does these things well, but some items to keep in mind:
Most image / screenshare/ video streaming applications use exclusively UDP, RTP,RTSP for the video stream data, in a lossy fashion. They use TCP for control flow data, like sending key commands, or communication between client / server on what to present, but the streamed data is not TCP.
If you are streaming video, see this.
Sending individual images you just need efficient methods to compress, serialize, and de-serialize, and you probably want to do so in a batch fashion instead of just one at a time.Batch 10 jpegs together, compress them, serialize them, send.
You mentioned fps so it sounds like you are trying to stream video and not just copy over images in fast way. I'm not entirely sure what you are trying to do. Can you elaborate on the digitized signals and why they have to be in jpeg? Can they not be in some other format, later converted to jpeg at the receiving end?
This is not a direct answer to your question, but a suggestion that you will probably need to change how you are sending your movie.
Here's a calculation: Suppose you can get 1Gb/s throughput out of your network. If each 2048x4096 file compresses to about 10MB (80Mb), then:
1000000000 ÷ (80 × 1000000) = 12.5
So, you can send about 12 frames a second. This means if you have a continuous stream of JPGs you want to display, if you want faster frame rates, you need a faster network.
If your stream is a fixed length movie, then you could buffer the data and start the movie after enough data is buffered to allow playback at desired frame rate sooner than waiting for the entire movie to download. If you want playback at 24 frames a second, then you will need to buffer at least 2/3rds of the movie before you being playback, because the the playback is twice is fast as your download speed.
As stated in another answer, you should use a streaming codec so that you can also take advantage of compression between successive frames, rather than just compressing the current frame alone.
To sum up, your playback rate will be limited by the number of frames you can transfer per second if the stream never ends (e.g., a live stream).
If you have a fixed length movie, buffering can be used to hide the throughput and latency bottlenecks.

GOP size for realtime video stream

I'm working on a kind of rich remote desktop system, with a video stream of the desktop encoded using avcodec/x264. I have to set manually the GOP size for the stream, and so far I was using a size of fps/2.
But I've just read the following on Wikipedia:
This structure [Group Of Picture# suggests a problem because the fourth frame (a P-frame) is needed in order to predict the second and the third (B-frames). So we need to transmit the P-frame before the B-frames and it will delay the transmission (it will be necessary to keep the P-frame).
It means I'm creating a lot of latency since the client needs to receive at least half of the GOP to output the first frame following the I frame. What is the best strategy for the GOP size if I want the smallest latency possible ? A gop of 1 picture ?
If you want to minimize latency with h264, you should generally avoid b-frames. This way the decoder has at least a chance to emit decoded frames early. This prevents decoder-induced latency.
You may also want to tune the encoder for latency, by reducing/disabling look-ahead. x264 has a "zero-latency" setting which should be a good starting point for finding you optimal settings.
The "GOP" size (which afaik is not really defined for h264; I'll just assume you mean the I(DR)-frame interval) does not necessarily affect the latency. This parameter only affects how long a client will have to wait until it can "sync" on the stream (time-to-first-picture).

Pre-loading audio buffers - what is reasonable and reliable?

I am converting an audio signal processing application from Win XP to Win 7 (at least). You can imagine it is a sonar application - a signal is generated and sent out, and a related/modified signal is read back in. The application wants exclusive use of the audio hardware, and cannot afford glitches - we don't want to read headlines like "Windows beep causes missile launch".
Looking at the Windows SDK audio samples, the most relevant one to my case is the RenderExclusiveEventDriven example. Outside the audio engine, it prepares 10 seconds of audio to play, which provides it in 10ms chunks to the rendering engine via an IAudioRenderClient object's GetBuffer() and ReleaseBuffer(). It first uses these functions to pre-load a single 10ms chunk of audio, then relies on regular 10ms events to load subsequent chunks.
Hopefully this means there will always be 10-20ms of audio data buffered. How reliable (i.e. glitch-free) should we expect this to be on reasonably modern hardware (less than 18months old)?
Previously, one readily could pre-load at least half a second worth of audio into via the waveXXX() API, so that if Windows got busy elsewhere, audio continuity was less likely to be affected. 500ms seems like a lot safer margin than 10-20ms... but if you want both event-driven and exclusive-mode, the IAudioRenderClient documentation doesn't exactly make it clear if it is or is not possible to pre-load more than a single IAudioRenderClient buffer worth.
Can anyone confirm if more extensive pre-loading is still possible? Is it recommended, discouraged or neither?
If you are worried about launching missiles, I don't think you should be using Windows or any other non Real-Time operating system.
That said, we are working on another application that consumes a much higher bandwidth of data (400 MB/s continuously for hours or more). We have seen glitches where the operating system becomes unresponsive for up to 5 seconds, so we have large buffers on the data acquisition hardware.
Like with everything else in computing, the wider you go you:
increase throughput
increase latency
I'd say 512 samples buffer is the minimum typically used for non-demanding latency wise applications. I've seen up to 4k buffers. Memory use wise that's still pretty much nothing for contemporary devices - a mere 8 kilobytes of memory per channel for 16 bit audio. You have better playback stability and lower waste of CPU cycles. For audio applications that means you can process more tracks with more DSP before audio begins skipping.
On the other end - I've seen only a few professional audio interfaces, which could handle 32 sample buffers. Most are able to achieve 128 samples, naturally you are still limited to lower channel and effect count, even with professional hardware you increase buffering as your project gets larger, lower it back and disable tracks or effects when you need "real time" to capture a performance. In terms of lowest possible latency actually the same box is capable of achieving lower latency with Linux and a custom real time kernel than on Windows where you are not allowed to do such things. Keep in mind a 64 sample buffer might sound like 8 msec of latency in theory, but in reality it is more like double - because you have both input and output latency plus the processing latency.
For a music player where latency is not an issue you are perfectly fine with a larger buffer, for stuff like games you need to keep it lower for the sake of still having a degree of synchronization between what's going on visually and the sound - you simply cannot have your sound lag half a second behind the action, for music performance capturing together with already recorded material you need to have latency low. You should never go above what your use case requires, because a small buffer will needlessly add to CPU use and the odds of getting audio drop outs. 4k buffering for an audio player is just fine if you can live with half a second of latency between the moment you hit play and the moment you hear the song starting.
I've done a "kind of a hybrid" solution in my DAW project - since I wanted to employ GPGPU for its tremendous performance relative to the CPU I've split the work internally with two processing paths - 64 samples buffer for real time audio which is processed on the CPU, and another considerably wider buffer size for the data which is processed by the GPU. Naturally, they both come out through the "CPU buffer" for the sake of being synchronized perfectly, but the GPU path is "processed in advance" thus allowing higher throughput for already recorded data, and keeping CPU use lower so the real time audio is more reliable. I am honestly surprised professional DAW software hasn't taken this path yet, but not too much, knowing how much money the big fishes of the industry make on hardware that is much less powerful than a modern midrange GPU. They've been claiming that "latency is too much with GPUs" ever since Cuda and OpenCL came out, but with pre-buffering and pre-processing that is really not an issue for data which is already recorded, and increases the size of a project which the DAW can handle tremendously.
The short answer is yes, you can preload a larger amount of data.
This example uses a call to GetDevicePeriod to return the minimum service interval for the device (in nano seconds) and then passes that value along to Initialize. You can pass a larger value if you wish.
The down side to increasing the period is that you're increasing the latency. If you are just playing a waveform back and aren't planning on making changes on the fly then this is not a problem. But if you had a sine generator for example, then the increased latency means that it would take longer for you to hear a change in frequency or amplitude.
Whether or not you have drop outs depends on a number of things. Are you setting thread priorities appropriately? How small is the buffer? How much CPU are you using preparing your samples? In general though, a modern CPU can handle a pretty low-latency. For comparison, ASIO audio devices run perfectly fine at 96kHz with a 2048 sample buffer (20 milliseconds) with multiple channels - no problem. ASIO uses a similar double buffering scheme.
This is too long to be a comment, so it may as well be an answer (with qualifications).
Although it was edited out of final form of the question I submitted, what I had intended by "more extensive pre-loading" was not about the size of buffers used, so much as the number of buffers used. The (somewhat unexpected) answers that resulted all helped widen my understanding.
But I was curious. In the old waveXXX() world, it was possible to "pre-load" multiple buffers via waveOutPrepareHeader() and waveOutWrite() calls, the first waveOutWrite() of which would start playback. My old app "pre-loaded" 60 buffers out of a set of 64 in one burst, each with 512 samples played at 48kHz, creating over 600ms of buffering in a system with a cycle of 10.66ms.
Using multiple IAudioRenderClient::GetBuffer() and IAudioRenderClient::ReleaseBuffer() calls prior to IAudioCient::Start() in the WASAPI world, it appears that the same is still possible... at least on my (very ordinary) hardware, and without extensive testing (yet). This is despite the documentation strongly suggesting that exclusive, event-driven audio is strictly a double-buffering system.
I don't know that anyone should set out to exploit this by design, but I thought I'd point out that it may be supported.

XAudio2 delay with small buffer size

I'm writing a video player. For audio part i'm using XAudio2. For this i have separate thread that is waiting for BufferEnd event and after this fills buffer with new data and call SubmitSourceBuffer.
The problem is that XAudio2(driver or sound card) has huge delays before playing next buffer if buffer size is small (1024 bytes). I made measurements and XAudio takes up to two times long for play such chunk. (1024 bytes chunk of 48khz raw 2-channeled pcm should be played in nearly 5ms, but on my computer it's played up to 10ms). And nearly no delays if i make buffer 4kbytes or more.
I need such small buffer to be able making synchronizations with video clock or external clock (like ffplay does). If i make my buffer too big then end-user will hear lot of noises in output due to synchronization stuff.
Also i have made measurements on all my functions that are decoding and synchronizing audio or anything else that could block or produce delays, they take 0 or 1 ms to execute, so they are not the problem 100%.
Does anybody know what can it be and why it's happenning? Can anyone check if he has same delay problems with small buffer?
I've not experienced any delay or pause using .wav files. If you are using mp3 format, it may add silence at the beginning and end of the sound during the compress operation thus causing a delay in your sound playing. See this post for more information.

openAL - choppy sound when playing the buffers

I have a voice chat which receives rtp packets (each packet contains 20ms of voice afaik), adds them to a buffer and plays it out.
If I call alSourcePlay() directly after buffering a packet(I have 5 buffers and each buffer gets one packet, which are then re-used once the packets are played), the sound will be 'choppy' since it will play out the buffer before another packet arrives.
My question is how do you deal with this so that audio isn't played as choppy?
If you are, on average, getting less than 50 20ms packets per minute then there has to be pauses somewhere. If you store the packets for a while before playing them, then you can look for natural pauses ( silence ) and combine the gaps with the natural pauses so things sound more natural. The more you store the better playback will sound, but do it too much and the delay will become unpleasant.
The amount of buffering you need is a matter of taste. Which is uglier, a choppy sound or a delayed response. I guess you will have to design it so it is a variable and and then experiment to find the 'happy medium'
If you are short, at a maximum, of 10 packets per second, then a simpler scheme suggests itself: Place a delay of 4ms between each packet, which should be undetectable. Run for 1 second. See how many packets have accumulated ( if you only go 40 packets, this would be zero ) Adjust the inter-packet delay to compensate. Continue.