XAudio2 delay with small buffer size - c++

I'm writing a video player. For the audio part I'm using XAudio2: I have a separate thread that waits for the BufferEnd event, then fills the buffer with new data and calls SubmitSourceBuffer.
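The pattern looks roughly like this (a simplified sketch; FillNextChunk stands in for my decode step and is not shown):

    #include <windows.h>
    #include <xaudio2.h>

    void FillNextChunk(BYTE* dst, size_t bytes);  // decode step (placeholder)

    // Voice callback that signals an event every time a buffer finishes.
    struct StreamingCallback : IXAudio2VoiceCallback {
        HANDLE bufferEnd = CreateEvent(nullptr, FALSE, FALSE, nullptr);
        void __stdcall OnBufferEnd(void*) override { SetEvent(bufferEnd); }
        // The remaining callbacks are not needed for this pattern.
        void __stdcall OnVoiceProcessingPassStart(UINT32) override {}
        void __stdcall OnVoiceProcessingPassEnd() override {}
        void __stdcall OnStreamEnd() override {}
        void __stdcall OnBufferStart(void*) override {}
        void __stdcall OnLoopEnd(void*) override {}
        void __stdcall OnVoiceError(void*, HRESULT) override {}
    };

    // Feeder thread: wait for the current buffer to finish, then decode
    // and submit the next 1024-byte chunk.
    void FeederThread(IXAudio2SourceVoice* voice, StreamingCallback* cb) {
        static BYTE chunk[1024];
        for (;;) {
            WaitForSingleObject(cb->bufferEnd, INFINITE);
            FillNextChunk(chunk, sizeof(chunk));
            XAUDIO2_BUFFER buf = {};
            buf.AudioBytes = sizeof(chunk);
            buf.pAudioData = chunk;
            voice->SubmitSourceBuffer(&buf);
        }
    }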
The problem is that XAudio2 (or the driver, or the sound card) introduces huge delays before playing the next buffer if the buffer is small (1024 bytes). I made measurements, and XAudio2 takes up to twice as long to play such a chunk: a 1024-byte chunk of 48 kHz raw 2-channel PCM should play in roughly 5 ms, but on my computer it takes up to 10 ms. There are nearly no delays if I make the buffer 4 KB or larger.
I need such a small buffer to be able to synchronize with the video clock or an external clock (like ffplay does). If I make the buffer too big, the end user will hear a lot of noise in the output due to the synchronization adjustments.
I have also measured all of my functions that decode and synchronize audio, and anything else that could block or introduce delays; they take 0-1 ms to execute, so they are definitely not the problem.
Does anybody know what this could be and why it's happening? Can anyone check whether they see the same delay problems with small buffers?

I've not experienced any delay or pause using .wav files. If you are using the mp3 format, it may add silence at the beginning and end of the sound during compression, thus causing a delay in playback. See this post for more information.

Related

How to play audio stream over UDP?

I'm writing a Windows application that receives audio data from an Android app. I use UDP to transfer the data over the LAN, and RtAudio to play the audio stream.
Every UDP packet payload is an array of audio samples in 32 kHz/16-bit/PCM format.
When the payload is 576 bytes (288 samples, in other words), everything is OK and we hear a clear voice.
But when the payload is 192 bytes (96 samples, in other words), the sound is not clear.
Has anyone seen this problem?
Determining the optimum size of each buffer packet is a balancing act: too large and you progressively move away from real-time response; too small and the code spends proportionately too much time negotiating the boilerplate plumbing of simply transferring the data. It looks like you have hit this lower boundary when, as you say, 192 bytes starts acting up.
This is true regardless of the transport mechanism. Also keep in mind that the wall-clock duration of a few hundred bytes of audio is tiny (CD-quality mono audio is typically 44,100 samples per second), so you will not lose much real-time responsiveness by giving yourself more headroom than the lower bound you have hit.
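For concreteness, here is the arithmetic for the two packet sizes in the question (mono 32 kHz/16-bit PCM, as stated above):

    #include <cstdio>

    int main() {
        const int sampleRate = 32000;        // 32 kHz mono
        const int bytesPerSample = 2;        // 16-bit PCM
        const int payloads[] = {576, 192};   // the two sizes from the question
        for (int payloadBytes : payloads) {
            int samples = payloadBytes / bytesPerSample;
            double ms = 1000.0 * samples / sampleRate;
            std::printf("%d bytes = %d samples = %.1f ms of audio\n",
                        payloadBytes, samples, ms);
        }
    }

That works out to 9 ms versus 3 ms of audio per packet; at 3 ms, network and scheduling jitter are on the same order as the packet itself, which is consistent with the lower bound described above.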

Playback pure tone, variable phase stream with pyaudio

I'm building an acoustic cancelling device based on PyAudio, Fourier transforms, and a C-Media USB audio card. The software is threaded, using the producer/consumer model.
The device detects pure tones in the environment: it reads chunks of microphone audio and uses a Fourier transform to detect the pure tone. So far, so good: it works like a charm.
The final step, however, is getting tricky. I'm aiming to generate a 100 ms sine wave that holds a certain number of periods of the frequency to be cancelled.
This wave buffer has to be played continuously with PyAudio on a separate thread, which must also increase the phase little by little until the detected amplitude of the tone in the environment drops. This is basically destructive interference.
My problem is that when using PyAudio's stream.write(), the buffer keeps overrunning, since I have NO IDEA what the function is doing internally. I have tried many combinations of "frame_buffer_size" and audio length, and no matter what I do, the buffer overruns.
Ideally, the buffer would not have to be recalculated with a different phase on each run... instead, I'm trying to get PyAudio to read from a different part of the buffer (a window), so that the sine wave starts from a different origin each time.
I have no idea how to do that.
Long story short, how would you:
1) create a thread to fill a circular buffer continuously with audio data.
2) create a PyAudio consumer thread that continuously reads the buffer without overrunning.
3) manipulate the volume in real time?
My output data must be 44100 Hz, little-endian, 16-bit signed int. Any hints, advice, references or suggestions will be greatly appreciated.
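A minimal sketch of the read-window idea from the question (in C++, matching the rest of this page; all names are illustrative, and a real implementation would feed read() output to the audio API): precompute one period of the tone, read the table circularly, and shift phase by jumping the read index instead of rebuilding the buffer.

    #include <atomic>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // One period of the tone, precomputed once. "Playing" is circular
    // reading of this table; shifting phase is a jump of the read index,
    // so the buffer never has to be recalculated.
    class PhasedTone {
    public:
        PhasedTone(double freqHz, int sampleRate)
            : table_(static_cast<size_t>(sampleRate / freqHz)) {
            const double kPi = 3.14159265358979323846;
            for (size_t i = 0; i < table_.size(); ++i)
                table_[i] = static_cast<int16_t>(
                    32767.0 * std::sin(2.0 * kPi * i / table_.size()));
        }

        // Consumer thread: fill one output chunk (e.g. for stream.write).
        void read(int16_t* out, size_t frames) {
            size_t p = pos_.load(std::memory_order_relaxed);
            for (size_t i = 0; i < frames; ++i)
                out[i] = table_[(p + i) % table_.size()];
            pos_.store((p + frames) % table_.size(), std::memory_order_relaxed);
        }

        // Control thread: nudge phase until the detected amplitude drops.
        void shiftPhase(size_t samples) {
            pos_.store((pos_.load(std::memory_order_relaxed) + samples)
                           % table_.size(),
                       std::memory_order_relaxed);
        }

    private:
        std::vector<int16_t> table_;  // one period; real code should handle
                                      // non-integer period lengths
        std::atomic<size_t> pos_{0};
    };

Volume (point 3) can be one more atomic scale factor applied inside read(); since the table is immutable after construction, the consumer never blocks on a producer.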

Pre-loading audio buffers - what is reasonable and reliable?

I am converting an audio signal processing application from Win XP to Win 7 (at least). You can imagine it is a sonar application - a signal is generated and sent out, and a related/modified signal is read back in. The application wants exclusive use of the audio hardware, and cannot afford glitches - we don't want to read headlines like "Windows beep causes missile launch".
Looking at the Windows SDK audio samples, the most relevant one to my case is the RenderExclusiveEventDriven example. Outside the audio engine, it prepares 10 seconds of audio to play, providing it in 10 ms chunks to the rendering engine via an IAudioRenderClient object's GetBuffer() and ReleaseBuffer(). It first uses these functions to pre-load a single 10 ms chunk of audio, then relies on regular 10 ms events to load subsequent chunks.
Hopefully this means there will always be 10-20 ms of audio data buffered. How reliable (i.e. glitch-free) should we expect this to be on reasonably modern hardware (less than 18 months old)?
Previously, one could readily pre-load at least half a second's worth of audio via the waveXXX() API, so that if Windows got busy elsewhere, audio continuity was less likely to be affected. 500 ms seems like a much safer margin than 10-20 ms... but if you want both event-driven and exclusive mode, the IAudioRenderClient documentation doesn't exactly make it clear whether it is or is not possible to pre-load more than a single IAudioRenderClient buffer's worth.
Can anyone confirm if more extensive pre-loading is still possible? Is it recommended, discouraged or neither?
If you are worried about launching missiles, I don't think you should be using Windows or any other non-real-time operating system.
That said, we are working on another application that consumes a much higher bandwidth of data (400 MB/s continuously for hours or more). We have seen glitches where the operating system becomes unresponsive for up to 5 seconds, so we have large buffers on the data acquisition hardware.
As with everything else in computing, as you go wider you:
increase throughput
increase latency
I'd say a 512-sample buffer is the minimum typically used for applications that are not latency-critical. I've seen buffers of up to 4k samples. Memory-wise that's still practically nothing for contemporary devices - a mere 8 kilobytes per channel for 16-bit audio. You get better playback stability and fewer wasted CPU cycles. For audio applications that means you can process more tracks with more DSP before the audio begins skipping.
On the other end, I've seen only a few professional audio interfaces that can handle 32-sample buffers. Most can achieve 128 samples, but naturally you are still limited to a lower channel and effect count; even with professional hardware you increase buffering as your project gets larger, and lower it back (disabling tracks or effects) when you need "real time" to capture a performance. In terms of the lowest possible latency, the same box can actually achieve lower latency with Linux and a custom real-time kernel than on Windows, where you are not allowed to do such things. Keep in mind a 64-sample buffer might sound like 8 ms of latency in theory, but in reality it is more like double that, because you have both input and output latency plus the processing latency.
For a music player, where latency is not an issue, you are perfectly fine with a larger buffer. For things like games you need to keep it lower, for the sake of preserving some degree of synchronization between what's going on visually and the sound - you simply cannot have your sound lag half a second behind the action. For capturing a music performance alongside already-recorded material, you need latency to be low. You should never go lower than your use case requires, because a small buffer will needlessly add to CPU use and to the odds of audio dropouts. 4k buffering for an audio player is just fine if you can live with half a second of latency between the moment you hit play and the moment you hear the song start.
I've done a kind of hybrid solution in my DAW project. Since I wanted to employ GPGPU for its tremendous performance relative to the CPU, I split the work internally into two processing paths: a 64-sample buffer for real-time audio, processed on the CPU, and a considerably wider buffer for the data processed by the GPU. Naturally, both come out through the "CPU buffer" so they stay perfectly synchronized, but the GPU path is "processed in advance", allowing higher throughput for already-recorded data and keeping CPU use lower, so the real-time audio is more reliable. I am honestly surprised professional DAW software hasn't taken this path yet - but not too surprised, knowing how much money the big fish of the industry make on hardware that is much less powerful than a modern midrange GPU. They've been claiming that "latency is too high with GPUs" ever since CUDA and OpenCL came out, but with pre-buffering and pre-processing that is really not an issue for data that is already recorded, and it tremendously increases the size of project a DAW can handle.
The short answer is yes, you can preload a larger amount of data.
This example uses a call to GetDevicePeriod to return the minimum service interval for the device (a REFERENCE_TIME, in 100-nanosecond units) and then passes that value along to Initialize. You can pass a larger value if you wish.
The downside to increasing the period is that you're increasing the latency. If you are just playing a waveform back and aren't planning on making changes on the fly, then this is not a problem. But if you had a sine generator, for example, the increased latency means it would take longer for you to hear a change in frequency or amplitude.
Whether or not you get dropouts depends on a number of things. Are you setting thread priorities appropriately? How small is the buffer? How much CPU are you using to prepare your samples? In general though, a modern CPU can handle pretty low latency. For comparison, ASIO audio devices run perfectly fine at 96 kHz with a 2048-sample buffer (about 20 milliseconds) with multiple channels - no problem. ASIO uses a similar double-buffering scheme.
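The call sequence described above looks roughly like this (a sketch with error handling omitted; audioClient and mixFormat are assumed to have been obtained via the usual IMMDevice::Activate and GetMixFormat steps):

    // Ask the device for its scheduling periods (REFERENCE_TIME,
    // i.e. 100-nanosecond units).
    REFERENCE_TIME defaultPeriod = 0, minimumPeriod = 0;
    audioClient->GetDevicePeriod(&defaultPeriod, &minimumPeriod);

    // Pass something larger than the minimum to buffer more per event;
    // doubling it here is purely illustrative.
    REFERENCE_TIME period = minimumPeriod * 2;

    audioClient->Initialize(AUDCLNT_SHAREMODE_EXCLUSIVE,
                            AUDCLNT_STREAMFLAGS_EVENTCALLBACK,
                            period,   // buffer duration
                            period,   // periodicity: must equal the buffer
                                      // duration in exclusive event-driven mode
                            mixFormat,
                            nullptr);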
This is too long to be a comment, so it may as well be an answer (with qualifications).
Although it was edited out of the final form of the question I submitted, what I had intended by "more extensive pre-loading" was not so much the size of the buffers used as the number of buffers used. The (somewhat unexpected) answers that resulted all helped widen my understanding.
But I was curious. In the old waveXXX() world, it was possible to "pre-load" multiple buffers via waveOutPrepareHeader() and waveOutWrite() calls, the first waveOutWrite() of which would start playback. My old app "pre-loaded" 60 buffers out of a set of 64 in one burst, each holding 512 samples played at 48 kHz, creating over 600 ms of buffering in a system with a cycle of 10.66 ms.
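That burst pre-load looked roughly like this (simplified from memory; hWaveOut and the chunks array are assumed to be set up elsewhere):

    // Prepare and queue 60 of the 64 headers in one burst; the first
    // waveOutWrite() starts playback.
    WAVEHDR hdrs[64] = {};
    for (int i = 0; i < 60; ++i) {
        hdrs[i].lpData = reinterpret_cast<LPSTR>(chunks[i]);
        hdrs[i].dwBufferLength = 512 * sizeof(short);  // 512 samples @ 48 kHz
        waveOutPrepareHeader(hWaveOut, &hdrs[i], sizeof(WAVEHDR));
        waveOutWrite(hWaveOut, &hdrs[i], sizeof(WAVEHDR));
    }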
Using multiple IAudioRenderClient::GetBuffer() and IAudioRenderClient::ReleaseBuffer() calls prior to IAudioClient::Start() in the WASAPI world, it appears the same is still possible... at least on my (very ordinary) hardware, and without extensive testing (yet). This is despite the documentation strongly suggesting that exclusive, event-driven audio is strictly a double-buffering system.
I don't know that anyone should set out to exploit this by design, but I thought I'd point out that it may be supported.
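For reference, the pre-load pattern described would look something like this (a sketch only; kPreloadBuffers and FillNextAudioChunk are placeholders, and renderClient is the IAudioRenderClient obtained via IAudioClient::GetService):

    UINT32 framesPerBuffer = 0;
    audioClient->GetBufferSize(&framesPerBuffer);

    // Queue several buffers' worth of audio before starting the stream.
    // The documentation implies double buffering, but in practice the
    // device may accept more GetBuffer/ReleaseBuffer pairs before Start().
    for (int i = 0; i < kPreloadBuffers; ++i) {
        BYTE* data = nullptr;
        if (FAILED(renderClient->GetBuffer(framesPerBuffer, &data)))
            break;  // the device refused further pre-loading
        FillNextAudioChunk(data, framesPerBuffer);
        renderClient->ReleaseBuffer(framesPerBuffer, 0);
    }
    audioClient->Start();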

WASAPI lagging playback

I'm writing a Windows Store program in C++ that plays back the microphone. I have to modify the bits before sending them to the speakers. First I wanted to play back the microphone without any effect, but it is lagging. The frequency and bit rate are the same on both ends (24-bit, 192000 Hz), and I also tried (24-bit, 96000 Hz). I debugged it, and it seems the speaker side is faster, so it has to wait for data from the microphone - as if the speakers were running at a higher frequency, though according to the settings they aren't. Does anyone have the slightest idea what the problem is here?
When you say there is some 'lag', do you mean there is a delay between when the audio capture device captures data and when the playback device renders it, or do you mean that the audio stream is 'chopped', with small pauses in between each sample being rendered?
If there is delay in playback, I would look at the latency value with which you initialized the audio capture client.
If there are small pauses, then I would recommend double buffering the sample data, so that one buffer is being rendered while the other is being refilled from the audio capture device.
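A bare-bones illustration of that double-buffering scheme (generic C++, not tied to any particular audio API; CaptureInto and RenderFrom are hypothetical stand-ins for the real device calls):

    #include <cstdint>
    #include <thread>
    #include <utility>
    #include <vector>

    // Stand-ins for the real device I/O (hypothetical).
    void CaptureInto(int16_t*, size_t) { /* read mic into buffer */ }
    void RenderFrom(const int16_t*, size_t) { /* write buffer to speakers */ }

    int main() {
        constexpr size_t kFrames = 480;  // e.g. 10 ms at 48 kHz
        std::vector<int16_t> bufA(kFrames), bufB(kFrames);
        int16_t* capturing = bufA.data();
        int16_t* rendering = bufB.data();
        for (;;) {
            // Render one buffer while concurrently refilling the other,
            // then swap roles. Real code would keep persistent threads
            // and signal with events rather than spawn per iteration.
            std::thread render(RenderFrom, rendering, kFrames);
            CaptureInto(capturing, kFrames);
            render.join();
            std::swap(capturing, rendering);
        }
    }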

Running music as SDL_Mixer chunks

Currently, SDL_Mixer has two types of sound resources: chunks and music.
Apart from the API and supported-format limitations, are there any reasons not to load and play music as an SDL_Chunk on a channel? (memory, speed, etc.)
The API is the real issue. The "music" APIs are designed to deal with streaming compressed music, while the "sound" APIs aren't. Then again, if you manage to make it work in your app, then it works.
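To make the distinction concrete, here is a minimal sketch of the two loading routes (file name and setup values are placeholders):

    #include <SDL.h>
    #include <SDL_mixer.h>

    int main() {
        SDL_Init(SDL_INIT_AUDIO);
        Mix_OpenAudio(44100, MIX_DEFAULT_FORMAT, 2, 1024);

        // Same file, two routes:
        Mix_Chunk* chunk = Mix_LoadWAV("song.ogg");  // whole file decoded into RAM
        Mix_Music* music = Mix_LoadMUS("song.ogg");  // streamed; decoded as it plays

        Mix_PlayChannel(-1, chunk, 0);  // plays on any free mixing channel
        // Mix_PlayMusic(music, 0);     // or: the single streaming music slot
        SDL_Delay(3000);                // let it play for a moment

        Mix_FreeChunk(chunk);
        Mix_FreeMusic(music);
        Mix_CloseAudio();
        SDL_Quit();
    }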
I haven't looked at the SDL code, but my guess would be that "chunks" are intended for smaller sound samples and are cached in memory, decoded in their entirety, while "music" is streamed - not cached in memory in its entirety, but decoded and buffered as needed, on the assumption that it will, for the most part, be played from the beginning and continuously from that point, with maybe an occasional reset back to the beginning.
So the reason is memory. You don't want to decode, say, 4 minutes of a 16-bit stereo song into memory - that would eat 44100 Hz * 2 bytes * 2 channels * 4 minutes * 60 sec/min == 42,336,000 bytes - when you can decode and buffer smaller pieces of it.
OTOH, if you can spare ~10 MB of RAM per minute of music, and you need the CPU cycles that on-the-fly decoding would otherwise consume... you could probably use chunks.