I have a C++ application that reads and processes a video stream. I have two threads: one thread to read the stream and a second thread for processing. I access the stream with OpenCV VideoCapture and put frames (cv::Mat) in the readerwriterqueue buffer. From another thread, I read the frames from the buffer and process them.
Sometimes processing may take a lot of time and the processing thread starts to fall behind (while frames are put into the queue at the same speed). This grows the buffer and may eventually consume all available memory and hang the whole system. I know that Windows uses the pagefile if there is not enough RAM, but the system still becomes pretty laggy. I need to make sure this won't happen.
I thought about setting a limit on the buffer size and offloading frames to disk when the buffer is full, then reading them back into the queue when there is space. Would that work? Are there any good alternatives? How would one handle such a problem? Is my current approach (image queue) valid? Please advise.
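For reference, a minimal sketch of the bounded-buffer idea (assuming moodycamel's readerwriterqueue; kMaxFrames and the drop policy are illustrative, not what I currently have):

#include <opencv2/opencv.hpp>
#include "readerwriterqueue.h"  // moodycamel::ReaderWriterQueue

constexpr size_t kMaxFrames = 64;  // illustrative cap on buffered frames
moodycamel::ReaderWriterQueue<cv::Mat> frames(kMaxFrames);

void readerThread(cv::VideoCapture& cap) {
    cv::Mat frame;
    while (cap.read(frame)) {
        // try_enqueue fails instead of allocating when the queue is full,
        // so memory stays bounded. Clone because VideoCapture may reuse
        // the frame's underlying buffer.
        if (!frames.try_enqueue(frame.clone())) {
            // Queue full: drop the frame here, or spill it to disk.
        }
    }
}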
I recently picked up a project where I need to perform real-time sliding FFT analysis on incoming microphone data. The environment I picked for this is OpenGL with Cinder, using C++.
This is my first experience in audio programming and I am a little bit confused.
This is what I am trying to achieve in my OpenGL application:
So in every frame, there's a part of the incoming data. In a for-loop (therefore multiple passes), a window of the present data is consumed and FFT analysis is performed on it. For the next iteration of the for-loop, the window advances by "hop-size" through the data, and so on until the end of the data is reached.
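Roughly, the per-frame loop I have in mind looks like this (illustrative sketch; fft(), windowSize and hopSize are the user-configurable parts):

// One application frame's worth of samples arrives in `data`.
for (size_t pos = 0; pos + windowSize <= data.size(); pos += hopSize) {
    fft(&data[pos], windowSize);  // analyze one window, then hop forward
}
// Samples after the last full window are lost when the next frame arrives.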
Now this process must be contiguous. But as you can see in the figure above, as soon as my current app frame ends and the next frame's data comes in, I can't pick up where I left off in the previous frame (because that data is already gone). You can see it in the figure, where the blue area lies in between two frames.
Now you may say: pick the window-size / hop-size so that this never happens, but that is impossible, since these parameters should be left user-configurable in my project.
Suggestions for this kind of processing, oriented towards C++11, are also very welcome!
Thanks!
Not sure I understand your scenario 100%, but it sounds like you may want to use a circular buffer. There is no "standard" circular buffer, but there's one in Boost.
However, you'd need a lock if you plan to do the processing with 2 threads. One thread, for example, would wait on the audio input, then take the buffer lock, and copy from the audio buffer to the circular buffer. The second thread would periodically take the buffer lock and read the next k elements, if there are at least k available in the buffer...
You'd need to adjust the size of the buffer appropriately and make sure you always handle the data faster than the incoming rate to avoid losses in the circular buffer...
Not sure why you mention that the buffer is lock-free, or whether that is a requirement. I'd try the circular buffer with locks first, as it seems conceptually simpler, and only go lock-free if you have to, because the data structure could be more complicated in that case (but maybe a "producer-consumer" lock-free queue would work)...
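A rough sketch of the locked variant, assuming Boost's circular_buffer (Sample, the capacity, and the consume policy are placeholders you'd tune):

#include <boost/circular_buffer.hpp>
#include <mutex>
#include <vector>

using Sample = float;

std::mutex bufMutex;
boost::circular_buffer<Sample> ring(8192);  // size it for your data rate

// Thread 1: copy incoming audio into the ring under the lock.
void onAudioInput(const Sample* data, std::size_t n) {
    std::lock_guard<std::mutex> lock(bufMutex);
    ring.insert(ring.end(), data, data + n);  // old data falls off the front
}

// Thread 2: take the next k samples if at least k are available.
bool readBlock(std::vector<Sample>& out, std::size_t k) {
    std::lock_guard<std::mutex> lock(bufMutex);
    if (ring.size() < k)
        return false;
    out.assign(ring.begin(), ring.begin() + k);
    ring.erase(ring.begin(), ring.begin() + k);
    return true;
}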
HTH.
Thanks for posting a graphic--that illustrates the problem nicely.
All you really need here is a buffer of size (windowSize - 1) where you can store zero or more samples from the "previous" frame for processing in the "next" one. In C++ this would be:
std::vector<Sample> interframeBuffer;
interframeBuffer.reserve(windowSize - 1);
Then, when you are within windowSize samples of the end of the current frame, rather than processing the samples you store them with interframeBuffer.push_back(sample). When you start processing the next frame, you first do:
for (const Sample& sample : interframeBuffer) {
process(sample);
}
interframeBuffer.clear();
You should use a single vector the whole time, clearing it and repopulating it as needed, to avoid memory allocation. That's why we call reserve() at the top--to avoid latency later on. Calling clear() doesn't release the memory, it just resets the size() to zero.
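Putting it together, the per-frame flow might look like this (hypothetical sketch; onFrame(), process() and windowSize are placeholders, not from your code):

void onFrame(const std::vector<Sample>& frameData) {
    // First consume the samples carried over from the previous frame.
    for (const Sample& sample : interframeBuffer) {
        process(sample);
    }
    interframeBuffer.clear();  // keeps capacity, so no reallocation

    // Process everything except the final windowSize-1 samples...
    std::size_t tail = frameData.size() > windowSize - 1
                           ? frameData.size() - (windowSize - 1)
                           : 0;
    for (std::size_t i = 0; i < tail; ++i) {
        process(frameData[i]);
    }
    // ...and stash those last samples for the next frame.
    for (std::size_t i = tail; i < frameData.size(); ++i) {
        interframeBuffer.push_back(frameData[i]);
    }
}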
Using glReadPixels on a single pixel stalls the pipeline, even if I have swapped the buffers just before.
I don't need synchronization; I can do something like this:
pixel = DEFAULT_VALUE;
while (1) {
    draw(pixel);
    swapBuffers();
    pixel = glRead???;
}
How can I do this in an optimized (non-stalling) way?
You can do asynchronous pixel transfers via Pixel Buffer Objects (PBOs). When you issue a read call without PBOs, the pipeline is flushed and the CPU has to wait for the GPU to finish rendering and transferring the data. With PBOs, you provide a buffer in advance, and the data will be copied into that buffer when the GPU is ready, so it will not stall. It will, of course, stall if you try to access that buffer before it is ready (e.g. via glGetBufferSubData(), or by mapping the buffer for reading). So ideally, before reading back the data, you can queue up some other render commands, and also do some other CPU work, before accessing the buffer. The extension spec I linked has an example section, which is quite interesting.
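A minimal sketch of that path (assuming an ARB_pixel_buffer_object-capable context; the size and format are illustrative):

// Setup: create a pack buffer big enough for the pixels to read back.
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);

// Kick off the read: with a pack buffer bound, the last argument is an
// offset into the PBO, and glReadPixels returns without waiting.
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);

// ...do other GPU/CPU work here...

// Only this part can block, if the transfer hasn't finished yet.
const void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
// ...use pixels...
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);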
This stuff can also be combined with sync objects. In that case, you can add a fence sync after the read call that will copy the data into the PBO. Then, on the CPU, you can actually check whether the operation has already completed. If not, you can do some other work and check back later.
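Something like this, as a sketch of the fence-sync variant:

// Right after the glReadPixels into the PBO:
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// Later, poll instead of blocking:
GLint status = 0;
glGetSynciv(fence, GL_SYNC_STATUS, sizeof(status), nullptr, &status);
if (status == GL_SIGNALED) {
    // Transfer is done; mapping the PBO now will not stall.
    glDeleteSync(fence);
} else {
    // Not ready yet: do some other work and check back later.
}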
The main problem with all these asynchronous transfers is that you trade throughput for latency. So if you need that pixel value immediately, and don't have any other work for the GPU and CPU that can be done in between, there is not much to gain, and you cannot really avoid the stall.
I have a Firewire camera whose driver software deposits incoming images into a circular buffer of 16 images. I would like to avoid re-buffering these images and just write them to disk as fast as possible. So I would prefer to just enqueue a pointer to each buffer as it is filled, and have a separate disk-write thread that keeps far enough ahead of the incoming images to be confident it will write them out to disk before they are overwritten.
Clearly this would depend on the image size and frame rate... but in principle, for VGA images at 30 frames per second, we're talking about needing to write 640 × 480 × 3 bytes × 30 fps ≈ 27.6 MB/sec. This seems quite achievable, particularly if the writing thread can decide to drop occasional frames to keep far enough ahead of the incoming images, and, if that strategy fails, at least detect that an overwrite has invalidated an image and signal it appropriately (e.g. delete the file after completion).
Comments on the validity of this strategy are welcome... but what I really want to know is which disk-writing functions should be used, for maximum efficiency, to get the disk-writing rate as high as possible. E.g. CreateFile() using FILE_FLAG_NO_BUFFERING + WriteFile()?
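i.e. something along these lines (an untested sketch; note that FILE_FLAG_NO_BUFFERING requires the buffer address and the write size to be multiples of the volume's sector size, which page-aligned memory from VirtualAlloc typically satisfies):

#include <windows.h>

bool writeFrame(const wchar_t* path, const void* alignedData, DWORD alignedSize) {
    HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, nullptr, CREATE_ALWAYS,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                           nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return false;
    DWORD written = 0;
    BOOL ok = WriteFile(h, alignedData, alignedSize, &written, nullptr);
    CloseHandle(h);
    return ok && written == alignedSize;
}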
I'm writing a video player. For the audio part I'm using XAudio2, with a separate thread that waits for the BufferEnd event, then fills the buffer with new data and calls SubmitSourceBuffer.
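The refill loop looks roughly like this (simplified sketch; decodeNextChunk() stands in for my decoder, and bufferEndEvent is assumed to be set from the voice callback's OnBufferEnd):

#include <windows.h>
#include <xaudio2.h>

bool decodeNextChunk(BYTE* dst, size_t n);  // placeholder for the decoder

void feederLoop(IXAudio2SourceVoice* voice, HANDLE bufferEndEvent) {
    BYTE chunk[1024];  // the small buffer size in question
    while (decodeNextChunk(chunk, sizeof(chunk))) {
        XAUDIO2_BUFFER buf = {};
        buf.AudioBytes = sizeof(chunk);
        buf.pAudioData = chunk;
        voice->SubmitSourceBuffer(&buf);
        // Wait until XAudio2 reports the buffer has finished playing,
        // then refill; chunk stays valid until OnBufferEnd fires.
        WaitForSingleObject(bufferEndEvent, INFINITE);
    }
}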
The problem is that XAudio2 (the driver or sound card) has huge delays before playing the next buffer if the buffer size is small (1024 bytes). I made measurements, and XAudio2 takes up to twice as long to play such a chunk. (A 1024-byte chunk of 48 kHz raw 2-channel PCM should play in roughly 5 ms, but on my computer it takes up to 10 ms.) There are nearly no delays if I make the buffer 4 KB or more.
I need such a small buffer to be able to synchronize with the video clock or an external clock (like ffplay does). If I make my buffer too big, the end user will hear a lot of noise in the output due to the synchronization adjustments.
I have also measured all my functions that decode and synchronize audio, and anything else that could block or produce delays; they take 0 or 1 ms to execute, so they are definitely not the problem.
Does anybody know what this could be and why it's happening? Can anyone check whether they see the same delay problems with a small buffer?
I've not experienced any delay or pause using .wav files. If you are using the mp3 format, it may add silence at the beginning and end of the sound during compression, thus causing a delay when your sound plays. See this post for more information.
I am developing an application that needs to read back the whole frame from the front buffer of an OpenGL application. I can hijack the application's OpenGL library and insert my code on swapbuffers. At the moment I am successfully using a simple but excruciatingly slow glReadPixels command without PBOs.
Now I have read about using multiple PBOs to speed things up. While I think I've found enough resources to actually program that (it isn't that hard), I have some operational questions left. I would do something like this:
1. Create a series (e.g. 3) of PBOs.
2. Use glReadPixels in my swapBuffers override to read data from the front buffer into a PBO (should be fast and non-blocking, right?).
3. Create a separate thread to call glMapBufferARB, once per PBO after a glReadPixels, because this will block until the pixels are in client memory.
4. Process the data from step 3.
Now my main concern is of course with steps 2 and 3. I have read that glReadPixels into a PBO is non-blocking; will this be an issue if I issue new OpenGL commands very soon afterwards? Will those OpenGL commands block, or will they continue (my guess)? If they continue, I suppose only swapbuffers can be a problem: will it stall, will glReadPixels from the front buffer be many times faster than swapping (about every 15-30 ms), or, worst case, will swapbuffers execute while glReadPixels is still reading data into the PBO? My current guess is that the logic does something like this: copy FRONT_BUFFER -> generic place in VRAM, then copy VRAM -> RAM. But I have no idea which of those two is the real bottleneck, and moreover, what the influence on the normal OpenGL command stream is.
Then, in step 3: is it wise to do this asynchronously, in a thread separate from the normal OpenGL logic? At the moment I think not; it seems you have to restore buffer operations to normal after doing this, and I can't install synchronization objects in the original code to temporarily block those. So I think my best option is to define a certain swapbuffer delay before reading the PBOs out, e.g. calling glReadPixels on PBO i%3 and glMapBufferARB on PBO (i+2)%3 in the same thread, resulting in a delay of 2 frames. Also, when I call glMapBufferARB to use the data in client memory, will that be the bottleneck, or will the (asynchronous) glReadPixels be?
And finally, if you have better ideas to speed up frame readback from the GPU in OpenGL, please tell me, because this is a painful bottleneck in my current system.
I hope my question is clear enough. I know the answer is probably also somewhere on the internet, but I mostly came up with results that use PBOs to keep buffers in video memory and do the processing there. I really need to read the front buffer back to RAM, and I cannot find any clear explanation of performance in that case (which I need: I cannot rely on "it's faster", I need to explain why it's faster).
Thank you
Are you sure you want to read from the front buffer? You do not own this buffer, and depending on your OS it might be destroyed, e.g., by another window on top of it.
For your use case, people typically do:
draw N
start PBO read N from back buffer
draw N+1
start PBO read N+1
sync PBO read N
process N
...
from a single thread.
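A sketch of that pattern with two PBOs (width/height, drawScene() and process() are placeholders):

GLuint pbo[2];
glGenBuffers(2, pbo);
for (int i = 0; i < 2; ++i) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);
}

for (int frame = 0; !done; ++frame) {
    drawScene();                                      // draw N
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[frame % 2]);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);  // start read N

    if (frame > 0) {
        // Map the other PBO: this syncs read N-1, which had a whole
        // frame to complete, so it should rarely stall.
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[(frame + 1) % 2]);
        void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
        process(data);                                // process N-1
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    swapBuffers();
}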