STFT / sliding FFT on real-time data - c++

I recently picked up a project where I need to perform real-time sliding FFT analysis on incoming microphone data. The environment I picked for this is C++ with Cinder and OpenGL.
This is my first experience in audio programming and I am a little bit confused.
This is what I am trying to achieve in my OpenGL application:
So in every app frame, a chunk of the incoming data arrives. In a for-loop (therefore multiple passes), a window of the present data is consumed and an FFT is performed on it. On each subsequent iteration, the window advances by the hop size through the data, and so on until the end of the data is reached.
Now this process must be contiguous. But as you can see in the figure above, as soon as my current app frame ends and the next frame's data comes in, I can't pick up where I left off in the previous frame (because that data is already gone). This is the blue area between two frames in the figure.
Now you may say: pick the window size / hop size so that this never happens. But that is impossible, since these parameters must be left user-configurable in my project.
Suggestions for this kind of processing, oriented towards C++11, are also very welcome!
Thanks!

Not sure I understand your scenario 100%, but it sounds like you may want to use a circular buffer. There is no "standard" circular buffer, but there is one in Boost.
However, you'd need a lock if you plan to do the processing with two threads. One thread, for example, would wait on the audio input, take the buffer lock, and copy from the audio buffer to the circular buffer. The second thread would periodically take the buffer lock and read the next k elements, if there are at least k available in the buffer...
You'd need to size the buffer appropriately, and make sure you always handle the data faster than it comes in, to avoid losses in the circular buffer...
Not sure why you mention lock-free buffers or whether that is a requirement. I'd try the circular buffer with locks first, as it is conceptually simpler, and only go lock-free if you have to, because the data structure could be more complicated in that case (though a "producer-consumer" lock-free queue might work)...
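For example, a minimal sketch with boost::circular_buffer and a plain mutex (the buffer size and the function names here are placeholders, not part of any real API):

#include <boost/circular_buffer.hpp>
#include <cstddef>
#include <mutex>
#include <vector>

std::mutex bufferMutex;
boost::circular_buffer<float> samples(8192); // size for the worst-case backlog

// Producer: called from the audio-input thread for each incoming block.
void onAudioInput(const float* data, std::size_t n) {
    std::lock_guard<std::mutex> lock(bufferMutex);
    samples.insert(samples.end(), data, data + n); // oldest samples drop off if we fall behind
}

// Consumer: called periodically from the processing thread. Copies out one
// window of `count` samples and advances by `hop`, if enough are available.
bool readWindow(std::vector<float>& out, std::size_t count, std::size_t hop) {
    std::lock_guard<std::mutex> lock(bufferMutex);
    if (samples.size() < count)
        return false;
    out.assign(samples.begin(), samples.begin() + count);
    samples.erase_begin(hop); // keep the overlap for the next window
    return true;
}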
HTH.

Thanks for posting a graphic--that illustrates the problem nicely.
All you really need here is a buffer of size (window - 1) where you can store zero or more samples from the "previous" frame for processing in the "next" one. In C++ this would be:
std::vector<Sample> interframeBuffer;
interframeBuffer.reserve(windowSize - 1);
Then, when you are within windowSize samples of the end of the current frame, rather than processing the samples you store them with interframeBuffer.push_back(sample). When you start processing the next frame, you first do:
for (const Sample& sample : interframeBuffer) {
process(sample);
}
interframeBuffer.clear();
You should use a single vector the whole time, clearing and repopulating it as needed, to avoid memory allocation. That's why we call reserve() at the top--to avoid latency later on. Calling clear() doesn't release the memory; it just resets size() to zero.
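For completeness, here is a sketch that folds the window/hop sliding into the same carry-over pattern; doFFT(), windowSize and hopSize stand in for your own analysis routine and user-configurable parameters:

#include <cstddef>
#include <vector>

void doFFT(const float* window, std::size_t n); // placeholder for the analysis

std::vector<float> carry; // samples carried across app-frame boundaries

void onFrame(const float* data, std::size_t n,
             std::size_t windowSize, std::size_t hopSize) {
    carry.insert(carry.end(), data, data + n);
    std::size_t pos = 0;
    while (pos + windowSize <= carry.size()) {
        doFFT(&carry[pos], windowSize); // one STFT window
        pos += hopSize;
    }
    // Keep the unconsumed tail; the next frame picks up exactly here.
    carry.erase(carry.begin(), carry.begin() + pos);
}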

Related

Seeking within MP3 file

I am working on driving software for the hardware implementation by these people. The decoder works properly overall, but I am struggling to make it start playing from the middle of a file. I suspect this is a common property of MP3 decoders, as they need some history of data in order to properly reconstruct the current sound (I am not that skilled in MPEG, but I have an idea of the basics).
The problem is that this decoder is a black box, and digging into its code would take enormous time and effort.
I empirically found that the sound garbage, when starting somewhere in the middle, lasts no more than one second after the start with a file at 320 kbps and a 44100 Hz sampling rate. I am actually OK with muting the decoder for a second (while it gathers/decodes the data required for further playback), and then unmuting it to continue playback.
I searched the internet on the matter and did not find anything useful. I tried to invalidate the first frames by corrupting the frame headers (the easiest thing to do without going into the MP3 headers/data), which made things even worse.
Questions:
Is there any body of knowledge on how players perform seeks in MP3 files while keeping the sound uncorrupted?
Does my action plan seem valid: mute for one second while the decoder plays garbage? Is there any way to (easily) calculate the time I must mute the output for?
Update: I just tried another file at 128 kbps/48 kHz and the maximal garbage time was about 2 seconds... I cannot believe that a decoder with such limited resources (the input buffer is 2 kB, and with intermediate working buffers the total must be no more than 36 kB) can keep 2 seconds of history; perhaps the decoder has problems finding the sync word in the stream... in which case my driver needs to figure out the frame start itself (by finding the sync word, reading the frame header, calculating the frame size, and checking that another sync word follows the frame).
I've found workarounds. The difficulty was that there were actually two problems overlaying each other, but they were easy to cope with using a structured approach.
The decoder has issues catching the first sync word of the stream, and works very well when the first bytes supplied to it are FF FB or FF FA. Any other starting bytes - in the middle of a frame - very probably cause major sound corruption until the decoder catches the correct sync. Thus I wrote code that seeks to the next frame start after the seek point, verifying that it is an actual frame start by calculating the frame size and checking that the next frame begins with FF FB/FA.
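A sketch of that validation, assuming MPEG-1 Layer III CBR and the standard frame-size formula 144 * bitrate / sampleRate + padding; bitrateFromHeader() and sampleRateFromHeader() are hypothetical helpers for parsing the 4-byte header:

#include <cstddef>
#include <cstdint>

// Hypothetical helpers: decode the bitrate (bit/s) and sample rate (Hz)
// fields of an MPEG-1 Layer III frame header.
int bitrateFromHeader(const uint8_t* hdr);
int sampleRateFromHeader(const uint8_t* hdr);

// Returns the offset of the next verified frame start at or after `pos`,
// or -1 if none is found before the end of the buffer.
long findNextFrame(const uint8_t* buf, size_t size, size_t pos) {
    for (size_t i = pos; i + 4 <= size; ++i) {
        if (buf[i] != 0xFF || (buf[i + 1] != 0xFB && buf[i + 1] != 0xFA))
            continue; // not a sync word
        int bitrate = bitrateFromHeader(buf + i);
        int sampleRate = sampleRateFromHeader(buf + i);
        if (bitrate <= 0 || sampleRate <= 0)
            continue; // unparsable header, keep scanning
        size_t padding = (buf[i + 2] >> 1) & 1;
        size_t frameSize = 144 * (size_t)bitrate / (size_t)sampleRate + padding;
        // Accept only if another sync word sits exactly one frame ahead.
        if (i + frameSize + 2 <= size &&
            buf[i + frameSize] == 0xFF &&
            (buf[i + frameSize + 1] == 0xFB || buf[i + frameSize + 1] == 0xFA))
            return (long)i;
    }
    return -1;
}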
Having fixed problem 1, I was left with minor corruption from the decoder starting to decode a frame without historical data. I solved it by muting the decoder for the first 4 buffering transactions.
Major corruption still happens, but rarely, and it seems the nature of the corruption depends on what was in the decoder's buffers (not only the Huffman input buffer, but other intermediate buffers) before the decoder is instructed to start. My hardware clears the input buffers to 0 while the decoder is in the reset state, but that seems to be not enough (or simply incorrect)...
The decoder itself is a kind of proof-of-concept work, a student term project with the aim of proving they could build it; the package has test bench code, but lacks low-level documentation/comments in the code, and is not ready for field implementation and production. In general, the fact that it worked for me (almost) out of the box at all is a credit to the developers and a mark of the high quality of their work. I have reviewed and tried several published MP3 decoder projects for silicon implementation (FPGA) and concluded that this one is the best available. In addition, the license they provide their work under is a generous one.
Update: my research has shown that the main problem lies not in the input buffer (although it is possible to improve the situation by uploading 528 bytes of historical data to the decoder's buffer so that it can grab main data from the previous frame), but in the internal state of the decoder. Its documentation says:
To reduce resource usage, part of the RAM for buffering the intermediate data is also shared with Huffman decoding as bit reservoir ...
thus it is the contents of the reservoir and the intermediate computed data that affect the decoding. I confirmed this by playing various sets of frames in different sequences: when the same frames are played in a different order, the nature of the garbage changes, or the garbage may simply not appear.
Thus, unfortunately, my conclusion: it is not possible to seek properly using this decoder as-is. I do not even think it is possible to "fake" playback (to quickly "play" the file up to the needed point in the buffers), as all three clocks are tied to each other.
I will keep my "best tested" implementation, with notes on its quality.
Update 2: I was wrong; it is possible to seek softly, but to mitigate the sound corruption (yes, I am still unsure whether I have fixed it completely) I had to find another deficiency in the decoder. It is related to timing: the decoder assumes that further data is always available in the buffer, while it may not be there yet. (This is actually clear from the test bench code supplied with the IP - the way data was replenished during QA and testing.) In the cases where I caught the corruption, the first frames in the first half of the input buffer RAM were not decoded properly but skipped, and the decoder quickly skipped ahead to the second half of the RAM, assuming new data was there; however, the driving hardware was not yet ready fetching the required data and putting it into the second half of the decoder's buffer RAM, so the corruption persisted for quite a long time, with the decoder looping and skipping "invalid" frames until it caught a correct image of a frame and normalized its pace through the buffer.
Now the solution:
1. Play (almost) 5 frames of silence through the decoder before unmuting it. This ensures all of the decoder's internal buffers are purged. It does not take much time, but requires some coding.
2. Introduce the possibility of setting the Huffman decoder's starting pointer readptr (in huffctl.v) to a value other than 0 after reset. This gives the flexibility to upload some history data into the decoder's buffer and start the Huffman decoder from the middle of the buffer rather than from its very start.
3. Calculate the position to seek to. For MPEG-1 Layer-3 it is relatively easy: duration = (fileSize - ID3size) / (bitrate/8*1000) and newPosition = ID3size + seekTime * (bitrate/8*1000); see the sketch at the end of this answer. The duration is needed to check that the seek position fits within the play time; alternatively, newPosition can be checked against the file size. These formulas do not take into account older tag versions appearing at the end of the file, but those are usually no more than 128 bytes and thus negligible for the timing calculation relative to the size of an average MP3 file. They also assume CBR (VBR requires a completely different approach, needing more processing power and data I/O for accurate seeking). Funnily enough, I found web pages with incorrect duration formulas, so beware of posts by ignorant people with cool job titles.
4. Seek to the calculated position, find the next frame start from that position on, calculate the frame size, and ensure that another valid frame sits at that distance. The new pointer will point to this next frame.
5. Find the main_data_begin lookback pointer of the frame located in step 4. Decrease the new pointer by this value, so that it points within the previous frame at the start of the main data for the current frame - this will be the starting pointer for the decoder's data. Note that this fails if the main data begins more than one frame back (removing the headers of the earlier frame(s) would be required for proper operation).
6. Fill the decoder's buffer starting from the pointer identified in step 5, and set the decoder's decoding start pointer to the one identified in step 4. While the implementation assumes you fill the buffer in halves, do it differently at the start: fill the whole buffer instead of just the first half. For this, after reset, set the idle bit, check for a data request, reset the idle bit, perform two 1024-byte transfers to the decoder's buffer (effectively filling it completely), then set the idle bit, reset it, and set it again.
7. After performing step 6, continue normally, replenishing 1024 bytes per decoder request.
Employing this plan I had zero sound corruption cases. As you can see, it requires some changes to the Verilog, but that should be easy if you know the basics of hardware, know Verilog, and can do some reverse engineering.
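For reference, the step-3 arithmetic as a minimal sketch (CBR only; bitrateKbps is the frame bitrate in kbit/s, and id3Size is assumed to be parsed already):

// CBR duration and seek-position arithmetic from step 3.
double durationSec(long fileSize, long id3Size, int bitrateKbps) {
    return (double)(fileSize - id3Size) / (bitrateKbps / 8.0 * 1000.0);
}

long seekPosition(long id3Size, double seekTimeSec, int bitrateKbps) {
    return id3Size + (long)(seekTimeSec * (bitrateKbps / 8.0 * 1000.0));
}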

C++ manage large buffer of images

I have a C++ application that reads and processes a video stream. I have two threads: one thread to read the stream and a second thread to process it. I access the stream with OpenCV's VideoCapture and put frames (cv::Mat) into a readerwriterqueue buffer. From the other thread, I read the frames from the buffer and process them.
Sometimes processing takes a long time and the processing thread starts to fall behind (while frames are put into the queue at the same speed). The buffer then grows and may eventually take all available memory and hang the whole system. I know that Windows uses the pagefile when there is not enough RAM, but the system still becomes pretty laggy. I need to make sure this cannot happen.
I thought about setting a limit on the buffer size and offloading frames to disk when the buffer is full, then reading them back into the queue when there is space. Would that work? Are there any good alternatives? How would one handle such a problem? Is my current approach (an image queue) valid? Please advise.
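One way to make the size limit concrete is a bounded queue with blocking backpressure; a minimal sketch, assuming it is acceptable to stall the reader thread when the limit is hit (maxItems is a tuning placeholder, and T would be cv::Mat here):

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t maxItems) : maxItems_(maxItems) {}

    // Blocks the producer while the queue is full, capping memory use.
    void push(T item) {
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [&] { return q_.size() < maxItems_; });
        q_.push(std::move(item));
        notEmpty_.notify_one();
    }

    // Blocks the consumer while the queue is empty.
    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [&] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        notFull_.notify_one();
        return item;
    }

private:
    std::queue<T> q_;
    const std::size_t maxItems_;
    std::mutex m_;
    std::condition_variable notEmpty_, notFull_;
};

If the capture must never stall, push() could instead drop the oldest frame on overflow, or spill frames to disk as suggested above.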

Allocating a new buffer per each frame to prevent screen tearing

When I use the SDL library to set pixel values in memory and update the screen, screen tearing occurs whenever the update is too fast. I don't know much about SDL's internals, but my understanding from what I see is that:
The update function returns right after signalling the graphics hardware to read the pixel data from (say) buffer1.
The next frame is painted in buffer2, and update is called again, but this happened too fast and the read from buffer1 still hasn't completed;
My program doesn't know anything about the hardware and assumes that it's okay to paint again into buffer1, while this buffer is being sent to the monitor.
The screen is torn.
This isn't a big problem when the velocity of the object being painted is not too high. The screen still tears, but it is almost invisible to the human eye. Still, I'd be happy if the tearing did not occur at all. I dislike vertical sync, as it adds consistent latency to each frame.
My idea is that a new screen buffer could probably be allocated for each frame to be painted. When the monitor wants to display something, it would read from the newest buffer.
Is this approach already used in practice? If I want to test my idea, what kind of low-level, cross-platform library or API might I use? SDL? OpenGL?
Do you think that updating the screen faster than the human eye can perceive is productive? If you really must have your engine 100% independent of the retrace, use a triple-buffer system: one buffer to display, and two buffers to update back and forth until the screen is ready for the next buffer. Triple is as high as you need to go, because if you fill the second back buffer, you can just write over the now-defunct first back buffer instead. No GPU lag, and only three buffers.
Here is a nice link describing this technique along with some warnings about using it on modern GPUs...
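A sketch of the index bookkeeping behind that scheme; displayReady() stands in for whatever "ready for the next buffer" signal your platform exposes:

#include <utility>

bool displayReady(); // placeholder: true when the display can take a new buffer

int displayed = 0;          // buffer currently shown on screen
int drawing = 1;            // buffer currently being rendered into
int spare = 2;              // free slot, or the newest completed frame
bool spareHasFrame = false;

void frameFinished() {
    // The just-finished frame waits in `spare`; rendering continues in the
    // old spare. If the display never took the previous pending frame, that
    // frame is simply overwritten (the "defunct back buffer" case above).
    std::swap(drawing, spare);
    spareHasFrame = true;
}

void tryPresent() {
    if (spareHasFrame && displayReady()) {
        std::swap(displayed, spare); // newest frame goes on screen
        spareHasFrame = false;       // old displayed buffer becomes the spare
    }
}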

What is fastest algorithm or method displaying line images from line scan camera

We have a line scan camera that produces 300 line images per second. We want to display the lines in an image view in FIFO fashion, so that the last line of the view shows the most recent line image while previous lines are shifted up on each update.
If I could access video memory in C like in the old days, I would just do:
memcpy(videoMem, videoMem+lineWidth*pixelSize, pixelSize*lineWidth*(nLines-1));
memcpy(videoMem+pixelSize*lineWidth*(nLines-1),newLine,lineWidth*pixelSize);
But I don't know if this would be the best I could do even with direct video memory access.
Now I understand it's neither possible nor desirable to access video memory directly. In that case, what is the best method? Any opinions from experts would be appreciated.
It is a desktop PC application on Windows 7.
Update
As I expected, it seems that I have to deal with some kind of circular buffer. The tricky part in my case is that writing to the buffer is line-by-line while reading is screen-by-screen, so when the write pointer reaches the physical end of the buffer, an additional memory copy is needed to pass the screen memory to video. I guess a Bip buffer would be a solution for this. Any other ideas?
You cannot memcpy overlapping memory; that is the purpose of memmove. Nevertheless, you can use memcpy as long as the copy occurs in the right order. Try it on your platform to see if it works.
The main implementation issue is whether the two separate writes cause flicker. If they do, you have to write the new image to a buffer first and then write the entire buffer to video memory all at once.
Generally speaking, you don't read video memory. The data to be displayed should be in its own region of memory. Summing up, you have 3 areas of memory:
data to be displayed
display buffer
video memory (or its equivalent)
The standard process is to write 1->2, then 2->3 in one step. If you get no flicker, however, you can go directly 1->3 with no intermediate buffer. Other than this, there is no magic algorithm beyond what you have written.
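A sketch of the circular line buffer from the update above: lines are written one at a time, and a whole screen is unrolled into the display buffer with at most two copies (lineBytes corresponds to lineWidth*pixelSize in the question):

#include <cstring>
#include <vector>

struct LineRing {
    std::vector<unsigned char> ring;
    size_t lineBytes, nLines, head = 0; // head = index of the oldest line

    LineRing(size_t lineBytes, size_t nLines)
        : ring(lineBytes * nLines), lineBytes(lineBytes), nLines(nLines) {}

    void pushLine(const unsigned char* line) {
        std::memcpy(&ring[head * lineBytes], line, lineBytes);
        head = (head + 1) % nLines; // the oldest line is overwritten next
    }

    // Unrolls the ring into dst (nLines * lineBytes) so the oldest line comes
    // first and the newest lands at the bottom: at most two memcpy calls.
    void copyToScreen(unsigned char* dst) const {
        size_t tail = nLines - head; // lines from head to the physical end
        std::memcpy(dst, &ring[head * lineBytes], tail * lineBytes);
        std::memcpy(dst + tail * lineBytes, &ring[0], head * lineBytes);
    }
};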

Asynchronous readback from opengl front buffer using multiple PBO's

I am developing an application that needs to read back whole frames from the front buffer of an OpenGL application. I can hijack the application's OpenGL library and insert my code at swapbuffers. At the moment I am successfully using a simple but excruciatingly slow glReadPixels call without PBOs.
Now I have read about using multiple PBOs to speed things up. While I think I've found enough resources to actually program that (it isn't that hard), I have some operational questions left. I would do something like this:
create a series of (e.g. 3) PBOs
use glReadPixels in my swapBuffers override to read data from the front buffer into a PBO (should be fast and non-blocking, right?)
create a separate thread to call glMapBufferARB, once per PBO after each glReadPixels, because this call blocks until the pixels are in client memory
Process the data from step 3.
Now my main concern is of course with steps 2 and 3. I read that glReadPixels into a PBO is non-blocking; will there be an issue if I issue new OpenGL commands right after it? Will those OpenGL commands block? Or will they continue (my guess)? If so, I suppose only swapbuffers can be a problem: will it stall, or will glReadPixels from the front buffer be many times faster than swapping (about every 15-30 ms)? Or, worst case, will swapbuffers execute while glReadPixels is still reading data into the PBO? My current guess is that the logic does something like this: copy FRONT_BUFFER -> generic place in VRAM, then copy VRAM -> RAM. But I have no idea which of those two is the real bottleneck, nor what the influence on the normal OpenGL command stream is.
Then in step 3: is it wise to do this asynchronously in a thread separate from the normal OpenGL logic? At the moment I think not, since it seems you have to restore buffer operations to normal afterwards, and I can't install synchronization objects in the original code to temporarily block them. So I think my best option is to introduce a fixed swapbuffers delay before reading the PBOs out, e.g. calling glReadPixels on PBO i%3 and glMapBufferARB on PBO (i+2)%3 in the same thread, resulting in a delay of 2 frames. Also, when I call glMapBufferARB to use the data in client memory, will that be the bottleneck, or will the (asynchronous) glReadPixels be?
And finally, if you have better ideas to speed up frame readback from the GPU in OpenGL, please tell me, because this is a painful bottleneck in my current system.
I hope my question is clear enough. I know the answer is probably also somewhere on the internet, but I mostly found results that use PBOs to keep buffers in video memory and do the processing there. I really need to read the front buffer back to RAM, and I cannot find any clear explanation of performance in that case (which I need; I cannot rely on "it's faster", I need to explain why it's faster).
Thank you
Are you sure you want to read from the front buffer? You do not own this buffer, and depending on your OS it might be destroyed, e.g., by another window on top of it.
For your use case, people typically do
draw N
start PBO read N from back buffer
draw N+1
start PBO read N+1
sync PBO read N
process N
...
from a single thread.
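A minimal sketch of that round-robin with two PBOs in a single thread; width, height and process() are placeholders, and the buffer-object entry points are assumed to be loaded by your usual loader:

#include <cstddef>
#include <GL/glew.h> // or whichever loader the hooked application uses

extern int width, height;                      // captured frame size
void process(const void* data, size_t bytes);  // placeholder for your handling

GLuint pbo[2];

void initPbos() {
    const size_t bytes = (size_t)width * height * 4;
    glGenBuffers(2, pbo);
    for (int i = 0; i < 2; ++i) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER, bytes, nullptr, GL_STREAM_READ);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

// Called from the swapbuffers hook. Frame N starts a read into pbo[N % 2];
// the PBO filled on frame N-1 is mapped, by which time its transfer has
// usually completed, so the map rarely blocks. (Frame 0 maps undefined data;
// skip it in real code.)
void onSwapBuffers(int frame) {
    const size_t bytes = (size_t)width * height * 4;
    int cur = frame % 2, prev = (frame + 1) % 2;

    glReadBuffer(GL_BACK); // read the back buffer, per the advice above
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[cur]);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[prev]);
    if (void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)) {
        process(data, bytes);
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}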